NVIDIA Keynote at AI Infra Summit 2025: Advancing Innovation in AI infrastructure

[00:00:00] I am delighted to welcome to the stage Ian Buck, VP of Hyperscale and High Performance Computing [00:00:05] at NVIDIA for his keynote on Advancing Innovation in AI Infrastructure. Welcome, Ian. [00:00:11] Well, good morning, everyone, and thank you to YJ. We spent a lot of time together. Building [00:00:15] AI infrastructure is a challenging and brave endeavor, and building infrastructure that works [00:00:24] not only for a single rack, but for an entire data center with that much compute to keep all those [00:00:29] researchers active and busy and humming along is a challenge. And of course, as you know [00:00:34] now, we're actually building those data centers, the next generation data centers, and every [00:00:38] year we have a new platform, and that's just the pace at which AI is going. So here at the [00:00:44] AI Infra Summit, I wanted to talk through some of the perspectives of how NVIDIA thinks [00:00:48] about this market, particularly around inference, which is a highlight and a focus of so many [00:00:53] of us today. You know, inference is a pretty complicated landscape. Often people think of [00:01:01] training as being hard. Certainly it's the 100,000 GPUs that we just talked about, but inference [00:01:08] itself has so many vectors of optimization and tradeoffs people have to consider. There's the [00:01:14] intelligence, how big of a model I want to deploy. It goes directly to the value of the model, but a bigger [00:01:19] model costs more to run. The responsiveness of the model, how much compute should I put behind a [00:01:25] particular model to make it more interactive, faster, more tokens per second for the user, or more tokens [00:01:31] per second for my entire data center, are tradeoffs that we can make on how much infrastructure, how much [00:01:37] revenue versus how much experience I want to provide for every user. We have cost. 
Obviously, there's a [00:01:42] litany of different hardware platforms, configurations from H100 to B200 to GB200 and GB300, and picking the [00:01:51] right cost and the infrastructure you want to put into it to build and run those models. [00:01:57] There's a tradeoff between throughput and tokens per second per user. I can serve a model incredibly [00:02:04] fast if I throw a massive number of GPUs at one query, but my data center generates revenue for all my [00:02:11] queries. In that tradeoff, the more I'm spending on one person's queries, the fewer total queries I can [00:02:18] do for all the users. We have to think about our tradeoffs in that Pareto. And finally, energy [00:02:23] efficiency. Our data centers are measured not necessarily in square footage, but in megawatts or [00:02:30] gigawatts. So understanding how energy-efficiently I can operate a single model matters. All of these [00:02:37] tradeoffs are tradeoffs we have to make in our infrastructure and what we decide to build and also [00:02:44] our roadmap and how we innovate and think about the future. Hardware takes a long time to build. [00:02:49] Silicon takes even longer. We have to think about all those tradeoffs and project out a year or two [00:02:55] years ahead in AI, which is nearly impossible. It's one of the most challenging parts of my job to think about [00:03:01] these tradeoffs and make the right bets. At NVIDIA, another way of thinking about it is looking at [00:03:08] this acronym which we call SMART. First, you have to think about the scale and complexity. What is the [00:03:15] scale and complexity of the infrastructure you're building and how can we make it scale and scale more [00:03:19] efficiently? We have to look at that multi-dimensional performance. 
For every decision, we have to look at the [00:03:24] different axes of how much intelligence, performance, throughput, cost, energy efficiency, and how those [00:03:31] different metrics weigh into the final end solution for an AI factory. In the end, it becomes a full [00:03:38] stack solution. You have the chip architecture, the rack, the node architecture, the rack architecture like you [00:03:45] saw, but then the multiple layers of software and software optimization that go on top. All of that comes [00:03:51] together to the solution. It's hardware plus software that creates the total end throughput and performance. [00:04:00] In the end, that delivers ROI. In inference especially, performance equals revenue. I'll give you some [00:04:08] examples of how the math works there, but performance really is paramount to generating tokens, [00:04:15] to getting those queries processed. It is performance to revenue, where training tends to be about [00:04:23] capability per cost. Finally, all that is happening not in a vacuum, but as a community. All of you here [00:04:31] who will help build the AI infra of the future, many of you representing the companies that are building [00:04:36] data centers or building hardware or building racks, connecting them all together and serving it, [00:04:42] but also the open software community. YJ talked about OCP, open source software like OpenAI Triton, [00:04:48] PyTorch, and the many other software stacks. All of these innovations are coming from the community. [00:04:54] Unlike many of the other computer revolutions of the past, AI is particularly open where researchers and [00:05:03] companies are actually openly sharing their ideas, publishing their results, blogging about how to [00:05:09] advance the future. And that's because it's a rising tide. As we all do that, we accelerate AI, [00:05:15] we increase the opportunity for more AI, more models, more capabilities, which feeds a virtuous cycle. 
[00:05:23] And NVIDIA, our contribution, again, is to continue to look at ways that we can increase the performance [00:05:30] per watt per dollar. And as a result, turn it into profit for an AI data center. The most recent innovations [00:05:38] and focus was around NVLink 72. We took what was originally a server box that had eight GPUs that were [00:05:46] NVLink connected, of H100 or H200 or B200, and disaggregated it. We moved NVLink out of the box and made it rack scale. [00:05:55] An incredibly challenging endeavor. It built some amazing infrastructure that's being deployed now and [00:06:00] being used now. We had to think and rethink about how we do ethernet scale. We want to scale up beyond the [00:06:08] traditional 10, 20, 50,000 GPUs connected with InfiniBand to the hundred thousand to million GPUs that want [00:06:16] to be connected with ethernet scale infrastructure. A very different kind of ethernet. An ethernet where [00:06:21] every GPU or every node wants to talk to every other node at full performance. This is not what [00:06:27] ethernet was designed to do. It was designed to be able to connect clients with servers and have a [00:06:35] usage pattern where it isn't everything talking to everything all the time. That required a new kind [00:06:40] of ethernet and a new kind of switch. And that's the basis for our Spectrum-X ethernet platform. [00:06:46] Numerical methods. AI is a statistical problem. It's really just giving it data and producing a [00:06:54] prediction. And using statistical techniques combined with numerics can open up the space of [00:07:03] numerical representation. Everything started with FP32, 32-bit floating point, we all know. Google, [00:07:10] 10 years ago, did BFloat16. IEEE has now added FP8, which we support, going all the way down now to [00:07:18] 4-bit floating point. In fact, there's multiple 4-bit floating point formats. 
And recently, in [00:07:24] Blackwell, we brought to market the NVFP4 format, which actually has a micro-tensor scaling to try to keep all [00:07:30] this math, the entire AI, operating in just 4 bits. To keep that statistically in range without things [00:07:39] capping, we have hardware capabilities to keep the calculation totally [00:07:45] in scale with constant biasing and re-biasing of the calculation. NVFP4 is being used for inference, [00:07:53] and actually more recently, we've been able to figure out the numerical algorithmic techniques to do NVFP4, [00:07:58] 4-bit floating point, for training as well. This was recently announced at Hot Chips. [00:08:04] Doing the math is easy; doing the numerics, incredibly complicated. Software is a big part of it. Of course, [00:08:12] NVIDIA is just one contributor to the software ecosystem. Specifically, back at GTC, we [00:08:19] announced NVIDIA Dynamo, and I'll talk more about that. Our ability to take inference and disaggregate [00:08:24] it across all those servers to be able to do context processing and generation and really scale up the [00:08:31] number of GPUs we can assign to a single model requires an entirely new kind of software stack for [00:08:38] serving models. In the end, this results in ROI. Blackwell delivers about 10 times the return on [00:08:46] investment. This is comparing the amount of dollars you spend on infrastructure to how much you can [00:08:52] generate in revenue in your AI factory. And of course, none of this happens alone. I manage teams that work [00:08:59] on PyTorch with YJ, also JAX and other software stacks like OpenAI Triton, but also for [00:09:07] inference, we have SGLang, vLLM. NVIDIA has its own inferencing library called TensorRT-LLM. 
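The micro-tensor scaling idea described for NVFP4 can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not NVIDIA's implementation: it assumes a block size of 16, uses the eight magnitudes representable by a 4-bit E2M1 float, and keeps each block's scale as an ordinary Python float rather than a reduced-precision value. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # assumed micro-block size for this sketch


def quantize_block(values):
    """Return (scale, codes); each code is a signed value from E2M1_GRID."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def quantize(tensor):
    """Quantize a flat list of floats block by block, one scale per block."""
    return [quantize_block(tensor[i:i + BLOCK]) for i in range(0, len(tensor), BLOCK)]


def dequantize(blocks):
    """Rebuild approximate floats from (scale, codes) pairs."""
    return [scale * code for scale, codes in blocks for code in codes]


# Two blocks with very different ranges: each gets its own scale,
# so the small values are not crushed by the large ones.
tensor = [0.01 * i for i in range(16)] + [100.0 * i for i in range(16)]
blocks = quantize(tensor)
restored = dequantize(blocks)
```

Because each block carries its own scale, the first block (values up to 0.15) and the second (values up to 1500) both use the full 4-bit grid. With a single scale for the whole tensor, the first block would round almost entirely to zero.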
[00:09:14] All this is happening in a community, an open community, and by contributing and supporting [00:09:19] and providing the foundational software for these technologies or allowing researchers and developers or [00:09:25] companies and clouds to be able to use it, we can elevate and bring all these technologies to market [00:09:31] kind of in real time at a pace of, well, accelerating pace. It's quite amazing. [00:09:43] Performance is a difficult thing to measure, particularly in inference. And, you know, [00:09:48] it's easy to make a claim, but how accurate is the model? I can take a model and I can run it and reduce [00:09:55] precision. But if I'm not delivering the same accuracy in a reduced precision [00:10:04] format and running faster, I might as well be running a cheaper model. MLPerf is a benchmark that's been [00:10:10] around for quite some time. In 2019, they started MLPerf Inference. This is where Google, NVIDIA, [00:10:19] Meta, and many other companies in the industry come together, create a forum where we peer review each [00:10:25] other's results to run benchmarks and actually prove that we can actually deliver a model with a certain [00:10:31] amount of performance, with a certain token rate for a particular user, and agree that this is the [00:10:37] established standard for measuring inference. Since 2019, NVIDIA has been submitting and running on all [00:10:45] the benchmarks going back to the Ampere architecture through Hopper to Blackwell. And, you know, [00:10:49] we've been continuously, every year, you know, delivering new benchmarks and raising the performance [00:10:58] of AI models and inference. We hold every single per-GPU record since the MLPerf [00:11:05] data center benchmark started. 
In fact, this last round, which just got announced today, I think [00:11:10] later on, you'll hear from David Kanter, we've recently added DeepSeek-R1, and the new Llama 3.1 405B and [00:11:17] Llama 2 benchmarks were announced just recently on the new Blackwell Ultra or GB300 platform. [00:11:24] We used NVFP4, we used the Dynamo software, we used the new TensorRT, and we fully [00:11:31] distributed the inference across the NVLink of the entire GB300 rack. In fact, [00:11:39] the software to do all that is quite complicated. And, literally, since we've launched Blackwell [00:11:45] last year to today, we have actually doubled the performance of Blackwell just with software. [00:11:51] It's the complexity of how you distribute the communication, apply the new numerical techniques [00:11:57] to improve performance, reduce cost, and get better throughput on MLPerf: 2x in software on the same [00:12:04] Blackwell, totally for free. The same story played out in Hopper. We actually quadrupled Hopper's [00:12:10] performance over its lifetime, on the same Hopper, four times faster just through software and applying [00:12:18] all the optimizations by NVIDIA engineers, but more importantly by the open source community and [00:12:23] researchers who were figuring out new ways to serve the same models at the same grade of accuracy, [00:12:29] just faster, cheaper, and generating more revenue. [00:12:35] This is kind of the math that is today. Actually, a $3 million GB200 NVL72 rack actually will [00:12:42] generate about $30 million in token revenue, to the point where actually a free GPU is not even cheap [00:12:48] enough, if you kind of compare, let's say, a previous generation or alternative platform with [00:12:54] a quarter of the performance. You can see at the bottom there, the gray is actually the GPU cost and [00:13:00] also the shell and server cost that all goes around it. 
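The arithmetic behind the rack figures quoted here can be written out as a short sketch. The $3 million cost, the roughly $30 million revenue and the quarter-performance alternative are from the talk. The assumption that revenue scales linearly with performance, and the variable names, are illustrative only.

```python
# Figures quoted in the talk, in millions of US dollars.
rack_cost = 3.0       # GB200 NVL72 rack
rack_revenue = 30.0   # approximate token revenue from that rack

# Alternative platform with a quarter of the performance (from the talk).
# Assumption: token revenue scales linearly with performance.
alt_relative_performance = 0.25


def roi(revenue, cost):
    """Revenue generated per dollar spent on infrastructure."""
    return revenue / cost


rack_roi = roi(rack_revenue, rack_cost)
rack_profit = rack_revenue - rack_cost

# Best case for the alternative: GPUs, shell and servers all cost nothing,
# so its profit can never exceed its revenue.
alt_revenue = rack_revenue * alt_relative_performance
alt_best_case_profit = alt_revenue - 0.0

print(f"ROI: {rack_roi:.0f}x, profit: ${rack_profit:.1f}M")
print(f"Quarter-performance platform, zero cost: at most ${alt_best_case_profit:.1f}M")
```

Under these assumptions the quarter-performance platform earns at most $7.5 million even at zero cost, compared with $27 million of profit for the rack after paying for it, which is the sense in which a free GPU is "not even cheap enough".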
If I even delete or remove the GPU cost, [00:13:07] you can see that the revenue we generate over the course of multiple years is demonstrable. This is [00:13:15] how we feel about inference. The performance of the platform is the revenue of an AI factory, [00:13:22] and Blackwell literally delivers a 10X on what you buy and what you make. [00:13:30] So let's talk a little bit more about inference and how it works and where the innovations are [00:13:34] happening right now in inference and what we're focused on. Inference is actually two different [00:13:38] workloads. On the left, on the right here, you get a query that comes in and you do traditional [00:13:44] inference serving. So then the first thing the AI model does is all the context processing. This is [00:13:50] literally the question that you may ask a ChatGPT or chatbot, but also all the other tokens that come [00:13:56] in that are unique to you or the system prompt. So these are things that you've asked in the past [00:14:02] or things that are natural to your query that should assist the AI in answering the question [00:14:07] that it's pulling from databases. It doesn't just look at your question, but all the other input [00:14:13] tokens. That's called the context or prefill phase. And then after it's processed all the query [00:14:19] and all the related data, all the input tokens, then it actually starts outputting tokens that you're [00:14:24] reading. And that's the generation phase, or decode. Typically, we do this over a cluster of GPUs, [00:14:30] depending on the model. It might be four GPUs, eight GPUs, or even multiple GPUs, depending on the model size [00:14:38] and the performance, but it's generally running one model across one set of GPUs. What's interesting, [00:14:44] though, is that the context processing and the generation are actually different. They're both [00:14:47] running the same model, but the context processing can be done in a massively parallel way. 
We can [00:14:52] process all the input tokens in parallel, whereas the generation tends to be autoregressive. [00:14:59] Every token gets outputted. You have to run it again to calculate the next token, to calculate the next [00:15:03] token in an autoregressive way, where processing 16K, 32K, even 100K input tokens can be processed in [00:15:12] parallel. As a result, you actually have this performance delta where the context is very compute-heavy [00:15:20] because we can do it in a massively parallel way. We can do it all at once, where the generation, decode, [00:15:24] because it's autoregressive, needs a combination of memory bandwidth, NVLink bandwidth, and compute in [00:15:30] order to quickly output tokens. If we stick on the same platform or one GPU, we end up having [00:15:37] to pick the best of both worlds, but sort of in the middle somewhere, but not necessarily the optimal [00:15:42] for these two workloads. Today, most modern data centers actually do disaggregated inferencing. So they [00:15:49] actually take that input query and they do the context processing on separate GPUs, creating what's [00:15:54] called the KV cache, basically up to just the first token, and then it hands off the KV cache to another [00:16:00] set of GPUs, which are optimized for generation. This allows us to actually split the number of GPUs [00:16:06] between context and generation, and actually dramatically improve the overall performance. [00:16:13] The NVIDIA Dynamo software is designed to do this. It's all open source. You can go check it out on GitHub. [00:16:19] All of our development is GitHub first, so you can see live check-ins. But by doing this optimization, [00:16:25] we can actually configure those GPUs and run the model with different AI kernels in a parallel compute [00:16:33] context prefill way. 
And then for the autoregressive part, configure it with different kernels, [00:16:37] different parallelization techniques for the fast autoregressive part. This overall increases the total [00:16:43] throughput with the same number of GPUs. And in fact, that will generate a six times improvement. [00:16:48] Just for Llama models alone, it's about two to four X faster. [00:16:54] Same number of GPUs, just doing disaggregated. It's a lot harder because you have to have two sets of [00:16:59] basically workloads running in parallel on the system, and also having this KV cache transfer [00:17:05] between the two platforms and keeping everything busy. [00:17:10] This is in production today. There's a company called Baseten, which is an inference [00:17:16] aggregator, basically a model-serving company. They have over 8,000 GPUs of both Hopper and Blackwell [00:17:23] spread around multiple clouds, including Google Cloud. When gpt-oss first launched, [00:17:30] they had the fastest inference performance of any cloud provider because they super optimized, using NVIDIA [00:17:38] Dynamo, that split between the context processing and the output generation. [00:17:43] This is an important example of how much software matters, and how it combines with the [00:17:48] infrastructure of a rack like NVL72 and GB200. Overall, disaggregation gives you about six times faster [00:18:00] first token on models like Qwen. We're seeing 3X faster token output on models like DeepSeek, [00:18:07] and basically turning inference into sort of a data center or inference scale kind of problem. [00:18:16] In addition to this, we're seeing that context processing is becoming more and more important and [00:18:23] higher and higher value. Most of the models you see today can accept up to about 256,000 input tokens, [00:18:31] and there's roughly, you know, two to three tokens per word. 
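The idea of splitting a fixed pool of GPUs between context processing and generation can be sketched as a small sizing calculation. This is a toy model and not how NVIDIA Dynamo works: it assumes each prefill GPU and each decode GPU handles a fixed number of requests per second, treats the two pools as a pipeline limited by its slower stage, and ignores KV cache transfer cost. The function name and the rates are hypothetical.

```python
def best_split(total_gpus, prefill_rate, decode_rate):
    """Return (prefill_gpus, decode_gpus, requests_per_second).

    prefill_rate and decode_rate are requests per second per GPU
    (hypothetical inputs). Throughput is limited by the slower pool.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU in each pool")
    best = None
    for prefill_gpus in range(1, total_gpus):
        decode_gpus = total_gpus - prefill_gpus
        throughput = min(prefill_gpus * prefill_rate, decode_gpus * decode_rate)
        if best is None or throughput > best[2]:
            best = (prefill_gpus, decode_gpus, throughput)
    return best


# Example: prefill is compute-bound and quick per request; decode is slower
# because every output token needs another pass through the model.
print(best_split(total_gpus=8, prefill_rate=6.0, decode_rate=2.0))
```

With these example rates the balanced point is two GPUs for context and six for generation. A serving system would also have to account for the KV cache handoff and for request mixes that change over time.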
So you can kind of get a sense of how much [00:18:37] input they can consume when you ask a question to a typical chatbot. But there's this slice of [00:18:43] workloads that actually really love having super long input tokens. One example of that is advanced [00:18:49] coding. We've heard of coding chatbots, basically to help you write code. Advanced coding chatbots take the [00:18:57] entire program and use the AI to add new functionality. Instead of helping you write like a [00:19:03] little loop of code or find and fix bugs, advanced AI coders can take 100,000 lines of code or a million input [00:19:11] tokens of code, and actually be able to output new functionality, entire code blocks, entire portions [00:19:16] of the application, to turn the AI really into a software agent that can interact with a software [00:19:22] developer in a totally different way. But you need to be able to process literally millions of tokens, [00:19:27] but the value is incredibly high because now I have basically a software developer that can amplify my [00:19:34] entire software developer workforce by like 10x because the AI is actually generating the initial [00:19:40] code that the developer can work from at that kind of functionality scale. The other use case [00:19:45] that's really hot right now is video processing and generation. Think of processing like an hour of HD video [00:19:51] and producing new video content. Generated video is a lot of data, millions of tokens. Today the video [00:19:58] generation market is about 4 billion dollars for AI video, and by the start of the next decade it's projected [00:20:06] to be a market of over 40 billion dollars. This is in the entertainment space and also in the media and [00:20:14] marketing and advertising space. One way to think about it is, you know, we used to live in a world where [00:20:20] when we came home and watched on our TVs, you know, it was whatever was on the TV. 
We moved to the digital era, [00:20:26] now we have on demand, we can watch whatever we want. And by the end of the decade we're basically [00:20:31] going to be on interactive media. It won't be whatever on demand we want, but all the interaction [00:20:38] we could do for entertainment could be interactive and of course be done through video. So having [00:20:44] these long context capabilities is really interesting, and whenever we see this kind of opportunity at [00:20:51] NVIDIA, where there's a high value market with a place where we're pushing the limits, like how big our input [00:20:59] context is, it's an opportunity for us to optimize further. So maybe there's a way we can actually [00:21:04] process or work on these high value large contexts instead, and maybe not use the same GPUs for context and [00:21:12] generation, but focus on bringing these large markets, these new capabilities, to market. And that's why [00:21:19] we announced this morning, and here specifically at this AI conference, a new kind of Rubin processor [00:21:28] dedicated for long context processing. This is the Rubin CPX GPU. It is a GPU specifically built for massive [00:21:38] context length processing for these high value use cases of million-scale token [00:21:47] processing. It's specifically optimized for context processing and still of course CUDA capable. [00:21:54] This is a new Rubin GPU which we haven't disclosed or talked about before, based on the same Rubin [00:21:59] architecture but a new instantiation. It has over 30 petaflops of NVFP4 and is all CUDA capable. We've actually tripled [00:22:08] down on the attention processing. Attention is the building block of many of the models we have [00:22:14] today, and we actually have added new attention acceleration cores to this chip, which is three times faster than [00:22:23] what we have in the current GB300 GPU. It is memory optimized. So a lot of the compute processing of [00:22:30] context is compute rich. 
It's less dependent on HBM bandwidth or memory bandwidth and less dependent on [00:22:37] having NVLink scalability. So we can use the standard GDDR7 memory that we use today in most of the GPUs [00:22:45] that are available in the market. And of course we doubled down on video. So we added four NVIDIA [00:22:53] video encoders and four decoders for processing and generating AI video content. And this will come [00:23:00] online at the end of 2026, right after our initial launch and availability of NVIDIA Rubin. So how do you [00:23:10] integrate this processor, this chip, this single-die Rubin, into the Vera Rubin rack? So here is Vera Rubin. [00:23:18] We announced this at GTC this year. It has over 3.6 exaflops of AI performance in a single [00:23:27] rack. This is coming to be available in the second half of 2026. As you can see there's the compute tray. [00:23:36] Each tray has four Rubins, two Vera CPUs and ConnectX-9 for the scale-out interconnect. It's a pretty [00:23:43] impressive platform. It's got over 3.3 times more compute in a rack than GB300, which is deploying today. [00:23:52] It will have 75 terabytes of fast memory and 1.4 petabytes of HBM4, and is quite an impressive rack in [00:24:01] itself. It all sits in the same rack architecture as GB300. Hopefully, to help YJ and others here to [00:24:07] deploy, it actually will fit in the same mechanical and space. And as you can see, there are 72 packaged [00:24:15] GPUs in a single rack. This is actually a dual-die GPU, so it's 144 [00:24:21] Rubins in a single rack. That's why we call it NVL144. But let's talk about CPX. We can actually just add CPX to this platform. [00:24:27] In fact, right down the bottom we have areas where we can insert additional context processors and [00:24:33] really boost this rack up for million-scale token processing. This is the Vera Rubin NVL144 CPX. 
[00:24:43] All we have done here is taken the same tray, the same architecture, but we've inserted eight of our Rubin [00:24:50] CPXs behind the Vera in connection with the ConnectX-9s, and those processors are available for the entire [00:24:58] rack to do context processing. And we just totally boosted up the performance of the rack. As you can see, [00:25:05] now we're up to eight exaflops, or seven and a half times what GB300 can do today. We've increased our memory [00:25:11] again to 1.7 to 3x. Our fast memory has increased further to 100 terabytes, and again all this will fit [00:25:19] nicely into the existing rack infrastructure, so that for customers that want to prioritize Vera Rubin for [00:25:24] million-token input contexts, this is a seamless way to upgrade or integrate into their data centers. [00:25:33] We also don't have to put the CPXs in the tray. We'll also be making a CPX-only compute tray version. [00:25:38] And in fact customers can actually just put it as a sidecar to their Vera Rubin [00:25:44] rack. On the left there's a new tray called VR CPX. As you can see, you have two Vera CPUs [00:25:50] and eight CPX processors connected with the same networking behind the scenes, and they can add a VR CPX [00:25:58] rack in their data center side by side, either one to one, one to N, whatever their mix between their context [00:26:04] processing and their output generation, where all the context processing is happening on the CPX rack [00:26:10] and all the token generation can be happening on VR. And of course a one to one ratio is fine, they [00:26:16] can mix it to two to one, or they can start with some and expand later. All that makes it possible. You [00:26:22] don't have to have them next to each other. The way context generation works is as soon as you have your [00:26:26] first token you just need to send that KV cache to your token generators, wherever they are in your [00:26:32] data center. 
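To get a feel for what sending the KV cache between a context rack and a generation rack involves, its size can be estimated with the usual formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape and the link bandwidth below are hypothetical values chosen for illustration. They are not figures from the talk or from any specific model or interconnect.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value):
    """Size of the KV cache for one request: keys plus values for every layer."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


def transfer_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time to move the cache over a link, ignoring latency and protocol overhead."""
    return size_bytes / bandwidth_bytes_per_s


# Hypothetical model shape: 80 layers, 8 KV heads of dimension 128, 16-bit values.
per_token = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=1, bytes_per_value=2)
million_tokens = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                                tokens=1_000_000, bytes_per_value=2)

# Hypothetical 100 GB/s link between the context and generation GPUs.
seconds = transfer_seconds(million_tokens, 100e9)

print(f"{per_token} bytes per token, {million_tokens / 1e9:.2f} GB for 1M tokens")
print(f"about {seconds:.2f} s to transfer at 100 GB/s")
```

For this hypothetical shape a million-token context produces a cache of roughly 328 GB, which suggests why the networking between context and generation processors matters once contexts reach this scale.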
Quite an upgrade, and running really fast. We've already been working with some of the [00:26:39] lighthouse customers who are super interested in long context. These are different AI innovators. We've all [00:26:45] heard of Cursor, who is probably one of the leaders in intelligent code generation, and NVIDIA uses Cursor along [00:26:53] with many others, and this will help them get to the next level of development productivity with those [00:26:58] million-input-token code generators. Magic, magic.dev, has a hundred-million-token [00:27:08] input model, quite impressive, and we're working with them to figure out how to get that working on CPX, [00:27:14] along with Runway, which is a company which generates cinematic video, and other leading [00:27:22] inference providers like Fireworks and Together AI, who have some of the most advanced techniques for [00:27:28] the fastest model serving, and how they can get to that next level of million-token inference. [00:27:38] So we've added another chip to our roadmap. You can see here, in Blackwell, [00:27:44] we have the Blackwell and Blackwell Ultra, the Grace CPU, our NVLink switch chip, the [00:27:52] Spectrum-5 switch and of course the CX8 NIC. All of these chips come together to make AI and AI infra [00:28:00] work. It's never just one chip, it's a family, and now with Rubin we've added the CPX processor, a different Rubin GPU [00:28:09] dedicated to and optimized for context processing, that'll be paired and matched with Rubin for the [00:28:16] million-scale context processing and fits nicely into the full family. And of course that'll [00:28:23] extend, and I look forward to talking more about Feynman when we get a little bit closer. 
[00:28:30] All this has to come together. AI is not served and data centers aren't built with one processor. They're [00:28:37] connected machines. They require CPUs, they require GPUs, they require various levels of accelerating the [00:28:44] network, and infrastructure at scale needs to work as one in order to serve these models and bring that [00:28:50] token value and that revenue that inference will generate all together in one. And that's what we're [00:28:57] focused on at NVIDIA: bringing all that infrastructure and the baseline software stack to market as quickly [00:29:03] as we possibly can. One challenge of course is how do we build those future data centers. I've been showing you a [00:29:11] lot of racks, we've shown a lot of chips. The next challenge of course is, what is that [00:29:18] future data center going to look like? YJ talked about how the CPU data centers have evolved and how they're so [00:29:24] different. And, you know, NVIDIA is also a huge proponent of open standards. We're members of OCP, we've [00:29:32] contributed the GB200 rack to OCP, and we'll do so for the upcoming infrastructure as [00:29:39] well. But the problem now is becoming a data center scale one. How can we build and provide, working [00:29:44] with the community, a data center roadmap, not just a rack and GPU roadmap, that's going to be designed for [00:29:51] the future and allows us to scale and grow? It's pushing the limits of power generation, mechanical, plumbing, [00:29:58] electrical, you know, bus bar design, row length, CDUs, and how are all these pieces and components going [00:30:05] to work together in order to run and have these data center factories work well and be future-proofed as we [00:30:11] scale out, not just for Vera Rubin but Vera Rubin Ultra and going on to Feynman. We've started a new [00:30:20] initiative called the AI factory gigascale reference design, and of course NVIDIA isn't doing this [00:30:27] by ourselves. We're working with
the entire community, from Cadence to Emerald AI to ETAP, GE Vernova, Jacobs, which is the main [00:30:35] engineering consulting firm that builds a lot of these data centers, Schneider Electric, Siemens and Vertiv, [00:30:42] to build the cooling, plumbing and electrical systems that can scale to deliver this kind [00:30:48] of future data center in a way that's future-proof. These take a long time to build and need to have a reference [00:30:54] architecture. All these components need to talk to each other: the CDUs and the power and [00:31:00] the data center operations have to work as one along with the GPUs and compute infrastructure so they [00:31:05] work seamlessly and maintain uptime and high efficiency. We're working with these partners now and we expect to [00:31:10] have the first version of the reference design done by our next upcoming GTC conference. That's my [00:31:18] update for today. Thank you, everybody. It was really exciting to launch CPX here at AI Infra, [00:31:24] and I look forward to the rest of the talks.
