
NVIDIA Keynote at AI Infra Summit 2025: Advancing Innovation in AI infrastructure

NVIDIA June 3, 2026 31m 4,987 words

About this transcript: This is a full AI-generated transcript of NVIDIA Keynote at AI Infra Summit 2025: Advancing Innovation in AI infrastructure from NVIDIA, published June 3, 2026. The transcript contains 4,987 words with timestamps and was generated using Whisper AI.

[00:00:00] I am delighted to welcome to the stage Ian Buck, VP of Hyperscale and High Performance Computing [00:00:05] at NVIDIA, for his keynote on Advancing Innovation in AI Infrastructure. Welcome, Ian. [00:00:11] Well, good morning, everyone, and thank you to YJ. We spent a lot of time together. Building [00:00:15] AI infrastructure is a challenging and brave endeavor, and building infrastructure that works [00:00:24] not only for a single rack, but for an entire data center with that much compute, to keep all those [00:00:29] researchers active and busy and humming along, is a challenge. And of course, as you know [00:00:34] now, we're actually building those data centers, the next generation data centers, and every [00:00:38] year we have a new platform, and that's just the pace at which AI is going. So here at the [00:00:44] AI Infra Summit, I wanted to talk through some of the perspectives of how NVIDIA thinks [00:00:48] about this market, particularly around inference, which is a highlight and a focus of so many [00:00:53] of us today. You know, inference is a pretty complicated landscape. Often people think of [00:01:01] training as being hard. Certainly it's the 100,000 GPUs that we just talked about, but inference [00:01:08] itself has so many vectors of optimization and tradeoffs people have to consider. There's the [00:01:14] intelligence, how big of a model I want to deploy. It goes directly to the value of the model, but a bigger [00:01:19] model costs more to run. The responsiveness of the model, how much compute should I put behind a [00:01:25] particular model to make it more interactive, faster, more tokens per second for the user, or more tokens [00:01:31] per second for my entire data center: those are tradeoffs that we can make on how much infrastructure, how much [00:01:37] revenue versus how much experience I want to provide for every user. We have cost. 
Obviously, there's a [00:01:42] litany of different hardware platforms, configurations from H100 to B200 to GB200 to GB300, and picking the [00:01:51] right cost and the infrastructure you want to use to build and run those models. [00:01:57] There's a tradeoff between throughput and tokens per second per user. I can serve a model incredibly [00:02:04] fast if I throw a massive number of GPUs at one query, but my data center generates revenue for all my [00:02:11] queries. In that tradeoff, the more I'm spending on one person's queries, the fewer total queries I can [00:02:18] do for all the users. We have to think about our tradeoffs in that Pareto. And finally, energy [00:02:23] efficiency. Our data centers are measured not necessarily in square footage, but in megawatts or [00:02:30] gigawatts. So understanding how much energy efficiency I can operate at for a single model matters. All of these [00:02:37] tradeoffs are tradeoffs we have to make in our infrastructure and what we decide to build and also [00:02:44] our roadmap and how we innovate and think about the future. Hardware takes a long time to build. [00:02:49] Silicon takes even longer. We have to think about all those tradeoffs and project out a year or two [00:02:55] years ahead in AI, which is nearly impossible. It's one of the most challenging parts of my job to think about [00:03:01] these tradeoffs and make the right bets. At NVIDIA, another way of thinking about it is looking at [00:03:08] this acronym which we call SMART. First, you have to think about the scale and complexity. What is the [00:03:15] scale and complexity of the infrastructure you're building and how can we make it scale and scale more [00:03:19] efficiently? We have to look at that multi-dimensional performance. 
For every decision, we have to look at the [00:03:24] different axes of how much intelligence, performance, throughput, cost, energy efficiency, and how those [00:03:31] different metrics weigh into the final end solution for an AI factory. In the end, it becomes a full [00:03:38] stack solution. You have the chip architecture, the node architecture, the rack architecture like you [00:03:45] saw, but then the multiple layers of software and software optimization that go on top. All of that comes [00:03:51] together to the solution. It's hardware plus software that creates the total end throughput and performance. [00:04:00] In the end, that delivers ROI. In inference especially, performance equals revenue. I'll give you some [00:04:08] examples of how the math works there, but performance really is paramount to generating tokens, [00:04:15] to getting those queries processed. It is performance to revenue, where training tends to be about [00:04:23] capability per cost. Finally, all that is happening not in a vacuum, but as a community. All of you here [00:04:31] who will help build the AI infra of the future, many of you representing the companies that are building [00:04:36] data centers or building hardware or building racks, connecting them all together and serving it, [00:04:42] but also the open software community. YJ talked about OCP, open source software like OpenAI Triton, [00:04:48] PyTorch, and the many other software stacks. All of these innovations are coming from the community. [00:04:54] Unlike many of the other computer revolutions of the past, AI is particularly open, where researchers and [00:05:03] companies are actually openly sharing their ideas, publishing their results, blogging about how to [00:05:09] advance the future. And that's because it's a rising tide. As we all do that, we accelerate AI, [00:05:15] we increase the opportunity for more AI, more models, more capabilities, which feeds a virtuous cycle. 
[00:05:23] And NVIDIA, our contribution, again, is to continue to look at ways that we can increase the performance [00:05:30] per watt per dollar. And as a result, turn it into profit for an AI data center. The most recent innovation [00:05:38] and focus was around NVLink 72. We took what was originally a server box that had eight GPUs that were [00:05:46] NVLink connected, H100 or H200 or B200, and disaggregated it. We moved NVLink out of the box and made it rack scale. [00:05:55] An incredibly challenging endeavor. It built some amazing infrastructure that's being deployed now and [00:06:00] being used now. We had to think and rethink about how we do Ethernet scale. We want to scale up beyond the [00:06:08] traditional 10, 20, 50,000 GPUs connected with InfiniBand to the hundred thousand to a million GPUs that want [00:06:16] to be connected with Ethernet scale infrastructure. A very different kind of Ethernet. An Ethernet where [00:06:21] every GPU or every node wants to talk to every other node at full performance. This is not what [00:06:27] Ethernet was designed to do. It was designed to be able to connect clients with servers and have a [00:06:35] usage pattern where it isn't everything talking to everything all the time. That required a new kind [00:06:40] of Ethernet and a new kind of switch. And that's the basis for our Spectrum-X Ethernet platform. [00:06:46] Numerical methods. AI is a statistical problem. It's really just giving it data and producing a [00:06:54] prediction. And using techniques, statistical techniques combined with numerics, can open up the space of [00:07:03] numerical representation. Everything started with FP32, 32-bit floating point, we all know. Google, [00:07:10] 10 years ago, did BFloat16. IEEE has now added FP8, which we support, going all the way down now to [00:07:18] 4-bit floating point. In fact, there are multiple 4-bit floating point formats. 
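To make 4-bit floating point concrete, here is a minimal sketch of block-scaled quantization onto a 4-bit grid. It is an illustration under stated assumptions, not NVIDIA's implementation: the grid is the set of magnitudes representable by an E2M1 format (1 sign, 2 exponent, 1 mantissa bit), while the block size of 16 and the plain Python float used for each block's scale are illustrative choices that the talk does not specify. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (sign bit handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block: pick a scale so the largest magnitude maps to 6.0,
    then round every scaled value to the nearest grid magnitude."""
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        codes.append(-mag if v < 0 else mag)
    return codes, scale


def quantize(values, block_size=16):
    """Split the input into blocks (assumed size: 16) and quantize each one
    with its own scale, so an outlier only costs precision in its own block."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [quantize_block(b) for b in blocks]


def dequantize(quantized):
    """Reconstruct approximate values from (codes, scale) pairs."""
    return [c * s for codes, s in quantized for c in codes]


if __name__ == "__main__":
    data = [0.01 * i for i in range(32)]
    restored = dequantize(quantize(data))
    worst = max(abs(a - b) for a, b in zip(data, restored))
    print(f"worst absolute error over {len(data)} values: {worst:.4f}")
```

The per-block scale is what keeps such a coarse grid usable: each small group of values gets its own dynamic range instead of sharing one range across the whole tensor.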
And recently, in [00:07:24] Blackwell, we brought to market the NVFP4 format, which actually has micro-tensor scaling to try to keep all [00:07:30] this math, the entire AI, operating in just 4 bits. To keep that statistically in range without things [00:07:39] capping, we're constantly using hardware capabilities to keep the calculation totally [00:07:45] in scale with constant biasing and re-biasing of the calculation. NVFP4 is being used for inference, [00:07:53] and actually more recently, we've been able to figure out the numerical algorithmic techniques to do NVFP4, [00:07:58] 4-bit floating point, for training as well. This was recently announced at Hot Chips. [00:08:04] Doing the math is easy; doing the numerics, incredibly complicated. Software is a big part of it. Of course, [00:08:12] NVIDIA is just one contributor to the software ecosystem. Specifically, back at GTC, we [00:08:19] announced NVIDIA Dynamo, and I'll talk more about that. Our ability to take inference and disaggregate [00:08:24] it across all those servers to be able to do context processing and generation and really scale up the [00:08:31] number of GPUs we can assign to a single model requires an entirely new kind of software stack for [00:08:38] serving models. In the end, this results in ROI. Blackwell delivers about 10 times the return on [00:08:46] investment. This is comparing the amount of dollars you spend on infrastructure to how much you can [00:08:52] generate in revenue in your AI factory. And of course, none of this happens alone. I manage teams that work [00:08:59] on PyTorch with YJ, also JAX and other software stacks like OpenAI Triton, but also for [00:09:07] inference, we have SGLang, vLLM. NVIDIA has its own inferencing library called TensorRT-LLM. 
[00:09:14] All this is happening in a community, an open community, and by contributing and supporting [00:09:19] and providing the foundational software for these technologies and allowing researchers and developers or [00:09:25] companies and clouds to be able to use it, we can elevate and bring all these technologies to market [00:09:31] kind of in real time at a pace of, well, accelerating pace. It's quite amazing. [00:09:43] Performance is a difficult thing to measure, particularly in inference. And, you know, [00:09:48] it's easy to make a claim, but how accurate is the model? I can take a model and I can run it and reduce [00:09:55] precision. But if I'm not delivering the same accuracy in a reduced precision [00:10:04] format and running faster, I might as well be running a cheaper model. MLPerf is a benchmark that's been [00:10:10] around for quite some time. In 2019, they started MLPerf Inference. This is where Google, NVIDIA, [00:10:19] Meta, and many other companies in the industry come together, create a forum where we peer review each [00:10:25] other's results to run benchmarks and actually prove that we can actually deliver a model with a certain [00:10:31] amount of performance, with a certain token rate for a particular user, and agree that this is the [00:10:37] established standard for measuring inference. Since 2019, NVIDIA has been submitting and running on all [00:10:45] the benchmarks going back to the Ampere architecture through Hopper to Blackwell. And, you know, [00:10:49] we've been continuously, every year, you know, delivering new benchmarks and raising the performance [00:10:58] of AI models and inference. We hold every single per-GPU record since the MLPerf [00:11:05] data center benchmark started. 
In fact, this last round, which just gets announced today, I think [00:11:10] later on, you'll hear from David Kanter, we've recently added DeepSeek-R1, and the new Llama 3.1 405B and [00:11:17] Llama 2 benchmarks were announced just recently on the new Blackwell Ultra or GB300 platform. [00:11:24] For this, we used NVFP4, we used the Dynamo software, we used the new TensorRT, and we fully [00:11:31] distributed the inference across the NVLink of the entire GB300 rack. In fact, [00:11:39] the software to do all that is quite complicated. And, literally, since we launched Blackwell [00:11:45] last year to today, we have actually doubled the performance of Blackwell just with software. [00:11:51] It's the complexity of how you distribute the communication, apply the new numerical techniques [00:11:57] to improve performance, reduce cost, and get better throughput on MLPerf: 2x in software on the same [00:12:04] Blackwell, totally for free. The same story played out in Hopper. We actually quadrupled Hopper's [00:12:10] performance over its lifetime, on the same Hopper, four times faster just through software and applying [00:12:18] all the optimizations by NVIDIA engineers, but more importantly by the open source community and [00:12:23] researchers who were figuring out new ways to serve the same models at the same grade of accuracy, [00:12:29] just faster, cheaper, and generating more revenue. [00:12:35] This is kind of the math that is today. Actually, a $3 million GB200 NVL72 rack actually will [00:12:42] generate about $30 million in token revenue, to the point where actually a free GPU is not even cheap [00:12:48] enough, if you kind of compare, let's say, a previous generation or alternative platform with [00:12:54] a quarter of the performance. You can see at the bottom there, the gray is actually the GPU cost and [00:13:00] also the shell and server cost that all goes around it. 
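The "free GPU is not cheap enough" argument can be worked through with the figures quoted above. This is a rough sketch, not NVIDIA's model: the $3 million rack cost, the $30 million in token revenue, and the quarter-performance alternative come from the talk, while the assumption that token revenue scales linearly with performance is added here for illustration.

```python
# Figures quoted in the talk, in millions of US dollars.
RACK_COST = 3.0          # GB200 NVL72 rack
RACK_REVENUE = 30.0      # token revenue generated by that rack

# Assumption (not from the talk): revenue scales linearly with performance.
ALT_RELATIVE_PERF = 0.25
alt_revenue = RACK_REVENUE * ALT_RELATIVE_PERF

# Net return for each option; the alternative is given away at zero cost.
rack_net = RACK_REVENUE - RACK_COST
alt_net_if_free = alt_revenue - 0.0

print(f"revenue per dollar spent on the rack: {RACK_REVENUE / RACK_COST:.0f}x")
print(f"net return, GB200 NVL72 rack:      ${rack_net:.1f}M")
print(f"net return, free alternative:      ${alt_net_if_free:.1f}M")
```

Under these assumptions the paid rack nets $27 million against $7.5 million for the free alternative, before the alternative's shell and server costs are even counted.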
If I even delete or remove the GPU cost, [00:13:07] you can see that the revenue we generate over the course of multiple years is demonstrable. This is [00:13:15] how we feel about inference. The performance of the platform is the revenue of an AI factory, [00:13:22] and Blackwell literally delivers a 10X on what you buy and what you make. [00:13:30] So let's talk a little bit more about inference and how it works and where the innovations are [00:13:34] happening right now in inference and what we're focused on. Inference is actually two different [00:13:38] workloads. On the left here, you get a query that comes in and you do traditional [00:13:44] inference serving. So the first thing the AI model does is all the context processing. This is [00:13:50] literally the question that you may ask ChatGPT or a chatbot, but also all the other tokens that come [00:13:56] in that are unique to you, or the system prompt. So these are things that you've asked in the past [00:14:02] or things that are natural to your query that should assist the AI in answering the question, [00:14:07] that it's pulling from databases. It doesn't just look at your question, but all the other input [00:14:13] tokens. That's called the context or prefill phase. And then after it's processed all the query [00:14:19] and all the related data, all the input tokens, then it actually starts outputting tokens that you're [00:14:24] reading. And that's the generation phase, or decode. Typically, we do this over a cluster of GPUs, [00:14:30] depending on the model. It might be four GPUs, eight GPUs, or even multiple GPUs, depending on the model size [00:14:38] and the performance, but it's generally running one model across one set of GPUs. What's interesting, [00:14:44] though, is that the context processing and the generation are actually different. They're both [00:14:47] running the same model, but the context processing can be done in a massively parallel way. 
We can [00:14:52] process all the input tokens in parallel, whereas the generation of AI tends to be autoregressive. [00:14:59] Every time a token gets outputted, you have to run it again to calculate the next token, to calculate the next [00:15:03] token in an autoregressive way, whereas 16K, 32K, even 100K input tokens can be processed in [00:15:12] parallel. As a result, you actually have this performance delta where the context is very compute [00:15:20] heavy because we can do it massively parallel. We can do it all at once, where the generation, decode, [00:15:24] because it's autoregressive, needs a combination of memory bandwidth, NVLink bandwidth, and compute in [00:15:30] order to output tokens quickly. If we stick to the same platform or one GPU, we end up having [00:15:37] to pick the best of both worlds, but sort of in the middle somewhere, not necessarily the optimum [00:15:42] for these two workloads. Today, most modern data centers actually do disaggregated inferencing. So they [00:15:49] actually take that input query and they do the context processing on separate GPUs, creating what's [00:15:54] called the KV cache, basically ending up with just the first token, and then it hands off the KV cache to another [00:16:00] set of GPUs, which are optimized for generation. This allows us to actually split the number of GPUs [00:16:06] between context and generation, and actually dramatically improve the overall performance. [00:16:13] The NVIDIA Dynamo software is designed to do this. It's all open source. You can go check it out on GitHub. [00:16:19] All of our development is GitHub first, so you can see live check-ins. But by doing this optimization, [00:16:25] we can actually configure those GPUs and run the model with different AI kernels in a parallel compute, [00:16:33] context prefill way. 
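The two-phase flow described above, where a prefill step builds the KV cache and emits the first token and a separate decode step continues autoregressively, can be sketched with a toy model. Everything here is hypothetical and for illustration only: `KVCache`, `toy_next_token`, `prefill`, and `decode` are made-up names, the "model" is a trivial arithmetic function, and none of this is the NVIDIA Dynamo API. In a real deployment the two functions would run on different GPU pools and the cache would be transferred between them.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for the per-request state handed from prefill to decode."""
    tokens: list = field(default_factory=list)


def toy_next_token(cache):
    """Hypothetical 'model': the next token depends on everything seen so far."""
    return (sum(cache.tokens) + len(cache.tokens)) % 100


def prefill(prompt_tokens):
    """Context phase: all input tokens are available up front, so the cache
    is built in one pass and the first output token is produced."""
    cache = KVCache(tokens=list(prompt_tokens))
    return cache, toy_next_token(cache)


def decode(cache, first_token, max_new_tokens):
    """Generation phase: autoregressive, one token per step, each step
    extending the cache that was received from the prefill worker."""
    output = [first_token]
    cache.tokens.append(first_token)
    while len(output) < max_new_tokens:
        token = toy_next_token(cache)
        cache.tokens.append(token)
        output.append(token)
    return output


if __name__ == "__main__":
    # "Prefill worker": processes the whole prompt, then hands off the cache.
    cache, first = prefill([3, 1, 4, 1, 5])
    # "Decode worker": could live elsewhere; it only needs the cache and first token.
    print(decode(cache, first, max_new_tokens=4))
```

The point of the split is that `prefill` does one large, parallelizable pass while `decode` is an inherently sequential loop, so the two can be sized and tuned independently.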
And then for the autoregressive part, configure it with different kernels, [00:16:37] different parallelization techniques for the fast autoregressive part. This overall increases the total [00:16:43] throughput on the same number of GPUs. And in fact, that will generate a six times improvement. [00:16:48] Just for Llama models alone, it's about two to four. You can run two to four X faster. [00:16:54] Same number of GPUs, just doing disaggregated. It's a lot harder because you have to have two sets of [00:16:59] basically workloads running in parallel on the system, and also having this KV cache transfer [00:17:05] between the two platforms and keeping everything busy. [00:17:10] This is in production today. There's a company called Baseten, which is an inference [00:17:16] aggregator, basically a model-serving company. They have over 8,000 GPUs of both Hopper and Blackwell [00:17:23] spread around multiple clouds, including Google Cloud. When gpt-oss first launched, [00:17:30] they had the fastest inference performance of any cloud provider because they super optimized, using NVIDIA [00:17:38] Dynamo, that split between the context processing and the output generation. [00:17:43] This is an important example of how much software matters, and how it combines with the [00:17:48] infrastructure of a rack like GB200 NVL72. Overall, disaggregation gives you about six times faster [00:18:00] first token on models like Qwen. We're seeing 3X faster token output on models like DeepSeek, [00:18:07] and basically turning inference into sort of a data center or inference scale kind of problem. [00:18:16] In addition to this, we're seeing context processing becoming more and more important and [00:18:23] higher and higher value. Most of the models you see today can accept up to about 256,000 input tokens, [00:18:31] and there's roughly, you know, two to three tokens per word. 
So you can kind of get a sense of how many [00:18:37] input tokens they can consume when you ask a question to a typical chatbot. But there's this slice of [00:18:43] workloads that actually really love having super long input tokens. One example of that is advanced [00:18:49] coding. We've heard of coding chatbots, basically to help you write code. Advanced coding chatbots take the [00:18:57] entire program and use the AI to add new functionality. Instead of helping you write like a [00:19:03] little loop of code or find and fix bugs, advanced AI coders can take 100,000 lines of code or a million input [00:19:11] tokens of code, and actually be able to output new functionality, entire code blocks, entire portions [00:19:16] of the application, to turn the AI really into a software agent that can interact with a software [00:19:22] developer in a totally different way. But you need to be able to process literally millions of tokens, [00:19:27] but the value is incredibly high because now I have basically a software developer that can amplify my [00:19:34] entire software developer workforce by like 10x because the AI is actually generating the initial [00:19:40] code that the developers can all work from at that kind of functionality scale. The other use case [00:19:45] that's really hot right now is video processing and generation. Think of processing like an hour of HD video [00:19:51] and producing new video content. Generated video is a lot of data, millions of tokens. Today the video [00:19:58] generation market is about 4 billion dollars for AI video, and by the start of the next decade it's projected [00:20:06] to be over a 40 billion dollar market. This is in the entertainment space and also in the media and [00:20:14] marketing and advertising space. One way to think about it is, you know, we used to live in a world where [00:20:20] when we came home and watched our TVs, you know, it was whatever was on the TV. 
We moved to the digital era, [00:20:26] now we have on demand, we can watch whatever we want. And by the end of the decade we're basically [00:20:31] going to be on interactive media. It won't be whatever on demand we want, but all the interaction [00:20:38] we could do for entertainment could be interactive, and of course done through video. So having [00:20:44] these long context capabilities is really interesting, and whenever we see this kind of opportunity at [00:20:51] NVIDIA, where there's a high value market with a place where we're pushing the limits, like how big our input [00:20:59] context is, it's an opportunity for us to optimize further. So maybe there's a way we can actually [00:21:04] process or work on these high value large contexts instead, and maybe not use the same GPUs for context and [00:21:12] generation, but focus on bringing these large markets, these new capabilities to market. And that's why [00:21:19] we announced this morning, here specifically at this AI conference, a new kind of Rubin processor [00:21:28] dedicated to long context processing. This is the Rubin CPX GPU. It is a GPU specifically built for massive [00:21:38] context length processing for these high value use cases of million scale token [00:21:47] processing. It's specifically optimized for context processing and still of course CUDA capable. [00:21:54] This is a new Rubin GPU which we haven't disclosed or talked about before, based on the same Rubin [00:21:59] architecture but a new instantiation. It has over 30 petaflops of NVFP4 and is all CUDA capable. We've actually tripled [00:22:08] down on the attention processing. Attention is the building block of many of the models we have [00:22:14] today, and we actually have added new attention acceleration cores to this chip, which is three times faster than [00:22:23] what we have in the current GB300 GPU. It's memory optimized. So a lot of the compute processing of [00:22:30] context is compute rich. 
It's less dependent on HBM bandwidth or memory bandwidth and less dependent on [00:22:37] having NVLink scalability. So we can use the standard GDDR7 memory that we use today in most of the GPUs [00:22:45] that are available in the market. And of course we doubled down on video. So we added four NVIDIA [00:22:53] video encoders and four decoders for processing and generating AI video content. And this will come [00:23:00] online at the end of 2026, right after our initial launch and availability of NVIDIA Rubin. So how do you [00:23:10] integrate this processor, this chip, this single die Rubin, into the Vera Rubin rack? So here is Vera Rubin. [00:23:18] We announced this at GTC this year. It has over 3.6 exaflops of AI performance in a single [00:23:27] rack. This is coming to be available in the second half of '26. As you can see, there's the compute tray. [00:23:36] Each tray has four Rubins, two Vera CPUs, and ConnectX-9 for the scale out interconnect. It's a pretty [00:23:43] impressive platform. It's got over 3.3 times more compute in a rack than GB300, which is deploying today. [00:23:52] It will have 75 terabytes of fast memory and 1.4 petabytes of HBM4, and it's quite an impressive rack in [00:24:01] itself. It all sits in the same rack architecture as GB300. Hopefully, to help YJ and others here to [00:24:07] deploy, it actually will fit in the same mechanical space. And as you can see, there are 72 packaged [00:24:15] GPUs in a single rack. This is actually a dual die GPU, so it's 144. That's why we call it NVL144: 144 [00:24:21] Rubins in a single rack. But let's talk about CPX. We can actually just add CPX to this platform. [00:24:27] In fact, right down at the bottom we have areas where we can insert the additional context processors and [00:24:33] really boost this rack up for million scale token processing. This is the Vera Rubin NVL144 CPX. 
[00:24:43] All we have done here is taken the same tray, the same architecture, but we've inserted eight of our Rubin [00:24:50] CPXs behind the Vera in connection with the ConnectX-9s, and those processors are available for the entire [00:24:58] rack to do context processing. And we just totally boosted up the performance of the rack. As you can see, [00:25:05] now we're up to eight exaflops, or seven and a half times what GB300 can do today. We've increased our memory [00:25:11] again to 1.7 to 3x. Our fast memory has increased further to 100 terabytes, and again all this will fit [00:25:19] nicely into the existing rack infrastructure, so that for customers that want to prioritize Vera Rubin for [00:25:24] million token input contexts, this is a seamless way to upgrade or integrate into their data centers. [00:25:33] We also don't have to put the CPXs in the tray. We'll also be making a CPX only compute tray version. [00:25:38] And in fact customers can actually just put it as a sidecar to their Vera Rubin [00:25:44] rack. On the left there's a new tray called VR CPX. As you can see, you have two Vera CPUs [00:25:50] and eight CPX processors connected with the same networking behind the scenes, and they can add a VR CPX [00:25:58] rack in their data center side by side, either one to one, one to N, whatever their mix is between their context [00:26:04] processing and their output generation, where all the context processing is happening on the CPX rack [00:26:10] and all the token generation can be happening on VR. And of course a one to one ratio is fine, they [00:26:16] can mix it to two to one, or they can start with some and expand later. All that makes it possible. You [00:26:22] don't have to have them next to each other. The way context generation works is, as soon as you have your [00:26:26] first token, you just need to send that KV cache to your token generators wherever they are in your [00:26:32] data center. 
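As a quick consistency check, the rack-level figures quoted above can be reproduced from the per-tray figures. This is back-of-the-envelope arithmetic under two assumptions that are not stated in the talk: the "four Rubins" per tray are counted as GPU packages, and each Rubin CPX is taken at exactly 30 petaflops (the talk says "over 30").

```python
# Figures quoted in the talk.
PACKAGES_PER_RACK = 72       # packaged Rubin GPUs per rack
DIES_PER_PACKAGE = 2         # dual-die GPU
RUBINS_PER_TRAY = 4          # assumed to mean packages per compute tray
CPX_PER_TRAY = 8             # Rubin CPX processors added per tray
CPX_PFLOPS = 30              # "over 30 petaflops of NVFP4", taken as exactly 30
BASE_RACK_EFLOPS = 3.6       # Vera Rubin NVL144 without CPX

dies_per_rack = PACKAGES_PER_RACK * DIES_PER_PACKAGE
trays_per_rack = PACKAGES_PER_RACK // RUBINS_PER_TRAY
cpx_per_rack = trays_per_rack * CPX_PER_TRAY
cpx_eflops = cpx_per_rack * CPX_PFLOPS / 1000
total_eflops = BASE_RACK_EFLOPS + cpx_eflops

print(f"GPU dies per rack:   {dies_per_rack}")
print(f"compute trays:       {trays_per_rack}")
print(f"CPX per rack:        {cpx_per_rack}")
print(f"CPX compute:         {cpx_eflops:.2f} exaflops")
print(f"rack total with CPX: {total_eflops:.2f} exaflops")
```

Under these assumptions the total lands just under the "eight exaflops" quoted for the NVL144 CPX rack, which is what one would expect if each CPX delivers slightly more than 30 petaflops.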
Quite an upgrade, and running really fast. We've already been working with some of the [00:26:39] lighthouse customers who are super interested in long context. These are different AI innovators. We've all [00:26:45] heard of Cursor, who is probably one of the leaders in intelligent code generation, and NVIDIA uses Cursor along [00:26:53] with many others, and this will help them get to the next level of development productivity with those [00:26:58] million-input-token code generators. Magic, magic.dev, actually has a hundred million token [00:27:08] input model, quite impressive, and we're working with them to figure out how to get that working on CPX, [00:27:14] along with Runway, which is a company which generates cinematic video, and other leading [00:27:22] inference providers like Fireworks and Together AI, who have some of the most advanced techniques for [00:27:28] the fastest model serving, and how they can get to that next level of million token inference. [00:27:38] So we've added another chip to our roadmap. You can see here, in Blackwell [00:27:44] we have the Blackwell and Blackwell Ultra, the Grace CPU, our NVLink switch chip, the [00:27:52] Spectrum 5 switch, and of course the CX8 NIC. All of these chips come together to make AI and AI infra [00:28:00] work. It's never just one chip, it's a family. And now with Rubin we've added the CPX processor, a different Rubin GPU [00:28:09] dedicated to and optimized for context processing, that'll be paired and matched with Rubin for the [00:28:16] million scale context processing and fits nicely into the full family. And of course that'll [00:28:23] extend, and we look forward to talking more about Feynman when we get a little bit closer. 
[00:28:30] All this has to come together. AI is not served and data centers aren't built with one processor. They're [00:28:37] connected machines. They require CPUs, they require GPUs, they require various levels of accelerating the [00:28:44] network, and infrastructure at scale that needs to work as one in order to serve these models and bring that [00:28:50] token value and that revenue that inference will generate all together in one. And that's what we're [00:28:57] focused on at NVIDIA: bringing all that infrastructure and the baseline software stack to market as quickly [00:29:03] as we possibly can. One challenge of course is how do we build those future data centers. I've been showing you a [00:29:11] lot of racks, we've shown a lot of chips. You know, the next challenge of course is, what is that [00:29:18] future data center going to look like? YJ talked about how the CPU data centers have evolved and how they're so [00:29:24] different. And, you know, NVIDIA is also a huge proponent of open standards. We're members of OCP. We've [00:29:32] contributed the GB200 rack to OCP, and we'll do so for the upcoming infrastructure as [00:29:39] well. But the problem now is becoming a data center scale one. How can we build and provide, working [00:29:44] with the community, a data center roadmap, not just a rack and GPU roadmap, that's designed for [00:29:51] the future and allows us to scale and grow? It's pushing the limits of power generation, mechanical, plumbing, [00:29:58] electrical, you know, bus bar design, row length, CDUs, and how all these pieces and components are going [00:30:05] to work together in order to run and have these data center factories work well and be future-proofed as we [00:30:11] scale out, not just for Vera Rubin but Vera Rubin Ultra and going on to Feynman. We've started a new [00:30:20] initiative called the AI Factory Gigascale Reference Design, and of course NVIDIA isn't doing this [00:30:27] by ourselves. We're working with 
the entire community, from Cadence to Emerald AI to ETAP, GE Vernova, Jacobs, which is the main [00:30:35] engineering consulting firm that builds a lot of these data centers, Schneider Electric, Siemens, and Vertiv, [00:30:42] to build the cooling, plumbing, and electrical systems that can scale to deliver this kind [00:30:48] of future data center in a way that's future-proof. These take a long time to build and need to have a reference [00:30:54] architecture. All these components need to talk to each other. The CDUs and the power and [00:31:00] the data center operations have to work as one, along with the GPUs and compute infrastructure, so they [00:31:05] work seamlessly and maintain uptime and high efficiency. We're working with these partners now, and we expect to [00:31:10] have the first version of the reference design done at our next upcoming GTC conference. That's my [00:31:18] update for today. Thank you, everybody. It was really exciting to launch CPX here at AI Infra, [00:31:24] and I look forward to the rest of the talks.
