Networking for AI Scaling, presented by Broadcom

[00:00:00] Speaker 1: Good morning. Thank you, Muhammad, for that excellent presentation. Between yesterday and this morning, you know, you've heard a lot of news. And if you look at the last couple of weeks, one single company has announced 10 gigawatt data centers plus another 6 gigawatt data centers plus another 10 gigawatt data centers. You're talking about a single company over the course of four weeks announcing about 26 gigawatts of data centers. If you think about it, 26 gigawatts of data centers equates roughly about 15 million, you know, XPUs. By the way, during the course of my next 15 minutes, when I refer to XPUs, they could be GPUs, TPUs, or somebody building their own accelerators. So I just want to make sure everybody gets that nomenclature understood. Now, when you think about it, why are people building so many data centers, so many gigawatts, and, you know, building this massive infrastructure? It's because machine learning and AI is a distributed computing system. And we've actually talked about this three years ago, which is, at the end of it, that no one XPU is large enough to handle the workload that needs to be, you know, done. And you have to have many of these XPUs stitched together, and that's where the network plays an extremely important role. And the network is the computer. And three years ago at OCP, we took the position that the networking technology of choice is Ethernet. And at that point in time, there was another technology people thought was the only way to build, you know, AI clusters. And we said, no. It has to be Ethernet because Ethernet is open, it's resilient, and it's very economical. Right. And with that, you know, in mind, I want to kind of share with you the advancements in Ethernet that have happened over the last couple of years, and true to what we promised three years ago, you know, what we've been able to deliver in the marketplace. 
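As a rough check on the opening figures, the sketch below divides the quoted 26 gigawatts by the quoted 15 million XPUs. The result is an all-in facility power budget per accelerator (cooling, networking, and host power included), not a chip power rating. That interpretation is an assumption, since the talk does not break the number down.

```python
# Back-of-envelope check of the figures quoted in the talk.
# Assumption: the per-XPU result is all-in facility power, not chip TDP.

ANNOUNCED_GW = [10, 6, 10]   # the three announcements cited by the speaker
XPU_COUNT = 15_000_000       # "roughly about 15 million XPUs"


def implied_watts_per_xpu(gigawatts, xpus):
    """Total announced power in watts divided by the accelerator count."""
    return sum(gigawatts) * 1e9 / xpus


total_gw = sum(ANNOUNCED_GW)
watts = implied_watts_per_xpu(ANNOUNCED_GW, XPU_COUNT)
print(f"{total_gw} GW / {XPU_COUNT:,} XPUs = {watts:,.0f} W per XPU")
```

Under these numbers each XPU accounts for a bit over 1.7 kW of data center capacity.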
So if you look at it, in terms of a network for AI, there is obviously the scale-up, there's the scale-out, and there's the scale-across between data centers. Ethernet is the only technology. I'll say this one more time. Ethernet is the only networking technology that actually cuts across all of these. Across scale-up, scale-out, and scale-across data centers. I believe in Ethernet so much that I carry this Ethernet cable in my pocket today to show you. Something as simple as this is the only way you can build a machine learning cluster that's large enough. By the way, it's very easy to find somebody who can work on Ethernet technologies. Very hard to find somebody who's going to work on all these other esoteric technologies that people have pitched as the way to build machine learning clusters. Now, when you look at scale-up, one of the things to think about is why there's so much talk about scale-up and what's going on in scale-up. And really what scale-up is trying to do is make sure that when you have multiple XPUs, the HBM memory on one XPU is available to another XPU. And when you think about what's happening, the amount of bandwidth that each XPU has today with the HBM attached to it is about 40 terabits per second, because it has four HBMs each running at roughly 9.6 terabits per second. Tomorrow, you'll see eight HBMs each running at close to 12.8 terabits per second, for a total bandwidth of about 100 terabits per second. So when you have two XPUs and you want them both to talk to each other and it goes over a network, you want to make sure the network has that very high bandwidth to facilitate this. That's number one when you think about scale-up. Now, here is where Ethernet comes into play. There've been discussions yesterday, you've seen in the keynote, about ESUN. ESUN stands for Ethernet for Scale-Up Networking. And here's the beautiful part about it. As I said, Ethernet is the only open technology, truly open, by the way.
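The per-XPU memory bandwidth figures quoted above (four high-bandwidth memory stacks at roughly 9.6 Tb/s today, eight at close to 12.8 Tb/s next) can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers from the talk. The conversion to terabytes per second and the function name are added here for illustration.

```python
# Aggregate memory bandwidth per XPU, using the figures quoted in the talk.

def aggregate_tbps(stacks: int, tbps_per_stack: float) -> float:
    """Total memory bandwidth in terabits per second."""
    return stacks * tbps_per_stack


today = aggregate_tbps(4, 9.6)       # "about 40 terabits" in the talk
tomorrow = aggregate_tbps(8, 12.8)   # "100 terabits per second" in the talk

for label, tbps in (("today", today), ("tomorrow", tomorrow)):
    # Divide by 8 to convert terabits/s to terabytes/s.
    print(f"{label}: {tbps:.1f} Tb/s = {tbps / 8:.1f} TB/s")

print(f"growth factor: {tomorrow / today:.2f}x")
```

The exact products are 38.4 Tb/s and 102.4 Tb/s, which the speaker rounds to 40 and 100.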
And I'll show you why it actually makes a big difference here. On the top of this picture is accelerators, XPUs. And I've just put them as A, B, and C, right? A is one company is making it, B another company is making it, C another company is making it. Because if you think about it, if the world has 15 plus million XPUs being bought by a single company, you don't want to have one vendor that's making these XPUs or GPUs or whatever you want to call it. You want a heterogeneous environment, otherwise we'll be living in a massive monopoly. So when you have these different companies making these XPUs, think about it. There's all engineers in these companies. They all have their own inventions. They all want to move at their own pace. The last thing they want to do is wait for somebody else to write a specification on how that XPU is going to do scale-up. So when you look at any other alternative technologies outside of Ethernet, they try to define what happens in the XPU and then what happens in the link, whatever you want to call that link. But Ethernet does it very differently. What it does is it does a clean demarcation between what happens in the XPU and then what happens on the Ethernet networking layer. By doing so, what you're able to do is let all these XPU companies compete on the merits of their own XPUs, innovate, and essentially do the scale-up the way they see fit in terms of how they're going to schedule the traffic, how they're going to handle the memory semantics, and how they're going to handle the software layer that sits on top of it. And in the meantime, the Ethernet layer underneath is very simple. It's all either standards-based, existing specifications-based, nothing proprietary. Okay, that's extremely important. Ethernet, open, nothing proprietary. And here's the thing I will tell you. When somebody says it's open, you should not see words like, I am a certified platform, right, or my NIC only works with my switches. 
Those are not what you should be seeing. You should be seeing a truly open interface. And that's what Ethernet brings to bear. Because of that, there is an agreement in the industry that says, look, Ethernet is the best networking technology for scale-up. And let's all agree on it, and let's standardize on it based on existing specifications, and let everybody go build their Ethernet switches, let everybody go build their XPUs as they see fit. And that's what the companies have come together on here. So if you look at the names of the companies here, these are all companies who are leaders in their own right, in the markets that they're playing in. And they've all collectively decided that we can make Ethernet work for scale-up networking, but more importantly, there's nothing new to be made, it's existing Ethernet. And they're working towards accelerating the adoption of Ethernet for scale-up. So one more thing here, again, the beautiful thing about Ethernet: it decouples what happens in the XPU from what happens in the network, and you can use existing Ethernet switches, both those that have been built and those that will be built, to do scale-up. And that's the very powerful part of it. And to facilitate the scale-up, it's not just about building the silicon, building the hardware for it. We also want to make sure that there is software for it. And so there's a work group that's being formed in OCP to make sure the software that is needed for scale-up Ethernet is available. And that's a version of SONiC. Just kind of think of it as a stripped-down version of SONiC with very specific features optimized for scale-up. So the ecosystem of scale-up on Ethernet is here and available. Now, one of the questions that has come up in the past is, well, we know Ethernet is open. We know Ethernet is ubiquitous. We know it has high bandwidth. But can it achieve the lower latencies that might be required for scale-up use cases? And it can.
And this is actually a slide that shows that you can achieve sub-400 nanosecond latency for data traversing from an XPU through a switch back into an XPU. So nothing about Ethernet limits its ability to build the lowest latency interconnect that's available. Now, the next question that comes up is, okay, today we are limited in what we do on scale-up to fewer than about 100 XPUs inside a rack. But when you actually go talk to the people who are developing these large language models, they want to see a scale-up that's a larger domain. Beyond fewer than 100, to a couple of hundred, potentially even going to 1,000 plus. There's only one networking technology that has been proven in the marketplace for decades that can reliably scale across rows, across data centers, and that's Ethernet. So we will deliver low latency on Ethernet. We will deliver the reliability on Ethernet. And we will actually increase the size of the scale-up domains beyond being constrained to a physical rack, and the limitations of copper inside a physical rack, to go across racks. To do that, obviously, what you need is Ethernet switches that continue to increase in bandwidth. And all I'm showing here is an example of us as Broadcom being able to double the bandwidth every 18 to 24 months. But that's the beauty of this Ethernet marketplace. There's significant competition, and it forces all of us to be on our toes and deliver the best product in a truly open fashion as fast as we can and keep moving this industry very fast. And when you have this kind of bandwidth, we've talked about what we can achieve in scale-up, but also let me touch on what we can achieve in scale-out.
There was a presentation yesterday from Oracle, and there was something very profound in it that said, look, when you build a very large domain size, the big benefit that you get is you actually increase the utilization of the system, and you get far more efficiency by building larger and larger clusters. And shown here are two possible ways you can build a 128K GPU cluster, one that uses a 50 terabit switch, and another one that uses a 100 terabit switch. And what you find is when you double the bandwidth of the Ethernet switch to 100 terabits, you reduce the number of layers in the network, and that reduces the congestion in the network, reduces the number of optics, reduces the number of switches which are needed, reduces the amount of latency that is incurred through the overall network, and all of this is great for job completion time. Think about it this way: a 100,000 GPU cluster is probably going to cost you a minimum of $3 billion, depending on how well you negotiate. The network for the same thing, an Ethernet network for it, is probably going to cost you less than $100 million, excluding the cables and the NICs which are needed. And there was a presentation done a couple of years ago showing that if a network is not done right, you can idle your GPUs for 50% of the time. So by building a high performance network, the network pays for itself. And the only high performance network, my friends, is an Ethernet one, because it's truly open, it's truly competitive, and you can see in the marketplace today that Ethernet has a lot of choices and advancements. And talking about advancements, one of the things that I want to share with you is that to get the bandwidth out of these switches, obviously you have copper as one of the options. But increasingly, as you go for further distances, you want to go on optics.
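The claim that doubling switch bandwidth removes a network layer for a 128K GPU cluster can be illustrated with standard fat-tree arithmetic. The sketch below is not the topology from the slide. It assumes a non-blocking folded-Clos fabric, switch capacities of 51.2 Tb/s and 102.4 Tb/s for the "50 terabit" and "100 terabit" switches, and 200 Gb/s per GPU-facing port. All three are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Sketch: tiers needed for a non-blocking folded-Clos (fat-tree) fabric.
# Assumptions (not from the talk): 51.2T and 102.4T switch capacities,
# 200 Gb/s per endpoint port, half of each switch's ports facing upward.

def switch_radix(switch_gbps: int, port_gbps: int) -> int:
    """Number of ports a switch offers at the given port speed."""
    return switch_gbps // port_gbps


def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints a non-blocking fat tree with `tiers` levels can attach."""
    return 2 * (radix // 2) ** tiers


def tiers_needed(endpoints: int, radix: int) -> int:
    """Smallest number of tiers whose capacity covers `endpoints`."""
    tiers = 1
    while max_endpoints(radix, tiers) < endpoints:
        tiers += 1
    return tiers


def worst_case_switch_hops(tiers: int) -> int:
    """Switches crossed going up to the top tier and back down."""
    return 2 * tiers - 1


CLUSTER = 128 * 1024   # 131,072 GPUs
PORT_GBPS = 200        # assumed per-GPU link speed

for label, gbps in (("51.2T", 51_200), ("102.4T", 102_400)):
    radix = switch_radix(gbps, PORT_GBPS)
    tiers = tiers_needed(CLUSTER, radix)
    hops = worst_case_switch_hops(tiers)
    print(f"{label}: radix {radix}, {tiers} tiers, up to {hops} switch hops")
```

Under these assumptions the smaller switch has 256 ports and needs three tiers, while the larger one has 512 ports and reaches exactly 131,072 endpoints in two tiers. Fewer tiers means fewer switches, fewer optical links, and fewer hops, which is the effect described above. Different port speeds or oversubscription ratios would change the numbers.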
And when you choose optics, you have a choice of pluggables, LPO, you know, linear pluggable optics, or you can go to near-packaged optics or potentially co-packaged optics. There was a recent paper published by Meta that actually tested the reliability of co-packaged optics for about a million hours. And they found that co-packaged optics are actually more reliable than pluggables for a host of reasons, just the way the box is integrated, to say the least. Now Broadcom is in the third generation of co-packaged optics. Third generation. For those who thought co-packaged optics just got announced last year, that's wrong. We are in the third generation of co-packaged optics. And again, in the spirit of openness, Broadcom's switching platform supports not just our own co-packaged optics, but also co-packaged optics from our partners, like NTT. One of the other things I want to quickly touch on here goes back to three years ago, when we talked about scale-out and how Ethernet could be used for scale-out. We were working with the Ultra Ethernet Consortium to improve the performance of RDMA on Ethernet. And one of the things we needed to do was actually have a NIC that is capable of doing advanced RDMA functionalities and is compliant with the Ultra Ethernet specification. Today Broadcom announced our chip, which is called the Thor Ultra, which is the industry's first true 800-gig NIC. This is not a NIC that has two 400-gig flows pretending to be an 800-gig NIC, it's truly an 800-gig NIC. Two different form factors, either 8 by 100 or 4 by 200, that will deliver all the Ultra Ethernet compliant features. Specifically focused on enhanced RDMA, which is multipathing, out-of-order placement, selective retries, all the things which are needed to build clusters of hundreds of thousands of GPUs for scale-out with RDMA. And the one thing I promise you on this NIC: it will not say that, if it is not connected to a Broadcom switch, the functions are going to degrade.
It's truly open. The adaptive routing functionalities, the advanced RDMA functionalities, you can connect it to any Ethernet switch in the world. It will deliver the same features. You can connect it to any cable in the world. It will not degrade performance. And you can connect it to any XPU. It will work fine. And that's truly the benefit of Ethernet: an open, interoperable ecosystem and technologies. And then, obviously, we talked about scale-up and scale-out. And you have to go across data centers, because oftentimes you don't find one gigawatt of capacity in one location. And that's what we are able to do with deep-buffered switches. So I want to rest my case by saying, Ethernet is the only technology that can cover all the use cases for AI, from scale-up to scale-out to scaling across data centers. And there are going to be products available from a multitude of companies in this space. You probably already have heard and will hear from them. And the entire suite of technologies which are needed, whether on the host side, the NIC side, or the Ethernet switches, is available on Ethernet. So I want to leave by saying, the network is the computer. Ethernet is the only way to network. And it truly is the most open, efficient, and reliable networking technology that's out there. Thank you for your time.
