Inside the World's Largest AI Supercluster xAI Colossus

[00:00:00] Speaker 1: This is the largest AI cluster in the world. [00:00:03] Speaker 2: The team at X is building a massive AI supercomputer that encompasses over 100,000 GPUs, exabytes of storage, super fast networking. This place is absolutely amazing. And this entire supercomputer is built to power Grok. [00:00:21] Speaker 1: Now, xAI is building something with Grok [00:00:23] Speaker 2: that is far more than just a simple chatbot [00:00:26] Speaker 1: like we've seen before. And that is exactly why there is a giant cluster here. [00:00:31] Speaker 2: Today, we're gonna go back inside these data halls and show you a whole bunch of stuff on what makes this work, what makes this special, and just all the really cool engineering that went into this. Now, this is actually my second time at the facility. And let me just tell you, the speed at which this thing was built is absolutely amazing. This entire facility with over 100,000 GPUs was built in only 122 days. And just to give you some frame of reference, the largest exascale supercomputers only have a fraction, maybe half to a quarter as many GPUs as xAI has here. And yet, those deployments generally take many years from start to finish. The engineering accomplishment here is absolutely amazing. And there's work still being done. But I thought, let's go take a look inside one of these data halls. Let's also show some of the facilities and just kind of show you what's going on and how something like this gets built. We need to say that this video is sponsored by Supermicro, since it is. But I also just want to say thank you to the X and xAI teams for giving us permission to go and film this. Also, of course, thank you to Elon and his teams for approving this and making this possible. With that, let's get inside a data hall to see how this all works. Inside the data hall, xAI is using a pretty common design. This is a raised floor data hall and above, we have our power.
Down below, we have all of the pipes for the liquid cooling so that heat can be exchanged to the facility chillers. Each one of these compute halls has about 25,000 GPUs. Plus all the storage, the fiber optic, high-speed networking. It's all built into the hall and then they're connected together. The connections to each data hall are basically the fiber optic cables, the liquid cooling and plumbing for the water. Then there's also just a bunch of power delivery that's super cool. Inside the compute hall, we have these clusters. Now, each of these is made up of eight liquid-cooled racks by Supermicro. These are NVIDIA H100 racks and inside each of these eight racks, there are eight NVIDIA HGX H100 platforms. We also have all of the liquid cooling for these as well as the networking and stuff to make each one of these about a 512 NVIDIA GPU mini cluster. Now, these Supermicro NVIDIA liquid-cooled AI racks are probably the most advanced AI racks deployed at this scale. And I'm gonna show you exactly why right here. So what you're gonna see is that each of these racks has a total of eight NVIDIA HGX H100 systems. So we have a total of 64 GPUs per rack. Now, in the top section, we have the Supermicro NVIDIA HGX H100. Each one of these HGX H100s has a lot of the components that are really important in these systems like the eight NVIDIA H100 or Hopper GPUs. Plus, there's also NVIDIA NVLink switches and all of that is on the baseboard. Now, one of the really defining things of the Supermicro platform versus some of the others in the market is that you can actually go in and pull this top section out. You can see the little levers here. They'd probably get really mad if I did this, but we have other videos where I've done that and even on these liquid-cooled systems. And that leaves the bottom tray. The bottom tray has things like our CPUs, which are fast x86 CPUs, as well as our large PCIe switches.
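As a quick check on the GPU counts quoted above, the arithmetic can be sketched in Python. The per-system, per-rack, and per-cluster figures come from the walkthrough; the clusters-per-hall and hall-count values are only rough estimates derived from the rounded "about 25,000" and "over 100,000" figures, not numbers stated in the video, and all constant and function names are illustrative.

```python
# Figures quoted in the walkthrough.
GPUS_PER_HGX_SYSTEM = 8        # eight H100 GPUs on each HGX H100 baseboard
SYSTEMS_PER_RACK = 8           # eight HGX H100 systems per rack
RACKS_PER_CLUSTER = 8          # eight liquid-cooled racks per mini cluster
GPUS_PER_HALL_APPROX = 25_000  # "about 25,000 GPUs" per compute hall (rounded)
TOTAL_GPUS_APPROX = 100_000    # "over 100,000 GPUs" in the facility (rounded)


def gpus_per_rack() -> int:
    """GPUs in one rack: systems per rack times GPUs per system."""
    return GPUS_PER_HGX_SYSTEM * SYSTEMS_PER_RACK


def gpus_per_cluster() -> int:
    """GPUs in one eight-rack mini cluster."""
    return gpus_per_rack() * RACKS_PER_CLUSTER


def estimated_clusters_per_hall() -> float:
    """Derived estimate only: rounded hall figure divided by cluster size."""
    return GPUS_PER_HALL_APPROX / gpus_per_cluster()


def estimated_hall_count() -> float:
    """Derived estimate only: rounded facility total divided by hall figure."""
    return TOTAL_GPUS_APPROX / GPUS_PER_HALL_APPROX


if __name__ == "__main__":
    print("GPUs per rack:", gpus_per_rack())
    print("GPUs per mini cluster:", gpus_per_cluster())
    print("Estimated mini clusters per hall:", round(estimated_clusters_per_hall(), 1))
    print("Estimated number of halls:", estimated_hall_count())
```

The first two results match the 64 GPUs per rack and 512 GPUs per mini cluster mentioned in the video; the last two are back-of-the-envelope values that depend entirely on the rounded inputs.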
Just to give you some frame of reference on how advanced these are, all of that is only in 4U of rack space and it's all serviceable just on trays. There are other options from Supermicro and others in the industry that are 6U or 8U for a similar style system. And there are options in the market that simply don't have this kind of accessibility and serviceability, which is why these things, even though they're very compact, they're also extremely advanced and easily serviceable. Now, on the front of this, you're definitely gonna see the fact that there are all these little tubes and they go through this little bar. This little bar is what is called a manifold. And so we have a 1U manifold for each of these systems. That 1U manifold is how we connect our liquid cooling. Now, all of these little tubes are in pairs. We have both a blue and a red tube that comes out of each of these. So inside these tubes, there are two different liquid cooling blocks. And as you might imagine, the cooler liquid goes into the server on the blue side and then out of the server on the red side. And that is brought to the manifold that's here and it goes back to the overall rack manifolds that are in the back of the rack. This design means that you can go and actually slide these systems out, service the HGX H100 board, service the CPUs, memory, and all that kind of stuff and just go pop these things out. Now, I'm not gonna do that. This is actually training a real model right now, but it takes a matter of seconds at most. Now, there are a total of eight of these servers in this rack. So it's a total of 64 GPUs plus 16 CPUs, a bunch of memory and all kinds of other stuff. But on the bottom is a big part of what makes this a really scalable solution. These are Supermicro CDUs, which are cooling distribution units. So there are a couple super cool things about the CDU. First, you have the management unit.
So each one of these CDUs has its own management so you can monitor things like flow rate, temperature, and all these kinds of things that you need to, just to ensure that you're bringing the right amount of liquid through all these servers in your rack. And that of course ties into the central management interface and that can be monitored. So if something goes wrong, you can see it remotely. Now, the other cool thing is that there are two pumps down here. Now, these pumps you can actually service because they're for redundancy. If a pump were to go out, you just pull the thing out, then you go put a new one back in. Again, I'm not gonna go do this on a live system, but we've done it previously on STH. Okay, so it's time to get to the other side of the rack just to kind of give you an idea of how the back of these racks look. On the side, you'll see that we have our overall rack manifold. You're going to again see our red and blue liquid cooled rails. Now, behind that, you're going to see that we have a bunch of three-phase power strips that are in here since, you know, these things use a lot of power in each rack. And on the other side, you're gonna see that we have all of these servers. Now, these 4U Supermicro servers have a total of eight NVIDIA BlueField-3 SuperNICs, and that's really there for the AI network. There's also a ConnectX-7, and that's really for all the other things that the server might need, like on the CPU side. Now, you might wonder, why are there fans in a system that's liquid cooled? The real reason is that these fans are needed to cool all of the little tiny components, the memory, the DIMMs, and all that kind of stuff in a system. Now, even though there are fans here, it's definitely not as loud as if this were all air cooled. And another really important thing is that this isn't too hot.
Like, I'm just standing here, and it doesn't feel like I'm being heated, like if I were behind an air-cooled system that was just blasting all that hot air towards me. So there is a big difference in the fact that this is a liquid-cooled versus an air-cooled system. Now, on the back of this rack, there is a rear-door heat exchanger. Now, how a rear-door heat exchanger works is that the heat from the server is transferred to the liquid that flows through the radiator. Air is being pulled by these large fans through the heat exchanger, and that's how all the extra heat in the racks ends up getting removed. The special part about this design is that each rack ends up being room neutral to the overall cooling of the data center. You don't see, as you're walking around here, these giant air conditioners or air handling units, something that you would see in a lot of data centers over the years. This is a really cool feature, and it really helps each of these racks be a self-contained unit. And another super fun fact here is that you'll see that the back of these is lit up in blue, and you might think that's a branding thing, or maybe they did that just because STH is blue and that's why it's blue. Absolutely not. Instead, this is actually a status light. So as you're walking through here, if you see a bunch of blue, that's cool. If you see something like red or something like that, that's not good. So when you're walking down the data center hall, if you see a red one, you know, oh, that one needs to get serviced, and the rest are okay. A couple weeks ago, I got to see these things get fired up, and it was super awesome to see all of them come up at the same time. Now, of course, with any large-scale cluster, you need CPU compute as well, for those tasks that GPUs are just not really good at. And that's exactly what these are next to me.
These are 42 1U servers per rack that provide all of the CPU compute when you need to do things like data prep and all that kind of stuff that really works well on CPU. And so in any large cluster, you're always going to see a set of CPU compute nodes along with your GPU compute nodes. Now, this entire cluster runs on Ethernet, which is the same basic technology you would find networking for your laptop, your PC, or a bunch of other devices. Now, each of the servers uses an NVIDIA BlueField-3 SuperNIC DPU. We have definitely covered the NVIDIA BlueField-3 DPUs and previous generations on STH for many years. And if you've seen that, you probably know it means that there's a lot going on here, more than just your basic Ethernet. Each of these NVIDIA BlueField-3 cards provides 400 gigabit networking all the way up to the AI infrastructure. And that's kind of similar to how your PC or laptop will go access the internet just to watch this video. Now, for those that are steeped in the supercomputer realm, they'll definitely say, hey, the way that a lot of folks make clusters is they use technologies like InfiniBand or other exotic interconnects. While those fabrics often work for the world's supercomputers, the world's gargantuan networks run on Ethernet. And that's one of the reasons that they're using it here because they don't need to just scale to the size of a supercomputer. They need to scale to a massive AI cluster. Now, of course, this is not the same Ethernet that you have in your PC or notebook or something like that. It's much faster, probably something like 400 times faster. But NVIDIA has some other processes going along into this. Behind me, we have the NVIDIA SN5600, which is a 64 port, 800 gigabit Ethernet switch. And that means each one of these can be split and run 128 400 gigabit Ethernet links. And these NVIDIA Spectrum-X switches, along with the BlueField-3 DPUs, can do amazing things.
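The port and bandwidth figures quoted above can be checked with a short Python sketch. The 64-port, 800 gigabit switch, the 400 gigabit cards, and the eight SuperNICs per GPU server come from the walkthrough. The 1 gigabit PC baseline behind the "400 times faster" remark is an assumption, the aggregate totals are derived rather than stated in the video, and the names used here are illustrative.

```python
# Figures quoted in the walkthrough.
SWITCH_PORTS = 64          # ports on the switch
SWITCH_PORT_GBPS = 800     # speed of each switch port, in Gb/s
NIC_GBPS = 400             # speed of each BlueField-3 card, in Gb/s
SUPERNICS_PER_SERVER = 8   # SuperNICs per GPU server for the AI network

# Assumption, not from the video: a typical PC Ethernet port runs at 1 Gb/s.
ASSUMED_PC_ETHERNET_GBPS = 1


def breakout_links(ports: int, port_gbps: int, link_gbps: int) -> int:
    """Number of slower links obtained by splitting every port evenly."""
    if port_gbps % link_gbps != 0:
        raise ValueError("port speed must be a whole multiple of the link speed")
    return ports * (port_gbps // link_gbps)


def aggregate_gbps(links: int, link_gbps: int) -> int:
    """Total one-direction bandwidth across a set of identical links."""
    return links * link_gbps


if __name__ == "__main__":
    links = breakout_links(SWITCH_PORTS, SWITCH_PORT_GBPS, NIC_GBPS)
    print("400G links per switch:", links)
    print("Switch aggregate (Gb/s):", aggregate_gbps(SWITCH_PORTS, SWITCH_PORT_GBPS))
    print("AI network per server (Gb/s):", aggregate_gbps(SUPERNICS_PER_SERVER, NIC_GBPS))
    print("Speedup over assumed PC port:", NIC_GBPS // ASSUMED_PC_ETHERNET_GBPS)
```

Splitting each 800 gigabit port in two gives the 128 links mentioned in the video, and the 400x figure only holds against the assumed 1 gigabit baseline.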
These have a host of features and processing capabilities that allow the NVIDIA GPUs and the entire cluster to run at their maximum performance levels. The NVIDIA solution can do things like offload various security protocols and has advanced flow management to help ensure that you don't have a congested network. The other thing you can do, though, is you can maintain a flow of data and packets throughout the entire cluster and help make sure that things get to the right place at the right time. And here, it could be used not just for the RDMA network for the GPUs, but also for things like providing storage. As you can probably tell, since the single-mode fiber blends in with my yellow shirt, there's an absolute ton of optics and fiber and stuff running throughout this entire building to make sure that the communication happens efficiently and fast. Now, these over here are the north-south switches. In a modern AI cluster, the east-west traffic pattern is usually dominant. Still, these fancy high-end switches can handle a ton of 400 gig Ethernet connections for north-south traffic, just like the other switches that we looked at for east-west. These are not being used for the RDMA network, which is a fast network that the GPUs require. Instead, this is being used for all of the other work, all of the other supercomputer tasks in the cluster. But these switches are another 64-port, 800-gigabit Ethernet switch, which is just really cool. It's a really high-end system, and this is definitely one of the first deployments for this type of switch in the world. Now, with a large-scale AI cluster like this, storage is delivered a little differently than you would be used to in something like your desktop, your notebook, tablet, all that kind of stuff where you have local storage. Instead, the vast majority of storage is delivered over the network.
The reason for this is that this type of AI training needs tons of storage, and so you can't really fit it in each of those GPU servers. And also, all of the GPU and CPU servers need access to all of that storage, and so that's the reason there's a giant storage cluster here. Now, with any liquid-cooled data center, [00:12:29] Speaker 1: a big part of that is, of course, the liquid cooling. And if you look around me right now, what you're going to see is these just absolutely giant pipes. [00:12:39] Speaker 2: I mean, these things are huge. And what these pipes do is they take liquid or water from the outside that's generally cooler, and they bring it inside the facility. It gets distributed into the different data halls, and then from there, it goes into the CDUs, which you saw, that's where all of the racks have their GPUs and all that kind of stuff. All that heat from all those GPU servers goes into the CDUs, gets exchanged to these racks. It comes back out as warm water, and at that point, it can go outside to a chiller. Now, these chillers are not made for things like, you know, making ice cubes or anything like that. They just lower the temperature of the water by a couple of degrees, and then that water can get recirculated at a cooler temperature, and the whole process can cycle over and over again. That's how data centers like this are able to reuse water. And by the way, these pipes are absolutely huge, and I can certainly feel the water flowing through here right now. Now, one other amazing innovation here is really what's next to me, which is the Tesla Megapacks that actually power the training jobs that are in this facility. What they found was that there are these little millisecond variations in power when all of these GPUs start training something, and when that happens, that was causing all kinds of problems with the power infrastructure. 
So the answer was to have all of the input power from generators and what have you go into the batteries and have the batteries discharge and power the training jobs. Now, of course, [00:14:02] Speaker 1: that's the kind of engineering challenge that you have to solve for when you're building something of this scale. Now, of course, what you're seeing here today is really like a phase one of this entire cluster. This thing is already the largest AI training cluster in the world, and they are still building, which is absolutely amazing. Now, of course, a project like this takes a ton of people. I just want to say thank you, of course, to our team, [00:14:24] Speaker 2: but also the Supermicro team, the xAI team, and everybody else that's been involved in making this happen. Now, of course, if you like this video and all of this cool AI infrastructure and you just want to look for a job, there is a career page that you can definitely check out and see if, you know, something there piques your interest because this seems like the kind of project that would be awesome to work on. And hey, if you did like this video of the Colossus Supercomputer powered by Supermicro servers, well, why don't you share it with your friends and colleagues? But also, give this video a like, click subscribe, and turn on those notifications so you can see whenever we come out with great new videos. As always, thanks for watching and have an awesome day!
