Keynote: AI Data Center Networking: Lessons from Meta's Evolution

[00:00:00] I would like to introduce Omar Baldonado, who will be presenting AI Data Center Networking: [00:00:12] Lessons from Meta's Evolution. Omar leads the groups that develop and operate Meta's global [00:00:18] data center networks. Omar has been in networking since the early 1990s. I was not even born. [00:00:27] And it's a pleasure to have him speaking with us today. Welcome, Omar.

[00:00:35] Thanks. Hi, everyone. Excited to be here. And this is a really exciting event for me personally, [00:00:49] because, dating me to the early 90s, right, I've been in networking and seen a lot of the [00:00:55] growth, as many of you all have. And it's particularly an honor to represent a lot of the work that our [00:01:03] teams have been doing. And hopefully, you know, here at the end, we can engage in some questions. I really [00:01:08] want to get some feedback and get into it here at the end. So I'm going to go through some of [00:01:14] our journey so far. As we said, I'm part of the infrastructure team at Meta. We run the network [00:01:24] infrastructure. And in particular, I focus on the data center. There's the backbone network. [00:01:31] Really, we're a global network overall. And we'll see, increasingly, that is all in service of our AI needs. [00:01:40] So let's go a little bit into the AI context. It's not too much of a surprise, but I'm going to go over it [00:01:45] here, especially from a Meta perspective, so you get a sense, right? This is our current number of [00:01:51] folks that actively use some sort of Meta product, whether WhatsApp or even Meta AI, Instagram: [00:02:00] three and a half billion users daily. And the number of messages that go across is staggering, right? [00:02:06] So if you look at 3.4 billion dailies, what does that mean from a network perspective, [00:02:13] right? This is not an AI flow. This is a regular sort of web flow. But pretty much every arrow is a [00:02:20] network connection, right? 
So that three and a half billion comes in, and it just explodes out into lots [00:02:28] of communication all across the infrastructure, which I think network operators and folks tied to networks [00:02:36] would know. So, related to AI, Meta AI itself now has more than a billion monthly active users, [00:02:46] right? Whether it's within the core products or the wearables, right? Like, I've got the new display [00:02:55] glasses here. I'm going to take them off in a little bit, right? Because this is tied in. I can see messages [00:03:01] from my son or my family right now. And I'm not going to answer them during the talk. But you know, [00:03:06] just to give you a sense, more and more, I depend upon these. You know, I go on a walk with my [00:03:12] dog in the morning. This is one of my primary use cases. And I get a [00:03:17] personalized podcast. I can say, tell me about some topic. And I can go for a few minutes, [00:03:24] and it tells me, and then I interrupt it. And I say, wait a minute, tell me a little bit more about [00:03:29] that. And all of a sudden, my half hour walk with my dog is all done because I've just been [00:03:33] learning and doing all this. So more and more, AI will be pervasive, not just within the apps [00:03:40] on our phones, but in everything that we do. So, again, this is just context setting, [00:03:49] just to give everyone here a sense of just how fast everything in AI has moved, right? So the Turing [00:03:55] test, which I think everyone knows about, we have well answered that, [00:04:01] I think, now: can a computer actually fool you into thinking it's a human? I think that happens all the time [00:04:08] now. But if you look at how quickly, or not quickly, actually, it took a long time here [00:04:14] to get from that test to, say, Deep Blue beating the world champion in chess, right? 50 years. 
[00:04:25] Then from Deep Blue, end of the 90s, up until the launch of ChatGPT, that was 25 years. And [00:04:32] you can see a lot of the advances in AI technology happened in the last couple of decades. [00:04:39] Since that launch, there's been this explosion for the last three years in models, in workloads, [00:04:47] and it's really caused the takeoff, right? So from 50 years to 25 years to the last three years, [00:04:54] and here we are at the beginning of '26, right? So that's the context. The rest of the time I'm going to spend [00:05:00] talking about why I believe this is a great time to be a network engineer. So I'm just going to take [00:05:06] these off now so that I don't get the messages. So the first part is, there is so much technology [00:05:17] for all of us to learn, right? It is an end-to-end problem. It's a full-stack problem. [00:05:26] So I'm going to go here into the physical infrastructure, again, using Meta as an example. [00:05:32] Just to pause a little bit here. This is NANOG. So I'm assuming there are a lot of network [00:05:38] operators here. Can you raise your hand? Who has actively either run a network in [00:05:44] the past or is currently running a network? Okay. It's good. This is actually, [00:05:51] when we go to other venues, this is like the most dense operator group, [00:05:57] you know, in terms of number of hands. Other folks here, I'm guessing, are vendors, [00:06:02] is that right? Let's see. So now, yeah, different. One of the big takeaways, [00:06:10] lessons here, is that the operators need to work super closely with the vendors, more so than ever [00:06:18] with AI, because of this full-stack, end-to-end problem, right? So let's go over it. So if I start at the [00:06:27] global level, right, Meta's infrastructure has thousands of locations in a couple hundred countries [00:06:34] around the world. This is for our caching and edge presence. 
Then, as a global ISP, [00:06:43] we have hundreds of thousands of kilometers of subsea fiber as well as terrestrial fiber. [00:06:50] One of the things that recently closed that we're very proud of is the 2Africa project, [00:06:55] which we announced several years ago; we completed the core infrastructure of it at the end of last year. And so [00:07:03] this connects dozens of countries together. It brings a lot more bandwidth to really 40% of the world's [00:07:11] population. Again, we build this really for the internet infrastructure as a whole, [00:07:18] and a lot of it is to better serve AI, better serve the users, and just give them the bandwidth they need. [00:07:28] And then again, last part on the global infrastructure perspective, [00:07:32] we've now announced this Waterworth cable, which is even longer than 2Africa. [00:07:37] It will be the longest subsea cable in the world, connecting pretty much all the major continents [00:07:44] and, again, providing diverse paths and more connectivity to all of the world [00:07:52] that's using AI and using all of our applications. Okay. So that's the global perspective. [00:08:00] Now, if I zoom in a bit to the regional perspective, right, these are the sizes of the [00:08:07] clusters that we've been working on at Meta that are dedicated to AI, and in particular to [00:08:14] large-scale AI training, which I'll get into a little bit in terms of how that's been changing as well. [00:08:21] You can see here, 2020 to 2023, we spent a lot of time just building up [00:08:30] the clusters, up until these 24K clusters. The 24K clusters are what we trained an earlier [00:08:38] version of Llama on, and we had two 24K clusters. Then after that, we built the 129K GPU cluster, [00:08:46] and that was used for initial versions of Llama 4. We built that in 2024. 
And then you can see how this [00:08:53] is progressing. We have dozens of data center regions around the world. Most of them are in [00:09:00] North America, but then you can also see an expanding presence there. One of the things with AI [00:09:09] that we've talked about is the amount of power that's required. And we announced two of our [00:09:14] multi-gigawatt locations, Prometheus and Hyperion. And just recently we announced the energy [00:09:23] partnerships that we've been working on to support these large-scale build-outs. [00:09:29] Right. So we're very excited about the various forms of energy that [00:09:34] we're investing in, in particular some of the clean energy, the nuclear energy, that we're doing [00:09:39] here. Now a little bit more on the networking in these regions. When we look at something [00:09:47] like our gigawatt clusters, it's easiest if it's a greenfield, right? You're going to build a [00:09:55] huge campus with lots of buildings, all networked together. But some of these, you know, as we all [00:10:00] know, are not greenfield. Most of the time it's not greenfield, it's brownfield. So you have to go [00:10:04] into a place with different locations and then connect them up. So this is the trenching that's going [00:10:11] on of outside plant fiber in our Prometheus cluster. And there are millions of [00:10:18] kilometers of fiber going in here, just because of the density of what we're trying to connect between the [00:10:22] buildings. Right. And in fact, another recent announcement that we went through was, just last [00:10:28] week, I think, we announced this deal with Corning in terms of securing the amount of fiber that we [00:10:35] have to work with. Right. 
I'll get to the lessons a little bit more coming up, but hopefully this [00:10:42] gives you a sense, you know, we went from global to regional. Global obviously [00:10:49] has its realities of geography and distances that you have to deal with as a network engineer. [00:10:56] Regional, you have to deal with things like space and power and fiber, especially at the higher [00:11:03] densities. Now, zooming in inside of a building, this is hopefully a little bit more [00:11:13] familiar to you all in terms of when you build fabrics inside the buildings. We're going to talk about [00:11:19] scale-out fabrics inside the buildings. And that's where a lot of the GPUs connect in [00:11:26] and need to have their very high bandwidth, high performance networking. [00:11:32] And so we've built a couple of different generations of fabrics. We've shared these [00:11:35] in other venues like Open Compute. This particular one is something [00:11:42] called the Disaggregated Scheduled Fabric. And essentially, if you've worked with [00:11:49] chassis before, router backplanes and router chassis, we took the scheduling that happens [00:11:55] between those cards in a chassis, took that algorithm, and ran it across really a whole building [00:12:02] in order to get all the GPUs hooked up effectively to what is a giant 20,000-port router. [00:12:09] One of the advantages of this is that it handles a lot of different types of accelerators [00:12:18] and NICs. And then you can compose these into even larger cluster sizes. [00:12:26] More recently, we announced a non-scheduled fabric. This requires more tuning [00:12:32] because you don't have the scheduling; with DSF, you just assume router backplane scheduling works. 
[00:12:38] But now, when you're not scheduling it, you have to be much more aware of the whole network [00:12:44] problem and how you're tuning from the NICs and the applications through the fabrics, [00:12:49] and making sure that works. This is actually what we've tuned to handle some of our largest AI cluster builds. [00:12:55] So that's a little bit about the scale-out fabrics, with more on its way. You know, [00:13:01] there's so much networking going on. [00:13:06] People have seen liquid-cooled AI systems. It's big enough that now we have liquid-cooled network systems. [00:13:12] So again, so much technology to be exposed to and learn. [00:13:19] All right. So we went from the global to the regional to the building. Now let's get inside the rack. [00:13:25] Right. The rack is where, from an AI perspective, the AI systems run. Most recently, as you've been tracking, [00:13:36] there are the NVIDIA GB200 and GB300 systems that are rack scale, and more coming from AMD and other companies [00:13:45] that are building really rack-scale systems. There is a lot of networking in here. You can just see the cables. [00:13:54] Actually, sometimes pictures show up and they don't show the cables. I personally [00:14:00] take offense to that. I'm like, wait, you've got to show the networking. You've got to show the purple or [00:14:05] the blue or the black. You've got to show it all. Right. Because that's where the networking is. [00:14:11] So there's a lot of networking in these rack-scale systems. It's the domain of [00:14:17] scale-up fabrics, right? Very, very high performance. They're really at the leading edge of [00:14:25] what the optics or the SerDes can do, or just generally the bandwidth that can [00:14:30] be driven by GPUs. 
This is very high performance, high density. Some of our work recently there: [00:14:37] you know, we worked very closely on the NVLink technologies, but then we're also working [00:14:44] on using Ethernet for that scale-up fabric inside a rack. All right. So zooming in even [00:14:52] further, let's talk about the blades or the silicon. And again, hopefully you all get the sense: networking [00:14:59] is in all of these places. This is part of what is so exciting. Networking down at the chip [00:15:04] layer, right? Sometimes people call it the scale-in network. But how fast are the chips [00:15:13] able to talk to each other over the die-to-die interconnect? What is the speed between the chip and memory, right? Do I [00:15:20] disaggregate the memory, and what's the latency of that, right? So the design of the chips themselves [00:15:28] is very, very dependent on networking. So this is our own Meta roadmap for our chips, [00:15:42] training and inference and, you know, different types of accelerators there. [00:15:47] So that brought us all the way through, from the global to the ASIC level. Just [00:15:54] one lesson here. It looks clean, like, oh, things get bigger, [00:16:02] and this is just how things progress, double, double, double, double. I'll tell you, one of the [00:16:06] learnings, as I was talking to our teams: they said it is not linear. There are so many [00:16:14] restarts, pivots. Wait, we thought we were going to go with this, go with that. We try this, [00:16:22] try that for a little bit, and then we move into something else. 
I don't think it's [00:16:27] quite as chaotic as it was in 2020 to '23. You know, from a controller and routing and ECMP [00:16:36] design perspective, we must have run through half a dozen versions of topologies in that 2020 to 2023 [00:16:44] period. It's a little bit more stable now, but still, there's a lot to keep up with, because [00:16:50] what drives the innovation, what drives the cycle now, is how fast you can keep up [00:16:58] with the GPUs or the accelerators. They're pushing to put new product out every year. Multiple companies [00:17:05] are doing that. And so how do you keep track of that? That's why you want to stay plugged [00:17:12] into the AI roadmaps and then find out how networking matches that. And that's again one of [00:17:20] the reasons why it's a great time to be a network engineer, because it's tied to this [00:17:27] whole stack of technology. So if I look at that stack, right, this was actually something that we went [00:17:32] through just from a scale-numbers perspective. It was boggling, right? We talk about chips [00:17:40] at the nanometer scale, but then we're also talking about millions of kilometers. And I don't know, [00:17:47] no one talks about what a million kilometers is. It's not a megameter; it's a gigameter. [00:17:54] We have gigameters, but we don't talk about gigameters. Still, we're talking about [00:17:58] a huge range of distances. We talk about power: power optimization [00:18:08] within the chip, but also within the building, is an important constraint for infrastructure designers, [00:18:16] and the network is a part of that infrastructure. We have to understand [00:18:20] what power is being taken up by the switches, the optics or the connections, the NICs; [00:18:27] they all factor into this power. 
So again, going from chip level, say, picojoules per bit, [00:18:34] all the way up to the gigawatts. And then finally, from a performance perspective, [00:18:39] it isn't so unreasonable for us to look at getting to zettaflops or even yottaflops. I had [00:18:46] to go look it up. I was like, wait, let me make sure. But if you look at an H100 GPU, roughly, [00:18:54] depending upon the precision, you know, FP4, FP8, you're talking about a petaflop. [00:19:01] If you have a million H100s, that takes you to exa- and zettaflop, right? And then with each generation of [00:19:10] improvement, let's say, you know, between H100 and GB200, there was an order of magnitude improvement. [00:19:16] So then you look at the chips themselves, let's say over three generations, getting three orders [00:19:23] of magnitude faster. And then you're into this yottaflop regime, right? So again, networking folks [00:19:33] have to embrace all this. This is amazing, right? So that's the first lesson and why I think [00:19:38] it's exciting. There is so much for all of us to learn together, right? One piece of advice: [00:19:45] don't learn just one layer or one component, right? You know, I'm the fabric person that really [00:19:52] knows this fabric, or I'm the optics person, I really know optics, you know, FR4, ZR4, DR4. [00:19:59] Because these all have to work together, it behooves all of us, all of you, to understand [00:20:10] different parts of the stack, end to end, and up and down. [00:20:16] That gets into the software side also. I haven't gone over the software yet. I'll go into [00:20:20] that. This is the second reason why I think it's great to be a network [00:20:25] engineer. Network engineers have always been the ones that pull it together [00:20:32] into an end-to-end solution. 
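The compute-scale arithmetic from a moment ago (about a petaflop per GPU, a million GPUs, then three more orders of magnitude from newer chip generations) can be checked with a few lines of Python. This is only a sketch using the talk's round numbers; the function names `fleet_exponent` and `describe` are made up for illustration, and none of the figures are measured specifications.

```python
# Orders-of-magnitude check using the round numbers quoted in the talk.
# Values are powers of ten of FLOP/s; nothing here is a measured spec.

SI_PREFIX = {15: "peta", 18: "exa", 21: "zetta", 24: "yotta"}


def fleet_exponent(per_gpu_exp: int, gpu_count_exp: int, generation_gain_exp: int = 0) -> int:
    """Multiplying powers of ten adds their exponents."""
    return per_gpu_exp + gpu_count_exp + generation_gain_exp


def describe(exp: int) -> str:
    """Render an exponent with its SI prefix, e.g. 21 -> zettaflops."""
    return f"10^{exp} FLOP/s ({SI_PREFIX[exp]}flops)"


if __name__ == "__main__":
    # ~1 petaflop per GPU (10^15) times one million GPUs (10^6)
    today = fleet_exponent(15, 6)
    # the same fleet after three generations at ~10x each (10^3)
    later = fleet_exponent(15, 6, 3)
    print(describe(today))
    print(describe(later))
```

Working in exponents rather than floating-point values keeps the check exact: 15 + 6 = 21 is the zettaflop range, and adding 3 more lands at 24, the yottaflop range mentioned in the talk.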
So if I look back, and again, I'm dating myself, [00:20:41] I remember when one of the network engineers I worked with in the nineties, Sean Finn, [00:20:50] came and said, like, I've got this Network General Sniffer. [00:20:55] I'm sort of curious who's used the Network General Sniffer. It should be a little smaller [00:21:00] group. I've got this Network General Sniffer and there's this port 80 stuff. Thank you for raising [00:21:05] your hands. So there's this port 80 stuff going on. You know, it's showing up. [00:21:11] We've got to learn about it. Okay. So we started: okay, the web. Okay. What's the sort of technology [00:21:16] going on there? So as network engineers, we had to kind of ask, hey, what's the application [00:21:22] trying to do there? And then of course the nineties saw, okay, now we're doing online, [00:21:27] you know, e-trading. We're doing all this stuff on the web. So we learned about that application, [00:21:36] right? Look at the next one. Streaming, right? I remember working with folks, early 2000s, [00:21:46] trying to understand this MP3 streaming. There was a company, RealNetworks. [00:21:51] Well, this is pre-YouTube, right? Pre-YouTube. And we were like, okay, what is that? Right. And so then [00:21:57] we all started learning about playback buffers and jitter sensitivity and loss, right? So to make [00:22:03] the network work for the application, yeah, we all learned about that, right? Why was jitter [00:22:11] important? And we all felt it at the time when our videos would freeze. We still feel it, by the way; [00:22:18] it's always the network. But that's when we really learned about this sort of real-time video and [00:22:25] real-time streaming and the performance that the network can provide and how it impacts the application. [00:22:32] So here we are today. 
I mean, I can go through the different technologies. Mobile phones, iPhone, [00:22:36] you know, that also had implications for network engineering and design. [00:22:43] But now here we are with AI, and again, just like we did it for the port 80 [00:22:50] that we were seeing on the sniffers, just like we did it for streaming, we have to bring these [00:22:56] various components together and make it work for AI, right? [00:23:02] The role of networking for AI seems so obvious. [00:23:12] But when a lot of folks think about AI models, in a lot of the research, just [00:23:18] for ease, people normally think of having one GPU. [00:23:23] They can train a model on one GPU, they can serve a model on one GPU. [00:23:31] And that's where a lot of people have started, because most people don't have access to very large [00:23:36] GPU clusters, right? We throw the numbers up, 128 GPUs, you know, those are big numbers. Most [00:23:42] people don't have access to that many GPUs. But once you start leaving one GPU, you hit the network. [00:23:49] Now, a lot of it is hidden inside the system. Say, with the H100 systems, [00:23:55] you had eight GPUs, and that was kind of hidden in there. But once you start breaking out into [00:24:01] the rack, that is where network engineers play. And that's where we're critical. [00:24:06] Let's go over a little bit about the software. Again, I'm not going to go into [00:24:12] too much detail here, but I'm trying to give you the same analogy. Just like we learned [00:24:16] about jitter and playback buffers, just like we learned about, you know, how many turns does it [00:24:22] take to do an SSL authentication? 
And can I reduce the number of turns so that, you know, [00:24:27] I have faster web connectivity? Learning about the AI application and how it's trying to leverage [00:24:34] the network helps us do a better job of supporting the application. So, one of the techniques I mentioned: [00:24:42] when you talk about multiple GPUs and how they're going to talk to each other, [00:24:46] there are different ways that the model designers [00:24:50] think of doing that, right? You know, if you want to do [00:24:57] one row of a matrix multiplication, you don't want to do that over the WAN, [00:25:01] because there are so many of those matrix multiplies. You want to do that [00:25:08] as fast as possible; ideally you don't hit any of the fabrics. And that's [00:25:14] called tensor parallel, right? And so when you work with model designers, and this is something [00:25:21] that a lot of our network engineers learned over the last few years, the model designers [00:25:27] are used to working on one GPU or working in one system. I mean, this was four or five [00:25:32] years ago; now I think most model designers know, hey, there's a network. But they want [00:25:39] to understand, they need to know, and we can engage with them: hey, you know, this type of operation [00:25:44] you're trying to do, let's say a fundamental matrix multiply, there is the speed of light. [00:25:49] I shouldn't do that over the WAN, so let's do this this way, in this part of the network. [00:25:53] And then the other parallelisms that the models are trying to do, [00:25:58] let's say data parallel, that's basically sharding of [00:26:03] the data that the model is trained on. 
That I can do over longer distances, [00:26:08] across scale-out or even the WAN, you know, from a fabric perspective. [00:26:14] So it's understanding and being the representative of the network, the capabilities of the network: [00:26:19] this is the type of networking I can do within a rack, within a building, between buildings, [00:26:26] and let's work together on how to make that scale and work well for models. That's the role of a [00:26:32] networking engineer. Not all models are alike, right? [00:26:42] Not all models are alike, that's right, but also not all AI workloads are the same. [00:26:48] So you look at training: so many flavors of training. Training for a large [00:26:55] language model is different than training for ranking and recommendation, R&R, right? [00:27:03] In training, there's also the world of pre-training, large foundational model training, and then there's [00:27:08] post-training, where you're really trying to optimize the model on a particular use case, [00:27:14] like coding or image recognition. And then inference itself is somewhat dependent on the [00:27:21] model, right? If the model you're trying to serve up is so big that it [00:27:28] doesn't fit on one GPU system, you may want to break it up, right? So now one stage of inference, prefill, [00:27:36] may be on one set of servers, and then the decode part of inference is on a different set of servers. [00:27:43] Now, there's a lot here that you may not know, depending upon your exposure to all this. [00:27:52] It takes some time to understand the different models and workloads and requirements, [00:27:57] but again, this is something that we have to do, because the network is fundamental in this. 
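To make the tensor-parallel versus data-parallel distinction from the preceding passage concrete, here is a toy sketch in plain Python. It is an illustration under simplifying assumptions, not how any production framework implements these schemes: "devices" and "workers" are just list slices, the weight matrix is assumed to split evenly by columns, data shards are assumed to be equal in size, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix multiply, used as the single-device reference."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split W into equal column blocks (assumes the column count divides evenly)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x: Matrix, w: Matrix, devices: int) -> Matrix:
    """Each device multiplies X by its column block of W; the blocks are then
    concatenated. That gather happens on every layer, so it wants the
    lowest-latency part of the network."""
    partials = [matmul(x, shard) for shard in split_columns(w, devices)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def grad_mse(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of mean squared error for the scalar model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w: float, xs: List[float], ys: List[float], workers: int) -> float:
    """Each worker computes a gradient on its own data shard; one averaging step
    (an all-reduce) per training step combines them, which tolerates longer
    distances better than per-layer traffic does."""
    n = len(xs) // workers
    local = [grad_mse(w, xs[i * n:(i + 1) * n], ys[i * n:(i + 1) * n]) for i in range(workers)]
    return sum(local) / workers
```

In both cases the distributed result matches the single-device result; what differs is how often the pieces have to be exchanged, which is the property that decides where in the network each scheme can reasonably run.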
[00:28:04] If I look at the dimensions right here in this graph, I haven't actually explained the [00:28:08] graph here: it's, for the different types of workloads, what is most important to them? [00:28:16] So prefill, you can see the orange, is very compute intensive, right? Just a quick word on [00:28:24] prefill and decode. Prefill is like, I'm trying to think of what I'm going to respond, so I take some [00:28:28] time to think. Then decode is, I'm going to tell you what I just thought about, right? So when you [00:28:33] ask somebody a question, they may sit there a little bit, think, and then they tell you the [00:28:40] answer, and hopefully they're not pausing in between the answer, they're just [00:28:43] letting it come out, right? So the prefill part of inference [00:28:48] is very compute intensive. For decode, which is the red, you can see network latency [00:28:56] sensitivity, right? Because you don't want the jitter, you don't want the pauses in the [00:29:02] speaking, in the response being served up to you. You want it to be just very smooth and coming out [00:29:07] as quickly as possible, right? So this is the translation. Sometimes people talk about [00:29:11] time to first token as a model evaluation criterion. How quickly you can get that first token [00:29:20] is really a compute problem, or a prefill problem, and then how you serve that up [00:29:26] is a decode problem. And again, this reminds me of all those other times, [00:29:32] like, oh my gosh, I've got to learn about streaming, I've got to learn about [00:29:36] web. This is what you get to learn now as a network engineer. 
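The prefill and decode split described above can be turned into a very rough latency model: time to first token is dominated by prefill, and the smoothness of the rest of the response depends on the per-token decode step plus any network hop in that loop. The sketch below is a simplification for illustration only; the class name, field names, and the sample numbers are assumptions, not measurements from any real serving system.

```python
from dataclasses import dataclass


@dataclass
class ServingProfile:
    prefill_s: float            # time to process the prompt (compute-heavy phase)
    decode_step_s: float        # compute time per generated token
    network_hop_s: float = 0.0  # hypothetical network latency added to each decode step


def time_to_first_token(p: ServingProfile) -> float:
    """Simplification: the first token appears when prefill finishes."""
    return p.prefill_s


def inter_token_latency(p: ServingProfile) -> float:
    """Every decode step pays its compute time plus any network latency."""
    return p.decode_step_s + p.network_hop_s


def response_time(p: ServingProfile, tokens: int) -> float:
    """Total time to stream `tokens` tokens: prefill, then one decode step per extra token."""
    if tokens < 1:
        raise ValueError("tokens must be at least 1")
    return time_to_first_token(p) + (tokens - 1) * inter_token_latency(p)


if __name__ == "__main__":
    # Illustrative numbers only.
    profile = ServingProfile(prefill_s=0.5, decode_step_s=0.02, network_hop_s=0.005)
    print(f"time to first token: {time_to_first_token(profile):.3f} s")
    print(f"inter-token latency: {inter_token_latency(profile):.3f} s")
    print(f"101-token response:  {response_time(profile, 101):.3f} s")
```

Because the network term is multiplied by the number of generated tokens, a small per-step latency shows up as visible pauses in a long response, which is one way to read the decode phase's latency sensitivity.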
[00:29:42] And there's a lot more that that graph doesn't have; I just threw up a couple of models. [00:29:46] There's mixture of experts, which implies a different sort of parallelism that you can do, [00:29:51] an expert parallelism, on the infrastructure. There's reinforcement learning. Reinforcement [00:29:57] learning has a whole bunch of servers talking to each other, and it is more like a traditional [00:30:02] distributed system, right? There's even work that's going on at the software layer to [00:30:11] make sure that the traffic between the GPU and CPU is minimized, right? So in a traditional AI system, [00:30:21] there's a lot of coordination between the CPU and GPU, because the CPU kind of does all the planning, [00:30:27] and then it sets it up for the GPU to just go compute very fast. But it introduces [00:30:33] extra turns between the CPU and the GPU. And reducing the number of, you know, turns, [00:30:41] right, reducing that communication and just having it be very GPU-centric, increases efficiency, [00:30:47] right? And again, I alluded to it earlier. It reminds me of when we were trying to do some of [00:30:52] the SSL handshake work in networking. You know, this is now internet [00:31:01] distances and latencies. If you're in a part of Asia and you're connecting to a data center [00:31:08] across an ocean, every round trip in a handshake for, let's say, SSL is expensive. So you want to reduce [00:31:15] those turns. Again, network engineers are familiar with that. This is the same problem, right? [00:31:22] I mean, it's happening in a much smaller context and it's happening, you know, [00:31:28] millions of times during a workload, but it's trying to optimize the CPU-to-GPU communication, [00:31:35] or minimize it, right? 
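The cost of "turns" mentioned above is simple to estimate: each round trip costs at least the propagation delay there and back, and a handshake pays it once per turn. The sketch below assumes light in fiber travels at roughly 200,000 km/s (about two-thirds of its speed in vacuum) and ignores queuing, processing, and indirect cable routes; the function names and the example distance and turn counts are illustrative assumptions.

```python
# Approximate propagation speed of light in optical fiber, in km per second.
FIBER_KM_PER_S = 200_000.0


def round_trip_s(distance_km: float) -> float:
    """Minimum round-trip time over fiber for a one-way distance in km."""
    return 2.0 * distance_km / FIBER_KM_PER_S


def exchange_time_s(distance_km: float, turns: int) -> float:
    """Time spent waiting on the network for an exchange that needs `turns` round trips."""
    return turns * round_trip_s(distance_km)


if __name__ == "__main__":
    # Illustrative case: a client 10,000 km from the data center.
    for turns in (3, 2, 1):
        ms = exchange_time_s(10_000, turns) * 1000
        print(f"{turns} turn(s): {ms:.0f} ms")
```

The same multiplication applies at much smaller scale: a per-turn cost of microseconds between CPU and GPU still adds up when it is paid millions of times during a workload.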
Again, there are ways to get involved here at the [00:31:41] software layer. We announced some of our library work with PyTorch to provide a more scalable and [00:31:49] flexible framework for this GPU-to-GPU communication. And actually we open sourced a lot of it. It's one of [00:31:55] the things I'm really proud of, and I think it excites the team to be able to share [00:32:00] our work. We've shared how we've made RoCE scale over the last several years. We're sharing [00:32:09] how that works at the software layer, and we've even open sourced some of the code in terms of what [00:32:16] needs to happen to make a hundred-thousand-plus GPU cluster work, right? Last thing in terms of [00:32:23] some of the software, then I'll get to this integration part. I just wanted to [00:32:30] map an LLM, because there are a lot of different pieces we're talking about. What does it mean for an LLM [00:32:34] to work on the infrastructure? Right? So just trying to break it down, right? Pre- and post-training [00:32:39] of Llama 4 happened on the 129,000-GPU cluster. And now it's happening on GB200 clusters in our [00:32:48] gigawatt cluster. And so that pre- and post-training happens. There's a lot of capacity there. It's all [00:32:54] running on Ethernet RoCE. But again, that's the training part, different stages of training. Inference [00:33:02] happens because we have this billion active users for ourselves, but everyone's trying to serve it up [00:33:07] as quickly and as broadly as possible. Inference happens on not 129,000 machines; you know, hopefully [00:33:14] it doesn't require that. Inference happens on a small number of machines, either one machine or a couple of [00:33:19] machines. That's where the numbers come from, right? Training a [00:33:25] model may take 120[,000], a million GPUs, and they run for a month. 
[00:33:32] Um, but then you want to be serving a billion users just as fast as possible. And that's like a web, [00:33:38] a web-scale latency, uh, workload that you really want to kind of watch. Um, and that's kind of the [00:33:46] serving inference part of it. And we use different, what's good for large-scale training may not be good [00:33:53] for inference. Um, but you want to kind of have fungibility. So you're not totally committing your [00:33:59] infrastructure to just one, uh, this chip or this cluster is only good for this, but, [00:34:04] but you have to then be aware of what this cluster can do. Okay. [00:34:10] So networking pulls it together, right? Network engineers pull it together. [00:34:15] The technologies have to work. So you have to understand the performance of the application. [00:34:20] Then operationally you have to debug it, right? You have to run it. Um, [00:34:25] "Is it the network?" Common question, right? If you look at, again, [00:34:30] falling back to the model developers, they'll say like, look, when I'm programming, [00:34:37] there's a very clear fork in the PyTorch, in the model code. Am I talking to a local GPU? [00:34:43] Or am I talking not to a local GPU? That's it, right? And so anything that happens when talking [00:34:49] to a non-local GPU looks like a network problem. So it's like, I'm talking to my local GPU, all [00:34:56] hundred thousand or anything else that's happening looks like network. So we get this question a lot, [00:35:01] "But is it the network?" And we have to debug the components. There are millions of components [00:35:06] in one of these large clusters, billions across the fleet. We have to work on auto-diagnosis, [00:35:12] um, correlation techniques. So, and again, this is because of the application knowledge. [00:35:18] That's what lets us do that, right? 
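The "local GPU or not" fork described above can be sketched as a rank-to-host mapping. This is a minimal illustration only: it assumes ranks are numbered contiguously per host with a fixed number of GPUs per host (8 here), and the function names are hypothetical. It is not Meta's code or a PyTorch API.

```python
def host_of(rank: int, gpus_per_host: int = 8) -> int:
    """Index of the host that owns a rank, assuming contiguous per-host numbering."""
    return rank // gpus_per_host


def is_local_peer(my_rank: int, peer_rank: int, gpus_per_host: int = 8) -> bool:
    """True when both ranks sit on the same host, so traffic stays off the network."""
    return host_of(my_rank, gpus_per_host) == host_of(peer_rank, gpus_per_host)


def path_kind(my_rank: int, peer_rank: int, gpus_per_host: int = 8) -> str:
    """The two-way fork from the model developer's point of view."""
    return "local" if is_local_peer(my_rank, peer_rank, gpus_per_host) else "network"


if __name__ == "__main__":
    for peer in (3, 7, 8, 1000):
        print(f"rank 0 -> rank {peer}: {path_kind(0, peer)}")
```

Everything in the "network" branch, whatever its real cause, is what tends to get reported to the network team, which is why the talk stresses auto-diagnosis and correlation.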
Network engineers also, again, [00:35:26] this slide is just to emphasize the non-greenfield nature of the problem. In this picture, [00:35:34] there are four generations of buildings that are coming together into our gigawatt cluster, [00:35:40] right? The H-style buildings are, uh, an older generation. The ones at the bottom [00:35:47] are liquid-cooled facilities. So all those little dots at the bottom are, uh, cooling units that [00:35:54] cool the buildings. So there's like closed-loop cooling that goes out. And then there's one of our sprung [00:36:00] structures in here, kind of next to the middle H. There's a tent, uh, which we [00:36:06] had to stand up super quickly. Uh, and then the buildings at the far right are, uh, not our buildings, [00:36:13] but ones that we've leased and have put our, uh, infrastructure inside. [00:36:20] Heterogeneous. So when we talk about the Prometheus cluster, it is not purpose-built from [00:36:26] the ground up. It has multiple generations. And again, the networking is different in each [00:36:30] one of these. And then we're trying to network them all together. Is it a really good problem for [00:36:35] us? No, it's a challenging problem. Uh, it's an intense problem. Okay. Last thing. Why, again, [00:36:44] do I think it's a great time? We can go faster with AI ourselves. I've been talking all this time about [00:36:49] building the network for AI; this is now the AI for networking, right? How can we all go faster? [00:36:57] So at previous NANOG talks, you know, some of my Meta colleagues have talked [00:37:02] about how we automated away our NOC. We don't have a NOC. Um, we figured out the thresholds and rules. [00:37:08] We wrote code to detect anomalies, take automatic correction, right? 
We've also automated some of [00:37:16] the config generation, you know, through intent-based networking. And then, um, there's a whole system [00:37:22] to push and go update. But in both of those cases, we had to write a lot of software. And so network [00:37:29] engineers were paired up with software developer teams. Now, the network engineers wrote a lot of [00:37:35] software. Perl, Python, Go, C++, everyone's writing a lot of software, right? But for certain systems you had [00:37:43] to pair up, uh, and there was a dependency on, um, software engineers. Even if I wanted a dashboard, [00:37:50] I had to go talk to a React UI person to kind of build me a cool dashboard that I want, right? But in [00:37:56] this AI era, companies can quickly develop the software that we need. [00:38:02] I've talked to a number of, uh, operating companies to say, how do we kind of go faster with [00:38:07] software development? And one of the blockers in the past has been, we don't have a large software team. [00:38:14] You know, you don't need a large software team for a lot of this. Some of this you do, [00:38:17] but pair up network engineers, or network engineers can just go use the technologies themselves, go use, [00:38:26] um, uh, an agent to help you. Work with software engineers who are also using those same agents, [00:38:33] um, and then just work through it. A lot of what I hear, sometimes somebody will go like, [00:38:39] you know, I asked the AI to generate a fabric and it didn't look good. [00:38:45] I know. Okay. Not ready, right? Maybe it's ready for that coding stuff or it's ready for whatever image [00:38:52] generation, but it's not ready to generate my fabric definition, right? I get it. [00:38:58] There's probably not a lot of training for network fabric generation, but work with it. 
Explain the rules [00:39:04] that you have for how you want to write your fabric, explain the rules for how you want to fix things, [00:39:11] and automatically remediate things, and kind of iterate with it. You can't one-shot it. Anyone who tries to one-shot a prompt [00:39:17] with a, with an LLM, you, you're almost always frustrated by that. But if you work with it, [00:39:23] you can, you can really get a lot of good results. And this is something that we've been doing internally, [00:39:28] um, haven't been, we can't publish that, some of that work yet, [00:39:33] but it is definitely accelerating us. So, um, that's, that's, uh, that sums it up, right? I think [00:39:42] it is a great time to be a network engineer. Lots of, lots of technology. It leverages what we've been [00:39:49] doing, kind of our, the, the natives, the, the, the skills that we have of pulling together solutions, [00:39:54] and we can supercharge ourselves with AI. So, um, I think I've got a few minutes. Thank you very much [00:39:59] for listening, and hopefully we have a few minutes for questions. [00:40:07] We've got mics, if you've got questions, or you just yell it out, and then I'll repeat it, [00:40:13] if anyone has a question. [00:40:18] Okay, thanks. Hi, Ido Kadeem from CPacket. Um, you're from one of the few companies that really [00:40:26] run inference at huge scale. And, uh, you know, there's a lot of discussion about the design of [00:40:31] clusters. But I'm really interested in how you see the network at the edge of the clusters, [00:40:39] where all of these inference requests and needs come from all different places with all different [00:40:44] characteristics that must have a unique network implication as well. Yeah, yeah. So I think that [00:40:51] is, that is something that is, uh, that is coming, right? 
Because when you look at that global [00:40:58] infrastructure we have, and that many companies have, right, you want to move as [00:41:03] much of it as close to the end user as possible. That's why we have caches all over the world, [00:41:08] we have POPs all over the world, and over time more and more of that edge serving will push [00:41:15] out to the edge, will get closer to the user. In some cases we look at it, can we do it on the [00:41:19] phone, right? But then if you can't do it on the phone, can you do that in a localized compute [00:41:24] facility geographically? And that is definitely, um, that's definitely coming. And it is because of a [00:41:31] fundamental networking problem there. Better to serve it up locally if possible. There is a [00:41:37] lot of complication there too. Um, different, uh, different countries have different rules about what [00:41:43] data is hosted where, that sort of thing. But, um, it is a networking problem that we'll get to, [00:41:49] and that we're working on. Oh, sorry. Yeah. We'll round-robin here. Uh, yeah, [00:41:56] actually I have a question from remote. Uh, so this is from Daryl Newcomb, who's, uh, okay. Perfect. [00:42:01] Unaffiliated. Uh, the question is, if I understood correctly, uh, with your description of DSF [00:42:06] as being straightforward and performant, what are the top primary drivers for having used non-DSF in your [00:42:12] largest footprint of clusters in a single region? Ah, okay. Good. Um, so just to recap there. So I [00:42:18] mentioned a couple of scale-out fabric technologies, DSF, uh, disaggregated scheduled [00:42:24] fabric, and a non-scheduled fabric, right? Um, the trade-offs, I think that was the question. [00:42:32] Disaggregated scheduled fabric just looks like one big router to the application. And you [00:42:37] don't worry about performance usually going into and out of a router. 
You just rely on the back [00:42:42] plane to work. Um, but on the other hand, that does come with some overhead of scheduling to make [00:42:48] sure like, hey, that output port that's on the other side of the building, I can get a path and [00:42:53] it's ready to serve this up. So there is a lot of, uh, scheduling that goes on, and that does [00:42:59] have some impact on performance. So, um, but it has flexibility. So when we move to something like a [00:43:07] non-scheduled fabric, there's a lot more work then to make sure the applications, from the application layers [00:43:12] through the NIC, are properly tuned to work with, uh, a non-scheduled fabric. So that's the [00:43:20] real choice around that. Um, we have different use cases for both. Um, but then when you talk about, [00:43:27] uh, extending the scheduling across very long distances, then it becomes, uh, you know, much harder to get [00:43:34] that, uh, across, you know, between buildings, uh, or across longer, uh, distances. Thanks. Thanks for the [00:43:41] question. Uh, I think this is our last one here probably. Okay. Thank you. Hi, uh, I'm Meg from Marvell. [00:43:47] Oh yes. Hey, good to see you. Really engaging talk. And I have a lot of questions, but I think I only have [00:43:53] time for one. So I was wondering if you can talk a little bit more about the, uh, decode for inference and the [00:44:00] memory bandwidth requirements and how that may impact the network. Right, right. So the question, [00:44:06] the decode for inference, I think, and again, this is why I think it's exciting for [00:44:13] all of us in the room to understand, like, you know, that's a little bit like [00:44:17] understanding why the playback buffer was different, you know, why was it better in YouTube [00:44:26] versus in this or what, you know, Netflix streaming uses TCP, but Google streaming uses QUIC over UDP. 
So [00:44:33] you're really getting into this deep understanding of what the application is trying [00:44:37] to do. So decode specifically, um, has a set of requirements in terms of, um, making sure the [00:44:46] latency is low enough, whereas pre-fill has much higher, you know, say, memory [00:44:54] requirements. Right. So in a way, I would say it's not so much specialized just to decode, [00:45:01] but it's more like, what does a pre-fill server have to do that a decode server doesn't? [00:45:07] Right. And how much of the model needs to be available, how much of the data needs to be available [00:45:12] for that pre-fill step. And you can optimize it not to require that in decode. Right. [00:45:18] Right. Um, and so then you get into questions like, do I buy dedicated decode servers versus pre-fill [00:45:24] servers? That's a very optimized problem for large-scale infrastructure. Um, [00:45:32] it may not be generally applicable, but you know, I'm happy to get into it. But if you think like, [00:45:37] um, I think of what I see a lot of, uh, other companies doing right now, maybe, um, depending on the [00:45:44] size of their infrastructure, maybe their question is like, do I get inference servers versus training [00:45:50] servers? Right. General GPU servers for training. Do I get more dedicated inference servers? That's [00:45:55] probably where I see a lot of the infrastructure debates right now. Um, understanding also, though, [00:46:02] that if you get things that are super optimized for inference and you need to shift around your, [00:46:08] uh, infrastructure, if I need more training all of a sudden, I can't just move [00:46:13] the training workloads onto these dedicated inference servers. 
So I think, um, happy to go into it more [00:46:19] offline with you, but I think, um, more general purpose when you think about infrastructure, [00:46:26] you're trying to weigh, uh, am I, am I getting the, the most out of the, the fastest performance, [00:46:32] but then am I also getting a certain amount of optionality and fungibility? Because you don't [00:46:36] want to just commit, you know, large amounts of infrastructure can only do one thing. Um, [00:46:42] you're always trying to balance how much flexibility you need. [00:46:45] So that, that trade-off may be different for different companies. [00:46:49] And we can chat more afterwards. Thank you all very much. I'll be hanging out, uh, happy to chat, [00:46:55] uh, enjoy the rest of the time. And thanks for the invite and letting us talk here. And again, [00:47:00] this is work on behalf of all the teams at Meta, many of them who are here today. So hopefully just [00:47:05] connect with me or connect with them. We're happy to chat more. Thank you.
