About this transcript: This is a full AI-generated transcript of Keynote: AI Data Center Networking: Lessons from Meta's Evolution from NANOG, published June 3, 2026. The transcript contains 7,823 words with timestamps and was generated using Whisper AI.
[00:00:00] I would like to introduce Omar Baldonado, who will be presenting AI Data Center Networking:
[00:00:12] Lessons from Meta's Evolution. Omar leads the groups that develop and operate Meta's global
[00:00:18] data center networks. Omar has been in networking since the early 1990s. I was not even born.
[00:00:27] And it's a pleasure to have him speaking with us today. Welcome, Omar.
[00:00:35] Thanks. Hi, everyone. Excited to be here. And this is a really exciting event for me personally,
[00:00:49] because dating me to the early 90s, right, I've been in networking and seen a lot of the
[00:00:55] growth as many of you all have. And it's particularly an honor to represent a lot of the work that our
[00:01:03] teams have been doing. And hopefully, you know, here at the end, engage in some questions. And I really
[00:01:08] wanted to get some feedback and get into it here at the end. So I'm going to go through some of our
[00:01:14] journey so far. As we said, I'm part of the infrastructure team at Meta. We run the network
[00:01:24] infrastructure. And in particular, I focus on the data center. There's the backbone network. There's
[00:01:31] really, we're a global network overall. And we'll see, increasingly, that is all in service of our AI needs.
[00:01:40] So let's go a little bit about the AI context. It's not too much of a surprise, but I'm going to go over
[00:01:45] here, especially from a Meta perspective, so you get a sense, right? This is our current number of
[00:01:51] folks that actively use some sort of Meta product, whether WhatsApp or even Meta AI, Instagram,
[00:02:00] three and a half billion users daily. And the number of messages that go across is staggering, right?
[00:02:06] So if you look at 3.4 billion dailies, what does that mean from a network perspective,
[00:02:13] right? This is not an AI flow. This is a regular sort of web flow. But pretty much every arrow is a
[00:02:20] network connection, right? So that three and a half billion comes in, and it just explodes out into lots
[00:02:28] of communication all across the infrastructure, which I think network operators and folks tied to network
[00:02:36] would know. So related to the AI, Meta AI itself has now more than a billion monthly active users,
[00:02:46] right? Whether it's within the core products or the wearables, right? Like I've got the new display
[00:02:55] glasses here. I'm going to take them off in a little bit, right? Because this is tied. I can see messages
[00:03:01] from my son or my family right now. And I'm not going to answer them during the talk. But you know,
[00:03:06] just to give you a sense, like more and more, I depend upon these. You know, I go on a walk with my
[00:03:12] dog in the morning. This is one of my primary use cases. And I get a
[00:03:17] personalized podcast. I can say, tell me about the thing, some topic. And I can go for a few minutes,
[00:03:24] and it tells me, and then I interrupt it. And I say, wait a minute, tell me a little bit more about
[00:03:29] that. And all of a sudden, my half hour walk with my dog is all done because I've just been kind of
[00:03:33] learning and doing all this. So more and more AI will be pervasive, not just within the apps
[00:03:40] in our phone, but in everything that we do. So a little bit about, again, this is just context setting,
[00:03:49] just to give everyone here a sense of just how fast everything in AI has moved, right? So the Turing
[00:03:55] test, which I think we're, we've, everyone knows about, but we have, we have well answered that,
[00:04:01] I think now can, can a computer actually fool you into being a human. I think that happens all the time
[00:04:08] now. And, but if you look at the, how quickly or not quickly, actually, it took a long time here
[00:04:14] to get from that test to say, Deep Blue beating the world champion in chess, right? 50 years.
[00:04:25] Then from Deep Blue, end of the 90s, up until the launch of ChatGPT, that was 25 years. And a lot of
[00:04:32] the, you can see a lot of the advances in AI technology happened in the last couple of decades.
[00:04:39] Since that launch, there's been this explosion for the last three years in models, in workloads,
[00:04:47] and it's really caused the takeoff, right? So from 50 years to 25 years to the last three years,
[00:04:54] and here we are beginning of '26, right? So that's the context. The rest of the time I'm going to spend
[00:05:00] talking about why I believe this is a great time to be a network engineer. So I'm just going to take
[00:05:06] these off now so that I don't get the messages. So the first part is, there is so much technology
[00:05:17] for all of us to learn, right? It is an end-to-end problem. It's a full stack problem.
[00:05:26] So I'm going to go here into the physical infrastructure, again, using Meta as an example.
[00:05:32] Just, just to pause a little bit here. This is NANOG. So I'm assuming there are a lot of network
[00:05:38] operators here. Can you, can you raise your hand? Who's, who has actively either run in,
[00:05:44] in the past or is currently running a network? Okay. It's a good, it's good. This is actually,
[00:05:51] when, when we go to other venues, this is, this is like the most dense operator group,
[00:05:57] you know, in terms of number of hands. Other folks here, I'm guessing op vendors,
[00:06:02] is that right? Let's see. So now, yeah, different, different. One of the, the big takeaways that I,
[00:06:10] lessons here is that the operators need to work super closely with the vendors, more so than ever
[00:06:18] with AI because of this full stack end-to-end problem, right? So let's go over it. So if I start at the
[00:06:27] global level, right, Meta's infrastructure has thousands of locations, a couple hundred countries
[00:06:34] around the world, this is for our caching and edge presence. Then in terms of, as a, as a global ISP,
[00:06:43] we have hundreds of thousands of kilometers of subsea fiber as well as terrestrial fiber.
[00:06:50] One of the things that, that recently closed here that we're very proud of is the 2Africa project,
[00:06:55] which we announced several years ago, completed the core infrastructure of it end of last year. And so
[00:07:03] this connects dozens of countries together. It brings a lot more bandwidth to really 40% of the world's
[00:07:11] population. Again, we build this for, for really the internet infrastructure as a whole,
[00:07:18] and a lot of it is to better serve AI, better serve the users and get, just give them the bandwidth they need.
[00:07:28] And then again, last part on the global infrastructure perspective,
[00:07:32] we've now announced this Waterworth cable, which is even longer than 2Africa. It is the,
[00:07:37] it will be the longest subsea cable in the world, connecting pretty much all the, the major continents
[00:07:44] and, um, and again, providing diverse paths, more connectivity to, um, to all of the world that uses,
[00:07:52] um, that's using AI and using all of our applications. Okay. So that's the global perspective.
[00:08:00] Now, if I zoom in a bit to the regional perspective, right. Um, these are the sizes of the
[00:08:07] clusters that we've been working on at, uh, Meta that are dedicated to AI and in particular AI training,
[00:08:14] large-scale AI training, which I'll get into a little bit in terms of how that's been changing as well.
[00:08:21] You can see here, 2020 to 2023, um, we spent a lot of time just kind of building up the,
[00:08:30] the clusters up until, um, uh, these 24K clusters. The 24K clusters are what we trained, um, an earlier
[00:08:38] version of Llama and we had two, two 24K clusters. Then after that, we built the 129K GPU cluster. Um,
[00:08:46] and that was used for initial versions of Llama 4. We built that in 2024. And then you can see how this
[00:08:53] is progressing. Um, we have dozens of data center regions, uh, around the world. Most of them are in
[00:09:00] North America. Um, but then you can also see an expanding presence there. One of the things with AI
[00:09:09] that we've talked about is the amount of power that's required. And we announced two, two of our
[00:09:14] multi-gigawatt locations, Prometheus and Hyperion. And just recently we announced the, the energy
[00:09:23] partnerships that we've been working on to, to support these large-scale, uh, build-outs.
[00:09:29] Right. So very, very excited about the, the new, about the, the various forms of energy that we're,
[00:09:34] we're investing in, in particular, some of the clean energy is the nuclear energy that we're, we're doing
[00:09:39] here. Now a little bit more on the networking in these regions. When we, when we look at something
[00:09:47] like our, our, our gigawatt clusters, it's easiest if it's a green field, right? You're going to build a
[00:09:55] huge campus with lots of buildings, all networked together. But some of these, you know, as we all
[00:10:00] know, it's not green field. Most of the time it's not green field, it's brown field. So you have to go
[00:10:04] into a place with different locations and then connect them up. So this is the, the trenching that's going
[00:10:11] on of, uh, outside plant fiber, um, in our Prometheus cluster. And there are, there are millions of
[00:10:18] kilometers going on here just because of the density of, of what we're trying to connect between the
[00:10:22] buildings. Right. And in fact, another, a recent announcement that we went through was, uh, just last
[00:10:28] week, I think we announced this deal with Corning in terms of securing the, the amount of fiber that we
[00:10:35] have to work with. Right. Um, I'll get to the lessons, uh, a little bit more coming up, but hopefully this
[00:10:42] gives you a sense, you know, we went from global to regional and a lot of those cons, uh, global obviously
[00:10:49] has its, uh, uh, realities of geography and distances that you have to deal with as a network engineer.
[00:10:56] Regional, you have to deal with things like space and power and fiber, um, especially at the higher
[00:11:03] densities. Now zooming in inside of a building, this is, this is hopefully a little bit more, um,
[00:11:13] familiar to you all in terms when you build fabrics inside the buildings. Um, we're going to talk about
[00:11:19] scale out fabrics inside the buildings. Um, and that's where a lot of the GPUs connect in
[00:11:26] and need to kind of have their very high, high bandwidth, high performance networking.
[00:11:32] And so we've built a couple of different generations of fabrics. We've shared these
[00:11:35] in other venues like Open Compute. Um, this particular one was something called or is something
[00:11:42] called the Disaggregated Scheduled Fabric. And essentially it took, you know, if you've worked with
[00:11:49] chassis before, router backplanes and router chassis. We took the scheduling that happens
[00:11:55] between those cards and a chassis, took that algorithm and ran it across really a whole building
[00:12:02] in order to, um, get all the GPUs hooked up effectively to what is a giant 20,000 port router.
[00:12:09] Um, one of the advantages of this is it handles a lot of different, uh, um, types of accelerators
[00:12:18] and NICs. And then you can compose these into even larger, um, cluster sizes.
[00:12:26] More recently, we announced a non-scheduled fabric. This is, this requires more tuning
[00:12:32] because you don't have, you know, you just assume router backplane scheduling works like it does in DSF.
[00:12:38] But now when you're, you're not scheduling it, you have to be much more aware of the whole network
[00:12:44] problem and how you're tuning from the NICs and the, and the applications through the fabrics
[00:12:49] and making sure that works. This is actually what we've tuned to handle, uh, some of our largest AI cluster builds.
[00:12:55] So that's a little bit about the fabrics, uh, scale out more on its way. You know, we're looking at,
[00:13:01] there's, there's so much networking going on. We're looking at, you've talked, we've talked about
[00:13:06] people have seen like liquid cooled AI systems. It's big enough. Now we have liquid cooled network systems.
[00:13:12] So again, so much technology to, to be exposed to and learn.
[00:13:19] All right. So we went from the global to the regional to the building. Now let's get inside the rack.
[00:13:25] Right. The rack is where, from an AI perspective, the AI systems run. Most recently there's, as, um, as, as you've been tracking,
[00:13:36] there's the NVIDIA GB200 and GB300 systems, uh, that are rack scale and more coming from, uh, AMD and other companies
[00:13:45] that are building really rack scale systems. There is a lot of networking in here. You can see, uh, you can just see the cables.
[00:13:54] Actually, sometimes, sometimes pictures show up and they don't show, they don't show the cables. I, I personally
[00:14:00] take offense to that. I'm like, wait, you've got to show the networking. You got to show the purple or
[00:14:05] the blue or the black. You got to show it all. Right. Um, cause that's where the networking is.
[00:14:11] So there's a lot of networking in, in these, in these rack scale systems. It's the domain of, uh,
[00:14:17] scale up fabrics, right? Very, very high performance. They're really at the, the, at the leading edge of
[00:14:25] what are the, uh, the, the optics or the SerDes or the, just generally the bandwidth coming that can
[00:14:30] be driven by GPUs. This is a very high performance, high density. Um, some of our work recently there,
[00:14:37] you know, uh, we worked very closely on, on the NVLink technologies, but then also we're, uh, working
[00:14:44] on using Ethernet for those, for that scale up fabric inside a rack. All right. So zooming in even
[00:14:52] further, let's talk about the blades or the silicon. And again, hopefully you all get the sense. Networking
[00:14:59] is in all of these places. This is what, this is part of what is so exciting. Networking down at the chip
[00:15:04] layer, right? This is sometimes people call it the scale in network. Um, but how fast is, are the chips
[00:15:13] able to talk to each other die to die interconnect? What is the speed of the chip and memory, right? Do I
[00:15:20] disaggregate the memory and what's the latency of that, right? So, um, the design of, of the chips itself
[00:15:28] is very, very dependent on, on networking. So, this is our own, uh, Meta, uh, roadmap for, um, for our chips,
[00:15:42] training and inference and, you know, different types of accelerators there.
[00:15:47] So that kind of brought us all the way through, uh, from, from the global to the ASIC level. Just wanted
[00:15:54] one, uh, one lesson here. It is not, it looks clean, like, oh, things get bigger. Um, and it's,
[00:16:02] and this is just how things progress, double, double, double, double. I'll tell you for one of the
[00:16:06] learnings from, uh, as I was talking to our, our teams, they said, it is not linear. There are so many
[00:16:14] restarts, pivots. Wait, we thought we were going to go with this, go with that. Um, we tried this,
[00:16:22] try that for a little bit and then we move, we move into something else. Um, I don't think it's
[00:16:27] quite as chaotic as it was in 2020 to 2023, uh, you know, from a controller and, uh, routing and, uh, ECMP
[00:16:36] design perspective, we must have run through half a dozen versions of topologies in that 2020 to 2023
[00:16:44] period. It's a little bit more stable now, but still, um, there's a lot to keep up with because
[00:16:50] the things that drive the, the innovation, what drives the cycle now is how fast can you keep up
[00:16:58] with the GPUs or the accelerators. They're pushing to put new product out every year. Multiple companies
[00:17:05] are doing that. And so how do you keep track of that? And so that's why you want to stay, stay plugged
[00:17:12] into the AI, uh, roadmaps and then find out how networking matches that. And that's again, one of
[00:17:20] the, the reasons why it's a great time to be a network engineer because of this whole tied to all of this
[00:17:27] whole stack of technology. So if I look at that stack, right, just, this was actually something that we went
[00:17:32] through from a, uh, uh, just a scale numbers. It was the, it was boggling, right? We talk about chips
[00:17:40] at the nanometer scale, but then we're also talking about millions of kilometers. And I don't know if
[00:17:47] no one talks about what's a million kilometers. So one order, it's a, a megameter. It's a gigameter.
[00:17:54] We have gigameters like, but we don't talk about gigameters, but we still have, we're talking about, uh, a
[00:17:58] huge number of, um, of, uh, this, a huge range of distances. We talk about power, power optimization
[00:18:08] within the chip, but also within the building is an important constraint for infrastructure designers,
[00:18:16] but also the network is a part of that infrastructure. We have to understand
[00:18:20] what power is being taken up by the, the switches, the optics or the, the connections, the NICs,
[00:18:27] they all factor into this power. Um, so again, going from chip level, say, picojoules per bit,
[00:18:34] um, all the way up to the gigawatts. And then finally, from a performance perspective,
[00:18:39] it isn't so unreasonable for us to, to look at, um, getting to zettaflops or even yottaflops. I had
[00:18:46] to go look up. I was like, wait, let me make sure. Um, but if you look at an H100 GPU, roughly,
[00:18:54] depending upon the, the accuracy, you know, FP4, FP8, you're talking about a petaflop.
[00:19:01] If you have a million H100s, that takes you to exa and zettaflop, right? And then with each generation of,
[00:19:10] of improvement, let's say, you know, between H100 and GB200, there was an order of magnitude improvement.
[00:19:16] So then you look at the chips themselves, let's say over three generations getting, uh, three orders
[00:19:23] of magnitude faster. And then you're into this yottaflop, uh, regime, right? So again, networking folks
[00:19:33] have to embrace all this. This is, this is amazing, right? So that's the first lesson and why I think
[00:19:38] it's exciting. There is so much for all of us to learn together, right? Um, one piece of advice,
[00:19:45] don't learn just one layer or one component, right? You know, I'm the, I'm the fabric person that really
[00:19:52] knows this fabric, or I'm the, I'm the optics person. I really know optics, you know, FR4, ZR4, DR4,
[00:19:59] because this, these all have to work together. It behooves all of us, all of you to, to understand
[00:20:10] different parts of the stack, end to end, and up and down.
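The scale arithmetic quoted a moment ago (a million kilometers of fiber, roughly a petaflop per GPU, a million GPUs, three generations at about 10x each) can be checked with a few lines of Python. This is only a back-of-envelope sketch: the inputs are the round numbers from the talk, not measured figures, and the names `PREFIXES` and `prefix_for` are made up for this example.

```python
# Back-of-envelope scale check using the round numbers quoted in the talk.
# PREFIXES and prefix_for are hypothetical helper names for this sketch.
PREFIXES = {
    3: "kilo", 6: "mega", 9: "giga", 12: "tera",
    15: "peta", 18: "exa", 21: "zetta", 24: "yotta",
}


def prefix_for(exponent: int) -> str:
    """Return the SI prefix name for 10**exponent."""
    return PREFIXES[exponent]


# Distance: a million kilometers expressed in meters.
million_km_exp = 6 + 3        # 10**6 km * 10**3 m per km = 10**9 m

# Compute: about one petaflop per GPU (round figure) times a million GPUs.
per_gpu_exp = 15
cluster_exp = per_gpu_exp + 6  # 10**21 flops

# Three chip generations at roughly 10x each (assumption taken from the talk).
future_exp = cluster_exp + 3   # 10**24 flops

print(prefix_for(million_km_exp) + "meter")
print(prefix_for(cluster_exp) + "flop")
print(prefix_for(future_exp) + "flop")
```

Working in exponents keeps the arithmetic exact and makes each step of the chain visible.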
[00:20:16] That gets into the software side also. I haven't gone over the software yet. I'll go, I'll go into
[00:20:20] that. That's why I think this is, this is the second reason why I think it's great to be a network
[00:20:25] engineer. Networking, network engineers have always been the ones that pull it together
[00:20:32] from an end to end solution. So if I look back and again, thank you for, you know, for dating myself.
[00:20:41] I remember when one of the network engineers I worked with, uh, in the nineties, uh, Sean Finn,
[00:20:50] and he, he came and he said like, I've got this Network General Sniffer.
[00:20:55] I'm sort of curious who's used the Network General Sniffer. Shouldn't, it should be a little smaller
[00:21:00] group. I've got this Network General Sniffer and there's this port 80 stuff. Thank you for raising
[00:21:05] your hands. So, um, there's this port 80 stuff going on. We got, you know, it's showing up. We,
[00:21:11] we got to learn about it. Okay. So we started, okay, what web? Okay. What's the sort of technology
[00:21:16] going on there? So as network engineers, we had to, we'd kind of like, Hey, what's the application
[00:21:22] trying to do there? And then of course the nineties saw, okay, now we're doing, uh, online,
[00:21:27] you know, e-trading. We're doing all this stuff on the web. Um, so we learned about that application,
[00:21:36] right? Look at the next one. Um, streaming, right? I remember working with folks, um, early 2000s,
[00:21:46] trying to understand this MP3 streaming. There was a company, RealNetworks. We were trying to like,
[00:21:51] well, this is pre YouTube, right? Pre YouTube. And like, okay, what, what is that? Right. And so then
[00:21:57] we all started learning about playback buffers and jitter sensitivity and loss, right? So to make
[00:22:03] the network work for the application, yeah, we all learned about, uh, about that, right? Why was jitter
[00:22:11] important? Um, and we all felt it at the time when, when our videos would freeze, we still feel it, by the way,
[00:22:18] it's always the network. But that's when we really learned about this sort of real-time video and
[00:22:25] real-time streaming and the performance that the network can provide and how it impacts the application.
[00:22:32] So here we are today. I mean, I can go through the different technologies, mobile phones, iPhone,
[00:22:36] you know, that, that was also another, uh, had implications to, to network engineering and design.
[00:22:43] Um, but now here we are with AI and that's what, um, again, just like we did it for the, the port 80
[00:22:50] that we were seeing on the sniffers, just like we did it for streaming. We have to bring these,
[00:22:56] these various components together and make it work for, um, for AI, right?
[00:23:02] The role of a, of networking for AI, it seems so obvious.
[00:23:12] But a lot of folks, when we think about AI models, a lot of the research just for,
[00:23:18] for, for ease, people normally think of having one GPU.
[00:23:23] They can train a model, uh, on one GPU, they can serve a model on one GPU.
[00:23:31] And that's where a lot of people have started because most people don't have access to very large
[00:23:36] GPU clusters, right? I, we throw the numbers up 128 GPUs, you know, those are big numbers. Like most
[00:23:42] people don't have access to that many GPUs. But once you start leaving one GPU, you hit the network.
[00:23:49] Now a lot of it is hidden inside the, um, the, the system, say with, with the H100 systems,
[00:23:55] you had eight GPUs that was kind of hidden in there. But once you start breaking out into
[00:24:01] the rack, that is where network engineers play. And that's where we're critical.
[00:24:06] Let's go over a little bit about the software. Um, again, I'm not going to go into a lot of,
[00:24:12] uh, too, too much detail here, but it's trying to give you the same analogy, just like we learned
[00:24:16] about jitter and playback buffers, just like we learned about, you know, how many turns does it
[00:24:22] take to do an SSL authentication? And can I reduce those number of turns so that, you know, I have,
[00:24:27] I have faster web, uh, connectivity, learning about the AI application and how it's trying to leverage
[00:24:34] the network helps us do better job in supporting the application. So one of the techniques I mentioned,
[00:24:42] when, you know, when you talk about multiple GPUs, how they're going to talk to each other,
[00:24:46] there are different things that, that there are different ways that the model, uh, designers
[00:24:50] think of doing that, right? You don't want to multiply, uh, a row of them, you know, you want to do
[00:24:57] one row of matrix multiplication. You don't want to do that over the WAN.
[00:25:01] Just, uh, hopefully is, is, uh, because there's so many of those matrix multiplies. You want to do that
[00:25:08] in as fast, you know, ideally you don't hit, uh, any of the fabrics as fast as possible. Um, and that's
[00:25:14] called tensor parallel, right? And so when, when you work with model designers, and this is something
[00:25:21] that we, a lot of our network engineers learned over the, the last few years, the model designers
[00:25:27] are used to working on one GPU or working in one system. I mean, this was in, this is four or five
[00:25:32] years ago, but now I think most model designers know, hey, there's a network. Um, but they, they want
[00:25:39] to understand, they need to know, and we can engage with them like, hey, you know, this type of operation
[00:25:44] you're trying to do, let's say a fundamental matrix multiply, there is speed of light. I can't do,
[00:25:49] I shouldn't do that over the WAN, so let's do this this way in, in this part of the network.
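The "speed of light" point can be made concrete with a small Python sketch of one-way propagation delay over fiber. The refractive index is a typical approximate value for silica fiber, and the distances per tier are hypothetical round numbers chosen for illustration, not figures from the talk.

```python
# One-way propagation delay over optical fiber at different network tiers.
C_VACUUM_M_PER_S = 299_792_458   # speed of light in vacuum
FIBER_INDEX = 1.47               # approximate refractive index of silica fiber


def one_way_delay_us(distance_m: float) -> float:
    """Propagation delay in microseconds for a fiber run of distance_m meters."""
    return distance_m / (C_VACUUM_M_PER_S / FIBER_INDEX) * 1e6


# Hypothetical distances, for illustration only.
DISTANCES_M = {
    "in rack": 2,
    "in building": 200,
    "between buildings": 2_000,
    "WAN": 1_000_000,
}

for name, distance in DISTANCES_M.items():
    print(f"{name:>18}: {one_way_delay_us(distance):12.3f} us one way")
```

An operation that repeats many times per training step, like the matrix multiplies behind tensor parallelism, pays this delay on every exchange, which is why it belongs on the shortest links.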
[00:25:53] And then the other parallelisms that they're trying, that were the models trying to do,
[00:25:58] let's say data parallel, you're going to shard, it's basically sharding of the, of the
[00:26:03] the data that the model can be trained on. That I can do over more, over longer distances,
[00:26:08] over across scale out or even, uh, the WAN, well, you know, from a fabric perspective.
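One rough way to see why different parallelisms map to different parts of the network is the textbook cost model for a ring all-reduce: 2(N-1) steps, each moving one N-th of the data. The sketch below is not how Meta's stack works; the function name, the bandwidth and latency per tier, and the message size are all assumptions picked for illustration.

```python
def ring_allreduce_seconds(size_bytes: float, n: int,
                           bandwidth_bytes_per_s: float,
                           latency_s: float) -> float:
    """Textbook ring all-reduce estimate: 2*(n-1) steps of size/n bytes each."""
    steps = 2 * (n - 1)
    chunk = size_bytes / n
    return steps * (latency_s + chunk / bandwidth_bytes_per_s)


# Hypothetical tiers: (bandwidth in bytes/s, one-way latency in seconds).
TIERS = {
    "scale-up": (400e9, 1e-6),
    "scale-out": (50e9, 10e-6),
    "WAN": (10e9, 5e-3),
}

SMALL_MESSAGE = 1e6  # 1 MB, standing in for a frequent tensor-parallel exchange

for name, (bandwidth, latency) in TIERS.items():
    t = ring_allreduce_seconds(SMALL_MESSAGE, 8, bandwidth, latency)
    print(f"{name:>9}: {t * 1e6:12.1f} us per 1 MB all-reduce across 8 ranks")
```

With small, frequent messages the latency term dominates, so the WAN tier is orders of magnitude slower. A large exchange that happens once per step, as in data parallel, is more bandwidth bound and tolerates longer distances better.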
[00:26:14] So understanding and, and being the representative of the network, the capabilities of the network,
[00:26:19] I can get, I can, this is the type of networking I can do within rack, within building, between buildings,
[00:26:26] and let's work together on how to, to make that scale and work well for models. That's the role of a
[00:26:32] networking engineer. Not all models are alike, right? I think we, and not all, sorry, not all models,
[00:26:42] not, not all models are alike, that's right, but not all workloads, not all AI workloads are, are the same.
[00:26:48] So you look at training, we talk about training, so many flavors of training. Training for a large
[00:26:55] language model is different than training for, uh, for ranking and recommendation, R and R, right?
[00:27:03] Training, there's also the world of like pre-training, large foundational model training, and then there's
[00:27:08] post-training, kind of where you're really trying to optimize the model on a particular, uh, use case,
[00:27:14] like coding or image recognition. And then inference itself, inference itself is somewhat dependent on the
[00:27:21] model, right? If your inference model is so, if the model you're trying to serve up is so big that it
[00:27:28] doesn't fit on one GPU system, you may want to break it up, right? So now one stage of inference, pre-fill,
[00:27:36] may be on one set of servers, and then the decode part of inference is on a different set of servers.
[00:27:43] Now, there's a lot here, and that, um, you may not know, depending upon your, your exposure to all this,
[00:27:52] it takes some time to kind of understand the different, uh, models and workloads and requirements,
[00:27:57] but again, this is something that, that we have to do, because the, the network is fundamental in this.
[00:28:04] If I look at the, the dimensions right here in this graph, right? I haven't actually, I haven't explained the
[00:28:08] graph here, um, just kind of, for the different types of workloads, what is most important to them?
[00:28:16] So pre-fill, you can see the orange, is very compute intensive, right? If pre-fill, just a quick
[00:28:24] pre-fill decode, pre-fill is like, I'm trying to think of what I'm going to respond, so I take some
[00:28:28] time to think, then decode is, I'm going to tell you what I, what I just thought about, right? So when you
[00:28:33] ask somebody a question, like, they may sit there a little bit, think, and then, then they tell you the
[00:28:40] answer, and hopefully they're not pausing in the, in between the answer, they're just, they're just
[00:28:43] letting it come out, right? So the compute part of inference, oh, sorry, the pre-fill part of inference,
[00:28:48] compute, um, is very compute intensive. The decode, which is the red, you can see network latency
[00:28:56] sensitivity, right? So, because you don't want the jitter, or the, you don't want the pauses in the
[00:29:02] speaking, in the response being served up to you, you want it to be just very smooth and coming out
[00:29:07] as quickly as possible, right? So this is, this is the translation, sometimes people talk about like,
[00:29:11] time to first token, um, as a, as a model evaluation criteria, how quickly can you get that first token
[00:29:20] is really, uh, like a compute problem, and then, or a pre-fill problem, how do you serve that up
[00:29:26] is a decode problem? And again, this is, this is, this reminds me again of, of all those other times,
[00:29:32] like, oh my gosh, I got to learn about streaming, I got to learn, I'm going to get to learn about
[00:29:36] web, um, this is what you get to learn now as a network engineer.
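The pre-fill versus decode distinction maps onto two things you can measure from the client side: time to first token, and the gaps between later tokens. The Python sketch below computes both from token arrival timestamps. The function name, the field names, and the sample timestamps are made up for illustration and are not from any particular serving system.

```python
from statistics import mean, pstdev


def token_latency_metrics(request_time: float, token_times: list[float]) -> dict:
    """Summarize streaming latency from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # Time to first token: mostly the pre-fill (compute) phase.
    ttft = token_times[0] - request_time
    # Gaps between consecutive tokens: the decode phase, sensitive to network latency.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft": ttft,
        "mean_gap": mean(gaps) if gaps else 0.0,
        "gap_jitter": pstdev(gaps) if gaps else 0.0,
        "max_gap": max(gaps) if gaps else 0.0,
    }


# Made-up timestamps: a half-second think, then tokens with one visible pause.
metrics = token_latency_metrics(0.0, [0.50, 0.55, 0.60, 0.80])
for key, value in metrics.items():
    print(f"{key}: {value:.3f} s")
```

A large `max_gap` or `gap_jitter` is the pause in the middle of the answer described above, the same symptom playback buffers were built to hide in streaming video.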
[00:29:42] And there's a lot more, I actually, that graph doesn't have, I just throw up a couple of models,
[00:29:46] there's mixture of experts, which implies a different sort of, uh, parallelism that you can do on,
[00:29:51] an expert parallelism on the, on the infrastructure. There's reinforcement learning. Reinforcement
[00:29:57] learning has a whole bunch of, of servers talking to each other, and it is like, more like a traditional
[00:30:02] distributed system, right? There's even work that's going on at the software layer to say, um, go and
[00:30:11] make sure that the traffic between the GPU and CPU are minimized, right? So in a traditional, uh, AI system,
[00:30:21] there's a lot of coordination between the CPU and GPU, because the CPU kind of does all the planning
[00:30:27] kind of, and then it kind of sets it up for the GPU to just go, go compute very fast. But it introduces
[00:30:33] extra turns between this, between the CPU and the GPU. And, and reducing the number of, you know, turns,
[00:30:41] right? Reducing that communication and just having it being very GPU centric increases efficiency,
[00:30:47] right? Um, and again, I alluded to it earlier. It reminds me of when we were trying to do some of the,
[00:30:52] um, the SSL handshake work again in networking, when you're trying to do, you know, this is now internet
[00:31:01] distances and latencies. If you're in, if you're in a part of Asia and you're connecting to a data center
[00:31:08] across an ocean, every round trip in a handshake for, let's say SSL is expensive. So you want to reduce
[00:31:15] those turns. Again, network engineers familiar with that. This is the same problem, right? I don't,
[00:31:22] I mean, it's just as much, it's happening in a much smaller context and it's happening, you know,
[00:31:28] millions of times, uh, during a workload, but it's trying to optimize the CPU to GPU communication
[00:31:35] or, or minimize it, right? Again, from, uh, um, there, there's ways to get involved here at the
[00:31:41] software layer. We announced some of our, our library work with PyTorch to, to provide a more scalable and
[00:31:49] flexible framework for this GPU to GPU communication. And actually we open sourced a lot of it. Uh, one of
[00:31:55] the things I'm really proud of that the team, I think excites the team to be able to share our,
[00:32:00] our work. We've shared how we've made RoCE scale over the last several years. Um, we're sharing
[00:32:09] how that the software layer were, were, and we've even open sourced some of the code in terms of what
[00:32:16] needs to happen to make a hundred thousand plus GPU cluster work, right? Last thing in terms of, um,
[00:32:23] the, the, some of the more software, then I'll get to this, uh, this integration part. Just wanted to
[00:32:30] map Llama, because there are a lot of different pieces we're talking about. What does it mean for an LLM
[00:32:34] to work on the infrastructure? Right? So just trying to break it down, right? Pre- and post-training
[00:32:39] of Llama 4 happened on the 129,000 cluster. And now it's happening on GB200 clusters in our
[00:32:48] gigawatt cluster. And so that pre- and post-training happens. There's a lot of capacity there. It's all
[00:32:54] running on Ethernet RoCE. But again, that's the training part, different stages of training. Inference
[00:33:02] happens because we have this billion active users for ourselves, but everyone's trying to serve it up
[00:33:07] as quickly and as broadly as possible. Inference happens on not 129,000 machines; hopefully
[00:33:14] it doesn't require that. Inference happens on a small number of machines, either one machine or a couple of
[00:33:19] machines. That's where the numbers come in, right? Training a
[00:33:25] model may take 120, a million GPUs and they run for a month.
[00:33:32] But then you want to be serving a billion users just as fast as possible. And that's like a
[00:33:38] web-scale latency workload that you really want to watch. And that's kind of the
[00:33:46] serving inference part of it. What's good for large-scale training may not be good
[00:33:53] for inference. But you want to have fungibility, so you're not totally committing your
[00:33:59] infrastructure to just one thing, where this chip or this cluster is only good for this,
[00:34:04] but you then have to be aware of what this cluster can do. Okay.
[00:34:10] So that's where networking pulls it together, right? Network engineers pull it together.
[00:34:15] The technologies have to work. So you have to understand the performance of the application.
[00:34:20] Then operationally you have to debug it, right? You have to run it.
[00:34:25] "Is it the network?" Common question, right? If you look, again,
[00:34:30] going back to the model developers, they'll say, look, when I'm programming,
[00:34:37] there's a very clear fork in PyTorch, in the model code. Am I talking to a local GPU?
[00:34:43] Or am I talking not to a local GPU? That's it, right? And so anything that happens when talking
[00:34:49] to a non-local GPU looks like a network problem. So it's like, I'm talking to my local GPU; all
[00:34:56] hundred thousand, or anything else that's happening, looks like network. So we get this question a lot,
[00:35:01] "But is it the network?", and we have to debug the components. There are millions of components
[00:35:06] in one of these large clusters, billions across the fleet. We have to work on auto-diagnosis
[00:35:12] and correlation techniques. And again, this is because of the application knowledge.
[00:35:18] That's what lets us do that, right, as network engineers. Also, again,
[00:35:26] this slide is just to emphasize the non-greenfield nature of the problem. In this picture,
[00:35:34] there are four generations of buildings that are coming together into our gigawatt cluster,
[00:35:40] right? The H-style buildings are an older generation. The ones at the bottom
[00:35:47] are liquid-cooled facilities. So all those little dots at the bottom are cooling units that
[00:35:54] cool the buildings. So there's closed-loop cooling that goes out. And then there's one of our sprung
[00:36:00] structures in here, next to the middle H. There's a tent, which we
[00:36:06] had to stand up super quickly. And then the buildings at the far right are not our buildings,
[00:36:13] but ones that we've leased and have put our infrastructure inside.
[00:36:20] Heterogeneous. So when we talk about the Prometheus cluster, it is not purpose-built from
[00:36:26] the ground up. It has multiple generations. And again, the networking is different in each
[00:36:30] one of these. And then we're trying to network them all together. Is it a really good problem for
[00:36:35] us? No, it's a challenging problem. It's an intense problem. Okay. Last thing. Why, again,
[00:36:44] do I think it's a great time? We can go faster with AI ourselves. I've been talking all this time about
[00:36:49] building the network for AI. This is now the AI for networking, right? How can we all go faster?
[00:36:57] So at previous NANOG talks, some of my Meta colleagues have talked
[00:37:02] about how we automated away our NOC. We don't have a NOC. We figured out the thresholds and rules.
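The thresholds-and-rules approach mentioned here can be sketched in a few lines of Python. This is only an illustrative sketch, not Meta's system: the metric names, threshold values, and remediation names (`drain_link`, `open_hardware_ticket`) are all hypothetical, chosen to show how a rule table maps a telemetry sample to automatic actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str       # hypothetical telemetry counter name
    threshold: float  # value above which the rule fires
    action: str       # hypothetical remediation name


# Hypothetical rule table; real thresholds would come from operational experience.
RULES = [
    Rule("crc_errors_per_min", 100.0, "drain_link"),
    Rule("packet_loss_pct", 1.0, "drain_link"),
    Rule("fan_rpm_deficit_pct", 30.0, "open_hardware_ticket"),
]


def evaluate(sample: dict[str, float], rules: list[Rule] = RULES) -> list[str]:
    """Return the de-duplicated remediation actions triggered by one sample."""
    actions: list[str] = []
    for rule in rules:
        value = sample.get(rule.metric)
        # A missing metric never fires a rule; each action is listed once.
        if value is not None and value > rule.threshold and rule.action not in actions:
            actions.append(rule.action)
    return actions


if __name__ == "__main__":
    print(evaluate({"crc_errors_per_min": 250.0, "packet_loss_pct": 0.2}))
```

Keeping the rules as data rather than code is one possible design choice: engineers can adjust thresholds without touching the evaluation loop.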
[00:37:08] We wrote code to detect anomalies and take automatic corrective action, right? We've also automated some of
[00:37:16] the config generation through intent-based networking. And then there's a whole system
[00:37:22] to push and go update. But in both of those cases, we had to write a lot of software. And so network
[00:37:29] engineers were paired up with software developer teams. Now, the network engineers wrote a lot of
[00:37:35] software. Perl, Python, Go, C++, everyone's writing a lot of software, right? But for certain systems you had
[00:37:43] to pair up, and there was a dependency on software engineers. Even if I wanted a dashboard,
[00:37:50] I had to go talk to a React UI person to build me the cool dashboard that I wanted, right? But in
[00:37:56] this AI era, companies can quickly develop the software that we need.
[00:38:02] I've talked to a number of operating companies to say, how do we go faster with
[00:38:07] software development? And one of the blockers in the past has been, we don't have a large software team.
[00:38:14] You know, you don't need a large software team for a lot of this. Some of this you do,
[00:38:17] but pair up network engineers, or network engineers can just go use the technologies themselves, go use
[00:38:26] an agent to help you. Work with software engineers who are also using those same agents,
[00:38:33] and then just work through it. A lot of what I hear is, sometimes somebody will go,
[00:38:39] you know, I asked the AI to generate a fabric and it didn't look good.
[00:38:45] I know. Okay. Not ready, right? Maybe it's ready for that coding stuff or it's ready for whatever image
[00:38:52] generation, but it's not ready to generate my fabric definition, right? I get it.
[00:38:58] There's probably not a lot of training for network fabric generation, but work with it. Explain the rules
[00:39:04] that you have for how you want to write your fabric, explain the rules for how you want to fix things,
[00:39:11] and automatically remediate things, and kind of iterate with it. You can't one-shot it. Anyone who tries to one-shot a prompt
[00:39:17] with an LLM is almost always frustrated by that. But if you work with it,
[00:39:23] you can really get a lot of good results. And this is something that we've been doing internally.
[00:39:28] We can't publish some of that work yet,
[00:39:33] but it is definitely accelerating us. So that sums it up, right? I think
[00:39:42] it is a great time to be a network engineer. Lots of technology. It leverages what we've been
[00:39:49] doing, our native skills of pulling together solutions,
[00:39:54] and we can supercharge ourselves with AI. So I think I've got a few minutes. Thank you very much
[00:39:59] for listening, and hopefully we have a few minutes for questions.
[00:40:07] We've got mics, if you've got questions, or you just yell it out, and then I'll repeat it,
[00:40:13] if anyone has a question.
[00:40:18] Okay, thanks. Hi, Ido Kadeem from cPacket. You're from one of the few companies that really
[00:40:26] run inference at huge scale. And, you know, there's a lot of discussion about the design of
[00:40:31] clusters. But I'm really interested in how you see the network at the edge of the clusters,
[00:40:39] where all of these inference requests and needs come from all different places with all different
[00:40:44] characteristics that must have a unique network implication as well. Yeah, yeah. So I think that
[00:40:51] is something that is coming, right? Because when you look at that global
[00:40:58] infrastructure we have, and that many companies have, right, you want to move as
[00:41:03] much of it as close to the end user as possible. That's why we have caches all over the world,
[00:41:08] we have POPs all over the world, and over time more and more of that edge serving will push
[00:41:15] out to the edge, will get closer to the user. In some cases we look at it: can we do it on the
[00:41:19] phone, right? But then if you can't do it on the phone, can you do that in a localized compute
[00:41:24] facility geographically? And that's definitely coming. And it is because of a
[00:41:31] fundamental networking problem there. Better to serve it up locally if possible. There is a
[00:41:37] lot of complication there too. Different countries have different rules about what
[00:41:43] data is hosted where, that sort of thing. But it is a networking problem that we'll get to,
[00:41:49] and that we're working on now. Oh, sorry. Yeah. We'll round-robin here. Yeah,
[00:41:56] actually I have a question from remote. So this is from Daryl Newcomb, who's, okay, perfect,
[00:42:01] unaffiliated. The question is: if I understood correctly, with your description of DSF
[00:42:06] as being straightforward and performant, what are the top primary drivers for having used non-DSF in your
[00:42:12] largest footprint of clusters in a single region? Ah, okay. Good. So just to recap there:
[00:42:18] I mentioned a couple of scale-out fabric technologies, DSF, the disaggregated scheduled
[00:42:24] fabric, and a non-scheduled fabric, right? The trade-offs, I think that was the question.
[00:42:32] Disaggregated scheduled fabric just looks like one big router to the application. And you
[00:42:37] don't usually worry about performance going into and out of a router. You just rely on the
[00:42:42] backplane to work. But on the other hand, that does come with some overhead of scheduling, of making
[00:42:48] sure, like, hey, that output port that's on the other side of the building, I can get a path and
[00:42:53] it's ready to serve this up. So there is a lot of scheduling that goes on, and that does
[00:42:59] have some impact on performance. But it has flexibility. So when we moved to something like a
[00:43:07] non-scheduled fabric, there's a lot more work then to make sure the applications, from the application layers
[00:43:12] through the NIC, are properly tuned to work with a non-scheduled fabric. So that's the
[00:43:20] real choice around that. We have different use cases for both. But then when you talk about
[00:43:27] extending the scheduling across very long distances, then it becomes much harder to get
[00:43:34] that between buildings or across longer distances. Thanks. Thanks for the
[00:43:41] question. I think this is our last one here, probably. Okay. Thank you. Hi, I'm Meg from Marvell.
[00:43:47] Oh yes. Hey, good to see you. Really engaging talk. And I have a lot of questions, but I think I only have
[00:43:53] time for one. So I was wondering if you can talk a little bit more about the decode for inference and the
[00:44:00] memory bandwidth requirements and how that may impact the network. Right, right. So the question is about
[00:44:06] decode for inference. I think, and again, this is why I think it's exciting for
[00:44:13] all of us in the room to understand this; that's a little bit like
[00:44:17] understanding why the playback buffer was different, you know, why it was better in YouTube
[00:44:26] versus in this or that; Netflix streaming uses TCP, but Google streaming uses QUIC over UDP. So
[00:44:33] you're really getting into this deep understanding of what the application is trying
[00:44:37] to do. So decode specifically has a set of requirements in terms of making sure the
[00:44:46] latency is low enough, whereas prefill has much higher, say, memory
[00:44:54] requirements. Right. So in a way, I would say it's not so much specialized just to decode,
[00:45:01] but it's more like, what does a prefill server have to do that a decode server doesn't?
[00:45:07] Right. And how much of the model needs to be available, how much of the data needs to be available
[00:45:12] for that prefill step. And you can optimize it not to require that in decode. Right.
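To make the two phases in this answer concrete, here is a toy Python sketch of the prefill and decode split. It is a rough illustration under stated assumptions, not how any production serving stack works: the "model" is replaced by simple arithmetic, the KV cache is a plain list of integers, and the idea that the cache is handed from a prefill server to a decode server is an assumption about how disaggregated serving is commonly described, not something the talk spells out.

```python
def prefill(prompt_tokens: list[int]) -> list[int]:
    """Process the whole prompt in one pass and return a stand-in KV cache."""
    # One cache entry per prompt token; a real cache holds key/value tensors.
    return [token * 2 for token in prompt_tokens]


def decode_step(kv_cache: list[int]) -> int:
    """Produce one new token from the cache, then extend the cache by one entry."""
    next_token = sum(kv_cache) % 1000  # toy stand-in for a model forward pass
    kv_cache.append(next_token * 2)
    return next_token


def generate(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    """Run prefill once, then decode one token at a time."""
    kv_cache = prefill(prompt_tokens)  # could run on a dedicated prefill server
    # Assumption: in a disaggregated setup, the cache would cross the network here.
    return [decode_step(kv_cache) for _ in range(max_new_tokens)]


if __name__ == "__main__":
    print(generate([1, 2, 3], max_new_tokens=2))
```

The shape is the point: prefill touches the whole prompt once, while decode is a loop of small steps, each of which sits on the user's latency path.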
[00:45:18] Right. And so then you get into questions like, do I buy dedicated decode servers versus prefill
[00:45:24] servers? That's a very optimized problem for large-scale infrastructure.
[00:45:32] It may not be generally applicable, but, you know, I'm happy to get into it. But if you think about
[00:45:37] what I see a lot of other companies doing right now, maybe, depending on the
[00:45:44] size of their infrastructure, their question is, do I get inference servers versus training
[00:45:50] servers? Right. General GPU servers for training. Do I get more dedicated inference servers? That's
[00:45:55] probably where I see a lot of the infrastructure debates right now. Understanding also, though,
[00:46:02] that if you get things that are super optimized for inference and you need to shift around your
[00:46:08] infrastructure, if I need more training all of a sudden, I can't just move
[00:46:13] the training workloads onto these dedicated inference servers. So I think, happy to go into it more
[00:46:19] offline with you, but I think, more generally, when you think about infrastructure,
[00:46:26] you're trying to weigh: am I getting the fastest performance,
[00:46:32] but then am I also getting a certain amount of optionality and fungibility? Because you don't
[00:46:36] want to just commit large amounts of infrastructure that can only do one thing.
[00:46:42] You're always trying to balance how much flexibility you need.
[00:46:45] So that trade-off may be different for different companies.
[00:46:49] And we can chat more afterwards. Thank you all very much. I'll be hanging out, happy to chat.
[00:46:55] Enjoy the rest of the time. And thanks for the invite and letting us talk here. And again,
[00:47:00] this is work on behalf of all the teams at Meta, many of whom are here today. So hopefully just
[00:47:05] connect with me or connect with them. We're happy to chat more. Thank you.