From Generative AI to Agentic AI: The Memory-Centric Turn in Datacenter Design

[00:00:00] I am the CEO and founder of Panmnesia, and I am a chair professor at KAIST. [00:00:13] I really want to have a presentation that is about us today. [00:00:16] Today we will go over memory-centric data center architectures [00:00:22] by discussing the transition from generative AI to agentic AI, [00:00:29] which will also cover handling the KV cache memory wall with next-generation CXL. [00:00:36] So, nothing special here, but it is on my definition, [00:00:40] so I would like to ask you not to share any of the information with our machine.
[00:00:47] So, here is a simple sentence to introduce what Panmnesia is and what we are bringing. [00:00:53] We develop new solutions that can drive innovation in AI infrastructure. [00:00:59] This is actually the same as the original meaning of Panmnesia. [00:01:04] The data is, as you know, the memory. [00:01:06] And Panmnesia means perfect memory, which can enable a perfect brain and computation capability. [00:01:12] So, we have developed all the different interconnect technologies, the switches, and the infrastructure. [00:01:20] Basically, we really care about interconnect technology. [00:01:25] We build our own physical layer, digital layer, and so on. [00:01:31] So, we have all these layers for CXL and other types of link technology. [00:01:38] And the reason why we really care about building these things on our own is, as you know, [00:01:45] if you are very familiar with some specific technology and have put engineering effort into it, [00:01:51] you can easily shift from here to there and there to here, right? [00:01:56] So, as we have experience with CXL, we are also capable of developing the [00:02:03] Ultra Accelerator Link or other types of link technology as well.
[00:02:07] So, that's why we are providing different types of controller IP and the incoming IP. [00:02:14] As we have IP, what we can do is customize the silicon itself, right? [00:02:19] We can put in the controllers for CXL or a custom link to make the endpoint for a memory expander [00:02:26] or CPUs or other types of GPUs as well. [00:02:29] But we need to connect them all together at some point, right? [00:02:41] So, as we have IP, controllers, and engineers who have knowledge about [00:02:47] all the different link layers in interconnect technology, [00:02:51] we also built the switch, the CXL interconnect switch. [00:02:55] It's not a network switch here. [00:02:57] It's a switch to connect all the things in a scalable way. [00:03:01] And under such hardware, we have like federated measures and some kind of deformers [00:03:07] and the system software to run actual systems on a scalable architecture like this.
[00:03:13] So, here is the demonstration video that we made for the agent K&M sampling. [00:03:19] If you see here, the user request is "put together a pre-visit briefing for my upcoming meeting", [00:03:25] what I'm going to share. [00:03:26] And if you see here, the agent creates four different workers, and each worker [00:03:31] searches the contact, the location, the weather, and the report. [00:03:34] And the WAXA is the CXL-based memory expansion that keeps the KV cache in memory. [00:03:41] But if you have combined SSD and DRAM memory, then this one is still working. [00:03:47] So, the CXL-based agentic AI is done now. [00:03:51] But even though we make these factors double, they're still working. [00:03:56] So, this is one of the benefits that we can bring, but there are multiple different locations [00:04:02] where we can leverage the fast memories on the interconnect. [00:04:08] And I'm going to move to today's topic.
[00:04:11] This is an executive summary, a takeaway to understand what I'm going to talk about today. [00:04:17] And this is good enough, I think, to pay attention to. [00:04:21] So, we will deal with understanding the AI, [00:04:27] and then think about what kind of infrastructure is actually required to support such AI. [00:04:33] I'm not going to deep dive into AI fundamentals. [00:04:36] That has been done in previous keynotes that I gave, multiple times. [00:04:42] So, that has been uploaded for you. [00:04:45] So, I just summarized a small part that you already know about AI, [00:04:52] which is a good ingredient for understanding the transition from generative AI to agentic AI.
[00:04:59] So, here are the AI-driven problem-solving opportunities. [00:05:03] Nowadays, we can see AI here and there, right? [00:05:06] Casual conversations, or composing a cover song, [00:05:10] or creating images based on a text description. [00:05:15] So, the question you might have is, [00:05:17] how can AI solve these complex real-world problems, right? [00:05:20] So, we're going to automate it. [00:05:22] This is a simple example of classifying a panda and a bicycle. [00:05:27] As human beings, we can easily figure out which is a panda and which is a bicycle, right? [00:05:33] It's very simple for us. [00:05:37] But for a machine, as you know, it's very difficult. [00:05:40] It's essential to represent this world in a form that a machine can understand. [00:05:47] So, what we're going to do is [00:05:50] convert the data into some kind of numerical format or mathematical object, [00:05:56] like numbers, vectors, and matrices, [00:05:59] to make it understandable for machines.
[00:06:02] So, once you map the data into some kind of dimensional space, [00:06:07] in this example it's just a simple Y and X, right? [00:06:10] A two-dimensional space. [00:06:12] Then you can just draw a linear function between them [00:06:16] and figure out which is a panda and which is a bicycle, [00:06:19] highlighted in blue and orange, right? [00:06:21] That's what we are doing: [00:06:23] finding the decision boundaries on the data that we have. [00:06:28] But sometimes a complicated situation arises, like this case. [00:06:33] The panda also has many variations and different colors, [00:06:36] and sometimes we cannot figure it out well. [00:06:39] Then what are we going to do? [00:06:40] In this case, we can simply increase the dimension, right? [00:06:45] And then figure out, in the 3D space, where the actual data or the patterns that we are trying to find are. [00:06:50] So, the dimensions simply define the parameter space, and the parameters should be adjusted to determine the decision boundaries through model training. [00:07:10] The model training that we are talking about is usually just figuring out where the actual decision boundary exists using the loss function, right? [00:07:32] So, it is very similar to gradient descent and the genetic algorithm. [00:07:36] Find where the slope is zero by going downhill, and find the boundaries between them, right? [00:07:42] So, once you have a well-trained model and a new case comes in, you can figure out where that case actually belongs, right?
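As an illustration of the decision-boundary idea above, here is a minimal sketch, not the speaker's code, that fits a straight line between two groups of 2-D points by following the slope of a loss function downhill. The feature values, the labels (0 for "panda", 1 for "bicycle"), the logistic loss, and the names `train_linear_boundary` and `classify` are all assumptions made for this example.

```python
import math

def train_linear_boundary(points, labels, lr=0.5, epochs=2000):
    """Fit the line w1*x + w2*y + b = 0 by gradient descent on the logistic loss."""
    w1 = w2 = b = 0.0
    n = len(points)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x, y), target in zip(points, labels):
            score = w1 * x + w2 * y + b
            prob = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the score
            err = prob - target                     # derivative of the loss w.r.t. the score
            g1 += err * x
            g2 += err * y
            gb += err
        # Step downhill along the averaged gradient (the "slope" of the loss).
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b

def classify(params, point):
    """Return 1 if the point lies on the positive side of the boundary, else 0."""
    w1, w2, b = params
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else 0

# Hypothetical 2-D features: label 0 = "panda", label 1 = "bicycle".
points = [(1.0, 1.2), (1.5, 0.8), (0.8, 1.0), (3.0, 3.2), (3.5, 2.8), (2.8, 3.1)]
labels = [0, 0, 0, 1, 1, 1]

params = train_linear_boundary(points, labels)
predictions = [classify(params, p) for p in points]
print(predictions)
```

A new point is then classified by checking which side of the learned line it falls on, which is the "new case" step described in the talk.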
[00:07:54] So, what I showed here is a textbook case: a two-dimensional space, only two dimensions and a few data points, [00:08:06] and you can easily figure out where the decision boundary is. But in reality, it is not like that, right? [00:08:12] The data lives in spaces of thousands or hundreds of thousands of dimensions. [00:08:19] So, it is very difficult, mostly impossible, to draw the decision boundary on the raw data itself. [00:08:30] So, what does AI do then? [00:08:32] AI does not actually draw the decision boundaries on the raw data. [00:08:37] Rather, it transforms the data, layer by layer, to make the landscape obvious. [00:08:47] See the bottom line, the initial representations: the landscape there is messy, right? [00:08:54] Very noisy, with no clear structure. [00:08:57] But at each layer, patterns become more recognizable, right? [00:09:02] The irrelevant variations fade, and we can figure out what kind of things we need to distinguish at the final stage. [00:09:10] So, the idea is simple, but very hard, and you need to know how we can actually make this transition, right?
[00:09:19] So, let's see what kinds of operations shape this landscape. [00:09:25] There are three different operations: projection, then applying non-linearity, and then attention. [00:09:32] The first is projection, which stretches or rotates the landscape to figure out where the peak is, right? [00:09:41] So now the landscape is clearer, compared to the right side. [00:09:46] But projection cannot bend the figure itself. [00:09:50] So, what we need to do is add non-linearity.
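The point that projection alone cannot bend the landscape can be checked numerically. The sketch below uses made-up 2x2 matrices and ReLU as the non-linearity (both are assumptions for this example; the talk does not name a specific function). It shows that two stacked projections collapse into one projection, while a non-linearity between them breaks that collapse.

```python
def apply(matrix, vec):
    """Linear projection: matrix times vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def matmul(a, b):
    """Matrix product a @ b for small nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def relu(vec):
    """Element-wise non-linearity: negative values are clipped to zero."""
    return [max(0.0, v) for v in vec]

# Hypothetical projection matrices and input vector.
A = [[1.0, -1.0], [0.5, 2.0]]
B = [[2.0, 0.0], [-1.0, 1.0]]
x = [1.0, 3.0]

stacked = apply(B, apply(A, x))      # two projections, one after the other
collapsed = apply(matmul(B, A), x)   # a single projection with the product B @ A
bent = apply(B, relu(apply(A, x)))   # same projections with a non-linearity between

print(stacked, collapsed, bent)
```

Because `stacked` and `collapsed` agree, adding more purely linear layers adds no expressive power; only the version with the non-linearity produces a different result.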
[00:09:53] So, in this case, the hidden state actually comes out, and you can make multiple peaks, which were invisible at the projection stage. [00:10:04] But the issue is that now you have multiple peaks, and you don't know which peaks matter, right? [00:10:12] So, what you need is attention. [00:10:15] Attention makes the peaks sharp and suppresses the unimportant parts of the landscape, so that we can easily figure out where we have to go. [00:10:25] So, I am done, but we are still in the executive summary. I am done with the AI model part, and we are going to talk about the KV cache based on what I mentioned here.
[00:10:37] The thing you need to know is, this is an AI model, right, but where can we actually run it on a machine? [00:10:45] So, this is an example of AI infrastructure. [00:10:48] Now, let's take a look at the infrastructure behind the AI system. [00:10:52] Do you know what this yellow cable here is? [00:10:55] First, the part that catches your eye should be the cables, right? [00:10:59] There is a yellow cable that winds through all the components in the rack. [00:11:05] We call this UALink or NVLink. [00:11:09] The names can be used interchangeably here, because the features are very similar to each other. [00:11:14] So, I am just going to call this technology NVLink. [00:11:17] When you execute AI models, yes, a fast GPU or accelerator is important, but, as you know, nowadays we have to execute models in parallel, right? [00:11:32] And once we execute something in parallel, we have to collect and reduce the results in some way, and the devices have to communicate with each other. [00:11:40] This is a very important part of the compute-centric architecture and infrastructure. [00:11:47] So, this is enabled by a high-speed link like NVLink.
[00:11:53] Then, you shouldn't forget the second cable. [00:11:57] Do you know what it is? [00:11:59] It's the scale-out, right? [00:12:01] Yeah. [00:12:02] Yeah, scale-out. [00:12:03] This is for Ethernet. [00:12:04] I believe you shouldn't confuse scale-up and scale-out. [00:12:09] Scale-up means you can access these memories just like system memory, [00:12:14] with loads and stores and those things, right? [00:12:16] But if you go beyond that, it is scale-out, which is associated with Ethernet, network processing, or some kind of software. [00:12:25] Now, the boundaries are very blurry, right? [00:12:28] So, what do we have to do? [00:12:30] This Ethernet link is for accessing remote resources, like memory and storage devices. [00:12:37] So, whenever you need to go to memory that is not attached to this kind of link, you have to go through the whole Ethernet path, right? [00:12:45] You change everything into the Ethernet format and go through RDMA or whatever packets. [00:12:52] You spend a lot of time dealing with memory accesses or storage accesses.
[00:12:57] So, we can now see the different architectures, compute-centric and memory-centric, right? [00:13:03] The compute-centric one here is all connected by NVLink, but still relies on the Ethernet link for memory and storage accesses. [00:13:12] But if you look at the memory-centric AI infrastructure that we are considering, [00:13:16] we can combine NVLink with CXL. [00:13:19] CXL offers the unique feature of providing memory semantics. [00:13:25] So, you do not actually need software intervention or help. [00:13:30] You can just access all the memories that we have. [00:13:33] So, this is the viewpoint that we can take. [00:13:36] The other part, NVLink, is for going to the multiple accelerators.
[00:13:40] And the multiple accelerators also go to the memory expanders to get the data from there, right? [00:13:47] One thing you need to know is that CXL offers multi-hop switch cascading, [00:13:54] which doesn't exist in NVLink or UALink. [00:13:58] What that means is that it can expand to the pod or multi-rack scale, [00:14:04] as long as it is within the distance that can be reached by the electrical signal, [00:14:10] with some retimer assistance, right? [00:14:13] The other part is, why don't we make everything just memory, rather than just CXL? [00:14:19] We call this the next-generation link. [00:14:23] Everything is put together as a unified fabric. [00:14:26] The address spaces are simple, [00:14:28] and you can go to the accelerators and memories within a single memory address space. [00:14:34] And the other part is going beyond the electrical signal. [00:14:37] We call this hyper-CXL. [00:14:39] So, here's our time.
[00:14:41] So, we are done with the executive summary. [00:14:43] Now, I'm going to talk about some boring topics, digging deeper [00:14:47] into the specific parts. [00:14:49] I have four different parts. [00:14:51] The first part is why traditional generative AI is compute-intensive. [00:14:57] Then, we will cover why memory is becoming the bottleneck in AI infrastructure. [00:15:01] Then, I will talk about agentic AI. [00:15:03] Nowadays it's very popular, right? [00:15:05] And then we will discuss the solutions that we make.
[00:15:10] So, let's think about the traditional generative AI that you are very familiar with, [00:15:15] like large language models. [00:15:17] Right? [00:15:18] Here, generative AI introduced a new interaction model: [00:15:22] one question in, instant answer out. [00:15:26] Right? [00:15:27] Those things are replacing the search engines.
[00:15:29] And it doesn't require going through the whole index search and so on. [00:15:34] We can simply ask, and the model generates the answer, word by word, [00:15:39] so that we can understand what is happening. [00:15:41] Then, how does this step-by-step process work? [00:15:44] The core part is understanding context. [00:15:48] It depends on how much we understand the context. [00:15:53] That is the important part. [00:15:54] My mother tongue is not English, right? [00:15:57] So, what I usually do for English is reading comprehension. [00:16:02] This process is very similar to understanding context as well. [00:16:07] Let me take an example. [00:16:09] When we read, we naturally break sentences into small pieces like this, right? [00:16:14] And then we think about each part step by step. [00:16:18] In the same manner, the model processes input as a sequence of tokens, deciding what to focus on at each step. [00:16:25] Then, we analyze the context from multiple different perspectives at the same time. [00:16:32] Like here, we highlight the sentence in different colors. [00:16:35] Each color focuses on a different type of information, which gives you a better understanding and better hints for reading comprehension. [00:16:45] The next step is, after this, we mark relationships and connections to focus on important information. [00:16:52] This process of breaking down context, focusing on key signals, and constructing meaning [00:16:59] is how generative AI reaches a better understanding. [00:17:02] So, this is a very powerful concept for understanding what is happening.
[00:17:06] Now, let's dig deeper. [00:17:08] So, how does generative AI work? [00:17:12] Here's the example: [00:17:13] "The weather is nice today, so I feel" something. [00:17:16] All generative AI is next-token prediction, right?
[00:17:22] So, if there is an input token, it goes through the embedding and positional encoding, and the decoding. [00:17:27] And then it goes through the linear and softmax layers and outputs probabilities, [00:17:32] just like what I mentioned before. [00:17:34] So, what generative AI does is encode context in a hidden state and update it in real time as the model runs. [00:17:45] Right? [00:17:46] Now, we're going to approach what you already know. [00:17:50] The hidden state can be translated or viewed from different angles: [00:17:55] key, query, value. [00:17:56] Right? [00:17:57] The query is what you're looking for. [00:18:00] The key is how to be found, and the value is what kind of information you carry. [00:18:07] Let's think about the query itself first. [00:18:09] The query expresses what the current token is looking for. [00:18:12] In this example, if you see "feel", the query for "feel" is linked to emotional context. [00:18:19] Right? [00:18:20] So, what the query does is say: if there are any past tokens associated with emotion, ping them out. [00:18:28] Right? [00:18:29] That is the query, the actual meaning. [00:18:31] Then, the key expresses how the token can be found. [00:18:36] It acts as an index to retrieve relevant tokens. [00:18:39] If you see "nice", its key is emotional, [00:18:42] saying, you know, "I have emotional context, [00:18:46] and if someone is trying to find it, just ping me." [00:18:51] Right? [00:18:52] This is the part that we are thinking about. [00:18:54] And the value itself is the information that the key should bring out. [00:19:00] Here, that is related to "positive". [00:19:03] So, "nice" may carry emotional signals and positivity, so that we can figure out what next token we need to produce.
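The query, key, and value roles described above combine in standard scaled dot-product attention: score each key against the query, turn the scores into weights with softmax, and take the weighted sum of the values. The sketch below is a single-head toy, with hand-picked 2-D vectors for the tokens "weather", "nice", and "today" and a query for "feel". All of these numbers are hypothetical; a real model learns them.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (larger scores get sharper weight)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(query)
    # Similarity check: dot product between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, out

# Hypothetical 2-D vectors. Key axes: [emotional, temporal]; value axes: [positive, time-related].
tokens = ["weather", "nice", "today"]
keys = [[0.1, 0.2], [1.0, 0.0], [0.0, 1.0]]
values = [[0.0, 0.1], [1.0, 0.0], [0.0, 1.0]]
query_feel = [4.0, 0.0]  # "feel" is looking for emotional context

weights, out = attend(query_feel, keys, values)
best = tokens[weights.index(max(weights))]
print(best, weights, out)
```

With these made-up vectors, the query for "feel" puts most of its weight on "nice", so the output is dominated by the "positive" component of that token's value.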
[00:19:13] So, all this information and all this processing is to make just a single token. [00:19:18] Right? [00:19:19] And still, we have not reached the middle of the process. [00:19:22] What we have to do is a similarity check. [00:19:26] Right? [00:19:27] A dot product. [00:19:28] There is a query and a key. [00:19:30] As you know, the queries should be matched with what the keys advertise, [00:19:36] like "emotional" in the previous example. [00:19:39] So, we need to do the query and key matching. [00:19:42] Once that has been done, you have to do attention, [00:19:46] which means sharpening the scores, [00:19:48] right? [00:19:49] to figure out which one is the best match for mine. [00:19:53] Then we apply the attention to the values to get the new scores. [00:19:58] So, we are done. [00:19:59] Now, [00:20:00] what we are going to do is add and normalize, preserving the original information [00:20:05] while incorporating updates and stabilizing and refining the hidden states. [00:20:09] So, we are not going to see vanishing problems or information fading out, [00:20:13] something like that. [00:20:14] Now, [00:20:15] we are done with the simple process. [00:20:18] If you look at the actual layer, [00:20:21] there are multiple heads, [00:20:25] and each head has its own QK matching, softmax, and value attention. [00:20:29] Right? [00:20:30] So, this works like multiple experts. [00:20:33] The same sentence can be interpreted in different ways. [00:20:36] In this example, we just show you three heads, [00:20:39] but in reality there are a dozen per layer. [00:20:43] Right? [00:20:44] So, once you look back, [00:20:47] step back, [00:20:48] you can see it. [00:20:49] Okay. [00:20:50] There is the input state in the landscape, [00:20:53] which was very complicated. [00:20:55] But the output state has a clear peak.
[00:20:58] Right? [00:20:59] That shows which part matters when we have to make decision boundaries. [00:21:02] So, the AI does not solve the problem directly. It makes the problem, and the landscape, easier, [00:21:09] and then solves the easy problem. [00:21:11] That is what we are doing nowadays. [00:21:14] So, it is not done yet. [00:21:16] It is the cooling phase, right? [00:21:18] And the layers: we talked about a single layer, [00:21:21] but it is not a single layer. [00:21:22] There are a thousand layers in generative AI. [00:21:27] As the problem is very complex, what we have to do is add layers, [00:21:32] a number of layers, and each of them has a different perspective of understanding, [00:21:38] like understanding words. [00:21:40] Once that has been done, you move to capturing relationships, [00:21:43] and then building meaning. [00:21:45] This is going to give you a better understanding of what actually happened here, right? [00:21:51] And there are a thousand layers.
[00:21:52] Now, let's see how this is compute-centric, [00:21:56] and then I am going to move to how this is memory-centric. [00:22:00] So, the weight of generative AI. [00:22:03] And we got one more stone. [00:22:04] And now, let's see why the compute-heavy part actually happens here. [00:22:11] So, there are the queries and the values, the things you care about, and the softmax, right? [00:22:19] And then add, and then
[00:22:23] get the feed-forward. [00:22:24] And there are multiple of these. [00:22:26] That is a single layer, and there are a dozen layers. [00:22:28] What it means is that you have these things as many times as what I showed here. [00:22:34] Done? [00:22:35] No. [00:22:36] In reality, all of this happens only to get a single token. [00:22:41] So, this is very intensive. [00:22:44] Now, let's reformat the visualization into the matrices that you have to compute. [00:22:51] If there are L layers, [00:22:55] each of them has H heads, right? [00:23:00] And you have D dimensions. [00:23:03] And you have Q and K, [00:23:06] and those things are quite large matrices. [00:23:10] And now, [00:23:11] what happens here, so far, is
[00:23:14] token one comes in and you want to get token two, and there are multiple processes that you have to do, [00:23:20] like what I mentioned before. In the next step, token two comes, right? [00:23:29] And as this is a next-token predictor, what do you have to do? You have to do the computation for [00:23:35] the K, Q, V matrices, right? So then you can get the token and feed it back, and once you do the next [00:23:45] step, tokens one, two, and three come in, right, to get token four, and you also need [00:23:52] multiple matrices to do the compute. So there are quite compute-heavy operations [00:24:01] that we do.
So now let's see what AI data centers probably look like. You have GPUs, right? And [00:24:09] now your models and everything are going beyond a terabyte, but your HBM is not going beyond [00:24:17] a terabyte, right? So you have multiple GPUs to accommodate the data. It's very typical to do these things, [00:24:25] and they are all connected together by NVLink or Ultra Accelerator Link, like this, right? [00:24:32] That is the GPU-native fabric. And don't forget all the other resources. The GPUs are connected with [00:24:39] the PCIe switch, as you need to go outside for memory accesses or storage accesses, [00:24:45] right? And there is RDMA, which makes the Ethernet accesses, right? And there is also the CPU, [00:24:53] and everything runs in parallel, so you need to have the same control plane to make everything work. [00:25:03] So this is a conceptual diagram; how are we going to map this onto the real AI [00:25:10] infrastructure? This is not your laptop computer, right? So we put the CPU in a CPU tray, [00:25:17] right, and then you put the GPU in an accelerator tray, and then there is a switch tray as [00:25:23] well, and then you connect them all together to make a scale-up or scale-out system like this. So you add the tray [00:25:33] to the chassis,
and you can put multiple chassis into the rack. And if you have racks, [00:25:38] you connect the right and the left side of the row to make a pod, right? And then you connect them all [00:25:44] together over Ethernet. Now they need to connect. So this is the kind of infrastructure: there are GPUs and [00:25:51] retimers, the retimers are connected with PCIe and the CPU, and they are all connected with a switch. [00:25:57] An example: this is a simple example to remind you of what we went through today, "the weather is nice". [00:26:04] All the tokens form a sequence. Instead of putting it into a single GPU, you can make [00:26:11] all the different GPUs process it in parallel, right? We call that sequence parallelism. [00:26:17] Then what happens is, because the sequence is distributed across multiple GPUs, we need [00:26:23] to exchange information.
So far, I have given you insight into compute-intensive or [00:26:31] compute-centric architectures. Now think about a new model. Do we really need to recompute everything [00:26:39] for the KV cache? You don't really redo the multiplication for every numerical problem [00:26:46] you solve, right? You have a times table, right? So why don't you put this KV cache into [00:26:53] memory like a times table and then leverage it? In computer science, we call that dynamic programming, [00:27:00] right? So it's very simple to address the recomputation problem: why don't we just put the KV cache into [00:27:08] the memory space like this? For the first token, we're not going to throw away all the data; we just put it into [00:27:14] the memory space like this, right, for future use. Then when token one and token two come, what do you do? You can just leverage token one and token two stored in your memory space to [00:27:29] generate the next tokens, and you can do the same for the next set
as well, right? And this is happening already, and that is one of the reasons why the KV cache is managed. So with things like [00:27:41] ChatGPT or Claude, you just see the stream, right? That's the sequence, right? And this sequence, [00:27:47] for inference, needs to have all of the KV cache. So we have a prefill stage to generate the KV cache first, and once you're done, [00:27:58] then in the decode phase you just leverage the KV cache stored in your memory, right, to generate the next tokens. [00:28:06] So for each token, you now have little compute: you access the memory, get the data, and generate the next token. So now the memory is the bottleneck, right? The decoding phase also increases the amount of KV cache that you need to care about. Why? Because [00:28:28] you need to have a long context. So the memory scores actually came out before, and the KV cache occupies [00:28:38] up to 75 percent of the GPU memory.
So let me look at the KV cache part and why the memory is the bottleneck. This is the architecture that we're going to talk about, right? They are all connected and they exchange data, right? But don't forget the memory [00:28:58] that is sitting on the CPU side. If you have a large amount of KV cache, which actually exceeds the capacity of the GPU, [00:29:08] what do you do? You have to store the data in the CPU-side memory, right, going all the way through PCIe, and then look up the data and pull it all from that memory to the GPU side. So that's why I have it.
So then [00:29:28] the memory is now important. The code open club and whatever, this kind of thing is different from [00:29:36] the AI models that I mentioned before. Generative AI is, like I mentioned, one question in, an instant answer out, [00:29:43] right? But for agentic AI, it is one goal in, and plan and actions out, right? This is the difference between them.
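The times-table argument and the memory pressure it creates can both be put in numbers. The sketch below is an illustration under stated assumptions, not a model of any specific system. `decode_projection_count` counts how many per-token key/value projections are computed with and without a cache, and `kv_cache_bytes` uses the usual sizing rule (two tensors, K and V, per layer, per head, per token). The function names and the model shape in the usage lines (32 layers, 32 KV heads, head dimension 128, 2-byte elements) are hypothetical.

```python
def decode_projection_count(num_tokens, use_cache):
    """Count K/V projections computed while processing num_tokens one step at a time."""
    cache = []      # stands in for the stored (key, value) pairs
    computed = 0
    for step in range(1, num_tokens + 1):
        if use_cache:
            # Only the newest token is projected; older entries are read back from memory.
            cache.append(("k%d" % step, "v%d" % step))
            computed += 1
        else:
            # Without a cache, every token seen so far is projected again.
            cache = [("k%d" % t, "v%d" % t) for t in range(1, step + 1)]
            computed += step
        # Either way, attention at this step reads one entry per token so far.
        assert len(cache) == step
    return computed

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Estimate KV cache size: K and V tensors for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

with_cache = decode_projection_count(8, use_cache=True)
without_cache = decode_projection_count(8, use_cache=False)

# Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128, 16-bit elements.
size_4k = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
size_128k = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=131072)

print(with_cache, without_cache)
print(size_4k / 2**30, "GiB at 4K tokens;", size_128k / 2**30, "GiB at 128K tokens")
```

The cache trades compute for memory: projection work grows linearly instead of quadratically with the number of tokens, while the stored bytes grow linearly with context length and batch size, which is why long contexts push the cache past GPU memory capacity.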
[00:29:52] So usually what it does is take multi-step actions. You give the user's goal, then [00:30:00] the large language model makes the plan, and then the workers actually execute tools and everything. [00:30:09] So let's suppose that you give a goal like "write Tetris code". Then what happens? [00:30:16] There is the LLM, which first creates the plan and the steps, and the executors or the workers [00:30:23] make the generated tool calls, like searching GitHub for Tetris, right? And then they run the tool, [00:30:30] and based on the result, the LLM updates its understanding and decisions for the next actions. [00:30:35] So it generates code, evaluates, and iterates to arrive at the actual Tetris game.
[00:30:43] So now let's examine how the KV cache is created and maintained in agentic AI. In the first step, you give [00:30:50] the goal, right, and that goal goes to the LLM. What that means is you still need the KV cache to [00:30:57] understand what is happening, right? Once that is done, you set the actual goal [00:31:04] into your control state, okay? Then the executor runs your tool and carries out the task, [00:31:11] and now a context buffer appears. Why? Your executor generates output, and that output should [00:31:18] go to the LLM again, right, to make the enhancement. So the model has to rerun the LLM to [00:31:26] determine the next actions using the user input and the tool output, and it is going to build the [00:31:32] KV cache one more time, right, and generate these things. So the problem here is that [00:31:41] generative AI also makes a lot of KV cache, but in agentic AI every iteration and every tool call requires a lot of memory, right? [00:31:51] So memory really becomes dominant. This is not comparable to before. Here is one of the [00:31:59] examples of the memory expansion. Currently it's around 170 or 220 gigabytes, and this
is generative AI's problem alone. [00:32:09] So the requirements are typically constrained by the memory capacity available on a GPU, and now the [00:32:17] sequence length increases, right? It's not as if you can now just put all the data into your [unclear]. I do [00:32:24] not believe [unclear]; the scaling, you know, following this trend, is quite high. And if you look at [00:32:31] agentic AI, the session time gets longer, right? Why? When it checks email, you know, the checking is [00:32:38] slower than what the model is going to do, right? So there is a huge amount of time during which you hold the session and the context. [00:32:44] And if you go further, it's going to make some errors, right? And what do you do? You have to [00:32:50] do multiple steps. So it's going to go beyond what today's memory can actually provide. [00:32:58] So we try to, you know, solve this kind of problem. [00:33:03] You can see that the generative AI computation demands are moving like this. So at the very beginning [00:33:10] of generative AI, what they did was, as I mentioned before, have everything interconnected, right, like this. [00:33:16] But now we have the KV cache, so what they did was, you know, have the GPU go to the CPU, right, and get [00:33:22] the data from the local CPU memory, and then it exploded. So for agentic AI we need something [00:33:30] more. If you look at the number of lanes, even for the CPU, [unclear] lanes are in there. In reality it's 32 or [00:33:38] 34, and the current one is 60, and with more than 200 lanes it's a high-power switch, right? And if you look at the [00:33:47] other part, it has already been enabled by CXL 3.1. What that means is we need to go through the CPU [00:33:55] and get the data with memory semantics. So I'm going to skip the CXL part and keep it brief. [00:34:03] With CXL 1.1, the CPU's memory controllers can go out over CXL and just put the data on the [00:34:11] memory side, right? And if you go to
CXL 2.0, you can expand over the switch. If you go to CXL 3.0, [00:34:20] what it means is that you can make the switches into a fabric, so that you can, you know, make all the [00:34:27] things a unified interconnect network. So your accelerators can make a unified HDM over the [00:34:35] accelerators and the memory-centric protocols, or you can just access the accelerators' [00:34:41] memory directly. So now what happens? You can actually expand the memory on the CPU side, right, [00:34:47] like this, over CXL, and you have a PCIe switch. What do you do? As you have a CXL and PCIe [00:34:56] fusion switch, what you can do is just replace the switch with CXL. Then you can [00:35:02] have CXL bypassing the Ethernet, and you don't need to pay the overhead to access the remote memory. [00:35:10] So that's what we wanted to do. There are multiple trays: accelerators, memory, CPU, and switch. And this is [00:35:17] going to follow the workload demands. In cases where you have more accelerators, what you can do is [00:35:23] add multi-headed devices like this, and it means that your [unclear] goes through the MHD CXL devices [00:35:32] in a simple way. Or if you need more memory, what do you need to do? You just add a switch with multi-headed [00:35:39] devices, and then you can actually bring everything into a single pool to get more data. And in the [00:35:47] near future, we believe that you will just use memory links and optics to form a single unified memory [00:36:03] access space. We call this [unclear], as mentioned before. So we made this kind of [00:36:10] process, and we are preparing for rack-scale demonstrations at the end of this year. And we [00:36:16] also, I cannot reveal the names, but there are multiple vendors that we are working with on the rack-scale system. [00:36:24] So this is a rack, and this is the business model: we have IP, right, and custom silicon, and we have [00:36:31] a hardware product, and the
controller that we are talking about is [unclear], without anything coming [00:36:38] from the outside. So we already have PCIe 7 and CXL 4, and we also have a big acceleration unit, which means [00:36:46] we're going to remove all software and firmware accesses, as this is the memory controller; it's not [00:36:52] an [unclear]. So we have PCIe 7 and CXL 4 combo controllers, which can do either CXL or PCIe 7. And we made this, [00:37:04] a fabric switch, a CXL 3.2 fabric switch, which can support all of these things as a fabric. And there is a [00:37:13] controller that we already made, and there is also the big acceleration unit that removes all the firmware [00:37:19] on the switch side. And there are [unclear] solutions inside a 6.4, 6.0, and 3.2 fusion switch, [00:37:29] and it has been through the PAM4 signal test as well. You can see the very good eye, right? It's [00:37:36] quite a good eye [unclear]. And we are not relying on [unclear] for the board, so we [00:37:45] have multilayer board technology, and for the switch we made the multilayer board and brought it up [00:37:53] for the fusion switch part, which can be integrated into the node or the tray, [00:38:00] like this one. So this is the hardware product. This can [unclear] the PCIe 6.4 and 6.2 fusion switch [00:38:08] part. It's compatible with the commercialized software product. This one is not a big one, right? This is [00:38:15] what we provide as a preview solution. We are preparing a large number of things that are much bigger. [00:38:22] So as we have our IP, we can make a retimer and an endpoint, and we provide a full-stack solution with [00:38:29] which we define the AI infrastructure, from the IP and controllers to the custom solutions and the network, [00:38:37] and we try to make a full solution for connected AI. So please let me know if you have any [00:38:43] questions. Thank you for, you know, attending [unclear].
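
Editor's illustration (not part of the talk): the agentic loop described above, where each tool output is appended to the context and the LLM is rerun, can be sketched as a simple KV cache size estimate. This is a minimal sketch under stated assumptions: the model dimensions in `ModelShape`, the token counts, and the function names are all hypothetical and chosen only for illustration, and the formula is the common per-token estimate (keys plus values, per layer, per KV head) rather than anything the speaker presented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions (assumed, not from the talk)."""
    num_layers: int = 80
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2  # e.g. a 16-bit floating-point format


def kv_cache_bytes(shape: ModelShape, context_tokens: int, batch_size: int = 1) -> int:
    """Estimate KV cache size: one key and one value vector per token, layer, and KV head."""
    per_token = (
        2  # key + value
        * shape.num_layers
        * shape.num_kv_heads
        * shape.head_dim
        * shape.bytes_per_value
    )
    return per_token * context_tokens * batch_size


def agent_context_growth(goal_tokens: int, steps: list[tuple[int, int]]) -> list[int]:
    """Return the context length after the goal and after each agent step.

    Each step is (generated_tokens, tool_output_tokens); both are appended
    to the context before the LLM is rerun for the next action.
    """
    lengths = [goal_tokens]
    for generated, tool_output in steps:
        lengths.append(lengths[-1] + generated + tool_output)
    return lengths


if __name__ == "__main__":
    shape = ModelShape()
    # Hypothetical session: a short goal, then three plan/tool iterations.
    steps = [(200, 4_000), (300, 12_000), (250, 30_000)]
    for i, tokens in enumerate(agent_context_growth(500, steps)):
        gib = kv_cache_bytes(shape, tokens) / 2**30
        print(f"after step {i}: {tokens} tokens, about {gib:.2f} GiB of KV cache")
```

The point of the sketch is only the shape of the growth: the cache scales linearly with context length, and in an agentic session the context keeps accumulating tool outputs across steps while the session is held open.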
