About this transcript: This is a full AI-generated transcript of Become AI Researcher From Scratch - Full Course - LLM, Math, Pytorch, Neural Networks, Transformers from Vuk Rosić, published June 9, 2026. The transcript contains 27,566 words with timestamps and was generated using Whisper AI.
[00:00:00] Speaker 1: welcome to the full course of becoming ai researcher i will guide you step by step through mathematics and then pytorch fundamentals neural networks and lastly transformers after finishing this course you will have all of the fundamental knowledge necessary to start doing your own ai research to start understanding others ai research and publishing your own papers first we will start from basics talking about functions in mathematics different types of functions and what each term in a function determines then we will talk about derivatives how derivatives show a rate of change of a function and how they are used later in neural networks after that we will learn about vectors vectors are present everywhere in large language models reinforcement learning robotics and neural networks and anywhere else you can imagine pretty much they are used for everything then we will combine our knowledge of vectors and derivatives to figure out gradients gradients are used for learning of the neural network gradients is how neural networks learn so you will understand what they are how they show the direction of steepest ascent how to use them to make neural network errors lower how to make how to use them to make neural networks learn and by the way right now i'm also explaining to you your roadmap your path to become ai researcher what you need to learn so keep track of what i'm explaining this is not just an intro this is your roadmap to becoming ai researcher so you can self-study as well next are matrices and matrix multiplications these are maybe the core of neural networks besides gradients they are done on gpus neural networks are all about calculating matrices and doing it quickly and efficiently and lastly in the math module we have probability this is used also absolutely everywhere neural networks large language models reinforcement learning robotics everywhere video generation diffusion image generation then there is pytorch module the first 
lesson includes creating tensors understanding tensors understanding data types then matrix multiplication different operators and how matrix multiplication works in pytorch then transposing flattening reshaping viewing squeezing unsqueezing tensors all of these things are gonna be used very frequently in ai research so you need to understand and master them and i have a lot of examples for you to play around you can find this github repository below the video the first link in the description you will also learn tensor indexing and slicing so if i tell you i want last two rows and out of those last two rows i want second to fifth columns and so you need to be able to quickly write indexing logic then we will talk about special matrices and arrays like identity matrix random matrix lin space etc after pytorch module we will work with neural networks first we will create a single neuron understand what weights and biases are how they influence the output how they are calculated then activation functions how they turn the straight line of weights and bias and input multiplication into a curved line that can encircle data we will talk about different activation functions and what they are useful for we will also talk about multi-layer neural networks and how to have entire layer of weights in a single matrix and how that matrix multiplies inputs and how the results get transformed so we will be building neural network layers from scratch coding it step by step with all of the functions explained step by step in this course as well as back propagation and how neural networks learn and finally in the last module we will talk about transformers transformers are a base of today's large language models as well as image generation video generation audio generation and almost everything you will understand the attention mechanism and then on top of that of the transformer and how all of that works as well as tokenization and this will be a good base for building your own 
large language model as well as video generation image generation and other models so that's the roadmap that we will follow you can also follow this roadmap by searching other youtube videos and self-studying and if you are ready we will begin
hello guys this is mathematics for ai research if you understand and learn the math i'm gonna explain in this video you will be able to take any ai research paper and understand the math within it as well and you will be able to continue doing research in any part of ai that you wish so whether it is large language models transformers video generation diffusion image generation reinforcement learning robotics whichever part of ai you are interested in this is fundamental math that appears everywhere this is gonna be part of my course on creating and training a large language model like kimi k2 thinking from scratch but this math applies to the entire field whichever ai research you are doing you're gonna have to be learning this math as well if you click the first link below this video you're gonna get to this repository become elite ai researcher and the first module is math which is what i'm gonna explain now we have all of these jupyter notebooks so the requirement here is that you understand basic python so here we have some
array we have some for loop we have some variables so this is all basic python so if you are not familiar with basic python you can just type python tutorial on youtube and you can check whichever one you like maybe this one or actually harvard classes are actually very good so i want you to know how to execute python code it can be on your local machine or it can be in google colab so you can also search for a google colab tutorial if you want to use google colab but also you can just run it on your local machine so whichever one you prefer also if you're using vs code or cursor or some fork of vs code you can install this brand new google colab extension so if you just search for colab it's by google install it and then you can connect to google colab gpu cpu tpu from your local environment from your local jupyter notebook so now on my github repository in the math folder i want you to open this first jupyter notebook so you can copy this and open it in google colab as i explained you can ask chat gpt how to do it or open it locally or however you want to open this and then i have it here on my local machine and we're gonna go through this so if you install the google colab extension you can click here google colab and we can say open google colab web but i don't want to open this on web i want to say select kernel with which to run this and select another kernel and then colab and then i can create new colab server and then i can choose cpu in this case because i don't need to run this on gpu now you can write a name i can just use the default name like this press enter but you don't actually need to do this if you want to run the python code on your local computer so i'm showing you how to run it from your local machine using google colab google's gpus or cpus but you can just run it locally as well and so if i click here colab i can click remove server if i want to shut down the server but anyways first depending on
which kernel you are running here if it's google colab it will have numpy and matplotlib already pre-installed but you may also want to create your own environment and this is something you would learn how to do on youtube or from chat gpt and you need to install these requirements so if you try to run this but you get an error like numpy or np is not defined then you don't have these requirements installed so make sure you enter your console and make sure you are in the folder in which the requirements are so this is the outside folder in which the requirements file is and then i'm going to say pip install -r requirements.txt but make sure to create your environment install this in a python environment and then just press enter but actually i don't want to install pytorch because for the pytorch tutorials i'm going to use google colab so i'm just going to install numpy and matplotlib and then i'm going to comment out pytorch so let me do the installation again okay now this is already installed so let's start with the maths now that you have everything set up everything installed i will show you the curriculum first we're going to learn the functions so these are math functions not python functions but math functions then derivatives then vectors gradients matrices and probability a lot of you may already be familiar with some of these especially maybe with functions but you can use timestamps below or chapters to skip but maybe you can also watch it anyways so function will map from one value or from one set of values but let's make it simple you have some value x equals one your function will transform it into some other y like y equals two based on some rules for example if we have function that y is equal two times x then if x is one y will be two if x is five y will be two times five which is ten and i want to draw this so as x goes zero one two three four the y is gonna be the double so here for example at five y will be ten so then
for every x we're gonna take and draw corresponding y so for example for 2.5 right here y will be five for two y will be four right here for five y will be ten so if we then just connect all of these dots we draw all of these dots we get this line so this is a linear function and this is how we drew this is where the plot drawing happens so you don't need to understand this part at all this is just to show this plot so let's just understand this part this is our function so we will create an array of x values between 0 and 10 and we're gonna create 100 x values between 0 and 10 which is including 0 and 10 so we you see we have x from 0 to 10 and in between we have 100 small dots and for each of these we're gonna just calculate the y corresponding y values so two times two times these x values so y will also be an array of 100 numbers but each element will be two times the corresponding element in the x so if it's three it's gonna be six here four will be eight etc and then we just use this to draw and that's how we draw the linear function here and let's see if instead of having number two here we have some other numbers so y is equal to 1x 2x 3x 0.5x minus 1x so this will this number will determine the slope of this linear function the 3x which is the highest number here will also have the highest slope the lowest number minus 1 will have kind of highest negative slope in negative direction on negative x for example if we look at y equals x here where x is 4 y will also be 4. 
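The array-based calculation described here can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: the variable names (`x`, `y`, `k_values`, `lines`, `slopes`) are assumptions, and the plotting part is left out on purpose.

```python
import numpy as np

# 100 evenly spaced x values from 0 to 10, both endpoints included
x = np.linspace(0, 10, 100)

# y = 2x: each y element is twice the matching x element
y = 2 * x

# The coefficient k sets the slope of y = k * x
# (these k values follow the ones mentioned in the lesson)
k_values = [1, 2, 3, 0.5, -1]
lines = {k: k * x for k in k_values}

# For a straight line, rise over run between the two end points equals k
slopes = {k: (yk[-1] - yk[0]) / (x[-1] - x[0]) for k, yk in lines.items()}
print(slopes)
```

Each recovered slope matches its coefficient k, which is the point of the comparison: the number in front of x is the slope.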
so then that's this slope as well and the way this is coded is very simple there is a list of k values and then for each of the k values we plot the function i recommend using ai chat gpt or cursor to explain any of these things for you if you don't understand it but i want to focus on the theoretical part of the math for now so this coefficient determines the slope of the function then there is quadratic function where y is equal to x squared so as we go on x slowly the y will become a lot bigger will start to accelerate so in the beginning y doesn't grow so quickly but later when we go from two squared three squared four squared five squared the y starts to become a lot bigger a lot faster here like zero squared is zero one squared is one two squared four three squared nine four squared sixteen so y starts to grow very fast you can play around with this code and modify it to see what the drawing will look like now notice that y is never negative here so y never goes down below zero on y axis we never go down because square function quadratic function will never give a negative result because if you multiply two negative numbers or negative number with itself it's going to become a positive number so if you square a negative number it will become a positive number but that's not true for the cubic function where putting a negative number cubed will give you a negative number so this is what cubic function looks like and it grows even faster than square function here at three we are already at 27 on y and three was just nine in the square here so cubic grows even faster but there is also negative part of it so this is the shape of the cubic function this is the shape of the square function so knowing these shapes is good square root function so y is equal to square root of x i'm gonna have the same code here so i put some numbers you can only find square roots of non-negative numbers by the way you cannot find square roots of negative numbers it's not defined or
it's not real i should say and so this is what y equals square root of x looks like this is also good to know this kind of function so there is no limit here it will also grow forever as x grows forever y will also grow forever then we have exponential function where y is equal to 2 power x or some number some constant power x and this grows very fast this grows faster than the quadratic so for large x 2 power x grows faster than x power 2 it's good to know that and this is the shape of exponential function the code is very similar this part is just different so y is equal to 2 power x this is how you can denote power in jupyter or python with a double asterisk we also have sine functions so y is equal to sine x and i'll show you first that this is a periodic function and this function is important it's a trigonometric function it's used everywhere it's used so often and in python you can just write y is equal to math.sin(x) or np.sin(x) for a whole numpy array now let's go back to the linear function so this is also a linear function so we have some two times x which is coefficient and the variable plus some constant or minus some constant that doesn't have times x in it so if you change this number three make it four five six you will just move this function left or right so the slope will be same so if you change this number two then you will change the slope of the function but if you change this number three you will just move it left or right this is also important to know and now we may take a look at these functions and so that's gonna be my functions i recommend you play with this make sure you understand this and we can go on to the next lesson next we have derivatives i'm gonna try to explain the main points of derivatives so if you have a function you can find the derivative of that function that shows the rate of change of the function how quickly it grows at each point so you see this linear function is always growing at the same rate so this function is not like quadratic where
it's growing a bit slower here a bit faster here etc this linear function is always growing at the same rate in this case the derivative is gonna be equal to this number two so if we have two x i'll show you later how to calculate derivatives but the derivative will be this two so at any point when you increase x by a little bit when you go in the x forward by a little bit the y will go by two times that amount you went on x so you move x a little bit the y will go up by two times that so here you can see the derivative is two at any x so at any x here the derivative is always two because it's a linear function the growth is always the same can you try to explain yourself what derivative is can you try to explain what i just explained yourself so derivative shows rate of change and it's linear function so rate of change is always the same and it's two in this case it's equal to this number two of two x so in my code i have the function f of x and the derivative is just hard coded here it's always two i'll show you later how to calculate it but this will be derivative two let's see derivative of this quadratic function y is equal x squared so the derivative here will be 2x and i'm gonna explain how it's calculated but we can see that here the function doesn't grow equally everywhere so actually as we increase or go further into negative on x the y will start growing a lot faster here around zero x is growing but y is not growing up so much but as we grow x the y grows a lot faster so now we don't have a constant derivative that's always the same like two the constant growth here the derivative looks like a linear function so the derivative here is 2x so the derivative itself can be drawn as a function by the way so we have one and two for this spot so funnily enough you can also take derivative of this derivative so that would be second derivative so here the growth will be just two derivative of 2x will be just two but let's not get
confused so here we have the function x squared the derivative is 2x because the formula is you put this so if you have x to n you put this number two number n in front of x so you get 2x you just put this number in front of x so you get 2x and you subtract 1 from this number up here so 2 minus 1 will be 1 so you will have x to power 1 so that's just x so the derivative of this is 2 times x power 1 which is just 2x that's the formula for calculating derivatives you can also check other tutorials now this can go quite deep um i think it would be good if you knew it but it's not absolutely essential to master this i'm just explaining these are the things you need to understand you also learn this in high school so you will encounter this in some research papers and it's good to know this stuff now do you want to dive deeper right now or later it's up to you so what this derivative means is that um any of these points wherever you pick x for example if you look at x equals 2 then that means here on this original function if you if at x 2 you increase x by just a little bit then the y is will increase by four you see here it's number four so if you increase x by a bit the y will increase four times because at this particular uh place let's calculate the derivative derivative is two times x so two times two is four so at this particular place if you increase x by a little bit the y will increase four times that little bit you of your increase here at three derivative is six so at here at three if you increase x by a little bit the y will increase six times that little bit and so you can see that y is increasing faster and faster as you increase x so this may require a bit of practice a bit of maybe pen and paper you can try to think about this yourself maybe watch some specific tutorials on derivation if you wish so now or maybe even later so we have this quadratic function and i want to show you at each of these points whichever points we pick zero one two on the x 
axis each point has its slope so because it's quadratic function we can see that the slope is zero at x equals zero because the derivative two x is zero there at one there is a slope at x equals two there is even higher slope so this slope is basically the derivative i told you that at x equals two if you increase x by a little bit by 0.001 for example the y will increase by four times that amount now i'm not gonna tell you that if you increase here x by one so you go from two to three the y will increase by four i'm not gonna tell you that because as you move towards three the y rate of change will also be increasing keep in mind that the slope will change that's why i say just around this number two the slope is four okay as you move towards three the slope will become six it will increase so that's why i say if you just increase x by a little bit here the y will increase by four times or six times that little bit but if you increase x by a lot more then the slope will also change so in summary the derivative will show you this slope it will show you rate of change around this point and for this particular function the slope or the derivative will be equal to two x so here if x is one the slope is two if x is two slope is four if x is three slope is six etc now let's see some derivation rules so if you have x power three you're gonna put this three in front of the x so you get three times x and also you need to subtract one from this exponent so three minus one is two so the final derivative will be three times x power two so it's x power two times three and so you can see these examples like derivative of x power five is gonna be five times x power four you just subtract one here so it's x power four times five and the power four applies only to x not to the whole thing it's just x power four and so if we have a function
y equals x cubed or x power three then the derivative will be three x squared according to the rules so this is the derivative function it's three times x squared and so we already know that x squared it's gonna be like this u shape we discussed that earlier and then three times x squared so it will just make the y's even larger it will maybe squish it like so this coefficient will just change this how wide or thin the u shape is but basically because we have x squared we're gonna have u shape so that's the derivative it shows how the function is changing at any of these points some other derivation rule so if you have a function 2x plus 3 you can find derivatives of each of these separately so around plus or minus so you can find derivatives of 2x and derivative of constant 3 and the rule is that constant 3 any constant becomes zero so it will just be plus zero so we can omit that and derivative of 2x will be so the rule is the following if you have constant times variable the constant will remain the same so 2 will remain the same because it's times variable okay here this constant two is not separate so here 3 doesn't have any variable attached to it 3 is not multiplying a variable so 3 disappears it becomes zero but here constant 2 is multiplying a variable so it remains 2 in the derivative this constant 2 will remain 2. it's explained well in chat gpt so as i said derivative of a constant it's a alone constant it doesn't doesn't multiply variable like here 2 constant 2 multiplies variable so this is not a lone constant so if you have a lone constant doesn't multiply variable then derivative is zero so this part will be completely deleted omitted and then derivative of 2x is 2. 
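The rules used so far can be written compactly in standard notation. The symbols c, n, f and g are generic placeholders introduced here for illustration, not names from the video:

```latex
\begin{align*}
\frac{d}{dx}\,x^{n} &= n\,x^{n-1} && \text{(power rule)} \\
\frac{d}{dx}\,c &= 0 && \text{(a lone constant)} \\
\frac{d}{dx}\,\bigl(c\,f(x)\bigr) &= c\,\frac{d}{dx}f(x) && \text{(constant times a function)} \\
\frac{d}{dx}\,\bigl(f(x) + g(x)\bigr) &= \frac{d}{dx}f(x) + \frac{d}{dx}g(x) && \text{(sum rule)} \\[4pt]
\frac{d}{dx}\,x^{3} &= 3x^{2}, \qquad \frac{d}{dx}\,(2x + 3) = 2 + 0 = 2
\end{align*}
```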
let's see how so 2x is equal to 2 times x power 1 and i explained earlier that if we find derivative so first of all because we have constant times variable this constant will remain the same so in the derivative this constant will not change so in the derivative this will still be 2 times something okay so now let's see what that something is i told you the rule is whatever power you have here you put it in front of the x so 1 will go in front of the x and then you will subtract 1 from whatever power we have here so we have 1 minus 1 so 1 times x to the power 1 minus 1 so that's 1 times x to 0. x to 0 is just 1. any nonzero number to the power 0 is equal to 1 so that's gonna be 1 okay so then i told you we will just get 2 times 1 because as i said 2 multiplies the variable so it remains the same in the derivative and then we apply this rule for x to the n that i explained so it will be just 2 times 1 and so this derivative of 2x is 2 so at the end derivative of this is 2 derivative of this is 0 so we get 2 plus 0 we get derivative of this whole thing is 2 so you can practice this this is done in high school university you can practice this yourself if you want with chat gpt or some other ai or youtube videos and i can show it to you here so the function is 2x plus 3 the derivative is 2 it's always 2 for any x whenever you increase x the y will increase by 2 times whatever the increase in x was in this function because the derivative here is always 2.
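One way to check these rules is to nudge x by a tiny amount and measure how much y changes, which is exactly the "increase x by a little bit" idea from the lesson. The sketch below is an illustration only: `numerical_derivative` is a hypothetical helper written for this example (a central difference), not a function from the course repository.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference: (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Functions from the lesson paired with the derivatives given by the rules
cases = [
    ("2x + 3", lambda x: 2 * x + 3, lambda x: 2.0),
    ("x^2",    lambda x: x ** 2,    lambda x: 2 * x),
    ("x^3",    lambda x: x ** 3,    lambda x: 3 * x ** 2),
]

for name, f, df in cases:
    for x in (1.0, 2.0, 3.0):
        approx = numerical_derivative(f, x)
        # The value from the rule and the measured slope should agree closely
        print(f"{name}: x={x}  rule={df(x):.4f}  numeric={approx:.4f}")
```

The measured slope of 2x + 3 stays at 2 for every x, while the slopes of x squared and x cubed change with x, in line with 2x and 3x squared.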
so this means that plus 3 constant doesn't affect the slope of the function as i said the slope is determined by this number 2 and so if i change this number 2 the slope of the function will change but this constant plus 3 this will just move the function left or right so this constant plus 3 will not change the slope so it so it makes sense that it disappears in the derivative as well it becomes zero it doesn't affect the slope and now i just want to summarize derivative so we have linear function the derivative is constant we have quadratic functions so x power 2 the derivative is linear so x power 1 is the derivative we have a cubic function x power 3 the derivative will be x power 2 it will be quadratic so derivative is kind of one degree less than the function so that's going to be the core knowledge you need to know about derivation you will eventually learn more as you read papers but for now this is enough third lesson let's learn about vectors so this is this is maybe the most important one besides matrices as well uh everything is in vectors in machine learning neural networks first we will look at a vector as just an array of numbers so here we're going to create a numpy array hopefully uh you understand what numpy is you can watch youtube if you don't understand this python so these are fundamentals you can learn that first so array of just two numbers in this case three and four vector can be array of any amount of numbers so we can also look at these numbers as coordinates on a coordinate system so x is three y is four we can look at that this way if we want so this blue example it's gonna be x is three here three and y is four so that's the vector the spot so we can draw vector by going from zero zero to that spot and you can see that vector is like an arrow so vector has a direction and length direction and length we can also look at a vector as going like uh three units this way and then four units this way to end up at this spot and here we have 
two other examples of minus so minus two one so going minus two on x and one on y minus two one and zero five so zero on x five on y so this vector will be on the x sorry on the y axis then let's look at vector addition so if you have vector three two and vector one three if you add them you're gonna say you add the first elements so you add three plus one the first elements will be four and the second elements two plus three will be five and we can also add vectors visually starting from the first vector three two going from zero to three two and the second vector one three now this vector would go here from zero zero to one three like this here so this would be this second vector uh and then we will just move the second vector to the end of this first vector if we want to add it visually and so wherever it ends up that's gonna be our resulting vector so the resulting vector will be four five we can draw it from origin to four five if we want so that's how you can add vectors visually as well let's see multiplying vector with a scalar so if we have vector two one what happens if we say two times that vector 0.5 times that vector minus one times that vector so this blue is our original vector um so it goes from zero to two one now multiplying with a scalar will stretch or squish or inverse so make it go in the other direction so two times vector two one will give us vector four two so you multiply each of the elements so two times v is gonna be two times two it's four and two times one it's two so the resulting vector will actually be two times uh longer and it will have same direction and multiplying with zero point five it will just make it two times shorter instead multiplying it with minus one will have the same length just the opposite direction so it will be minus two minus one how do we calculate length or magnitude of vector how long is this vector so it's actually similar to pythagoras theorem so you can just say x squared plus y squared you add them up 
and then you take the square root of the result but in pythagoras theorem you just have two numbers but here you can have any amount of dimensions so you can have x squared plus y squared plus z squared plus a squared so you just add all of the squares of every dimension and at the end you take the square root of that sum so let's say we have a vector that has four dimensions let's say this is the vector and then we just square each dimension sum them up let's say it's 30 the sum of all the squares and then take the root of 30 and that's the length the magnitude of your vector so it's always the square root it's not like fourth root here it's square root always here are some examples if we have vector one one the magnitude will be about 1.41 so that's the length the magnitude of the vector so you can see here that this is a bit longer than this one unit so it's about 1.41 here the vector zero five the magnitude will be five here you can already see that the length of this is five so okay and then vector three four the magnitude the length of the vector will be five units and if you remember from pythagoras theorem if you have a right triangle with sides three and four the hypotenuse will be five but what's gonna happen often is if you have some vector that has some length you will want to squish it back so it has unit length length of one this is to make vectors in neural networks stable because you are multiplying a bunch of them so they're gonna start exploding the resulting vectors will become exponentially or quadratically or a lot bigger and bigger and bigger as you multiply them or they will become a lot smaller and smaller and smaller and vanish so you will want to normalize vectors you will want to make vectors of unit length of length one let's say we have a vector three one you will find its length which is going to be about 3.16 in this case and you will just divide this
vector with its length with its magnitude and dividing it means you will divide each dimension with the magnitude separately and then this new vector it's gonna be unit length so it will have same direction it will just have unit length it will be one unit long whatever the unit is so neural networks like numbers around one because when you multiply a bunch of those numbers the result will not become very huge or very small which is not good for graphic cards they cannot represent very big numbers or very small numbers so you want to keep your numbers around one zero minus one something like that now let's understand dot product so dot product shows you how similar two vectors are here on this image you can see that we have vector a and vector b and these two vectors are closer together which means they are more similar than here where there is 90 degree angle between them or here where they are very separate so these two vectors are most similar these are less similar these are least similar maybe i should say aligned instead of similar but word similar is also used very frequently maybe even more frequently so it's very simple we will multiply first dimensions so here we have a vector let's say three four and the second vector is one two to find dot product or similarity between them we multiply three times one so we multiply the first dimensions and then we say plus multiplying second dimensions four times two okay so three times one is three plus four times two is eight so three plus eight is eleven so dot product between these two is eleven so if our number is positive like 11 then they are pointing in roughly the same direction okay if the two vectors are perpendicular 90 degrees angle between them then the dot product will be zero okay so if you get the dot product zero you know that these vectors are perpendicular and this is useful in muon
optimizer this may be the new way to make neural networks learn a lot faster better i'm gonna talk about this in a different part of the course so just know that if vectors are 90 degrees between each other they will have dot product zero and if you have dot product zero that means the vectors are 90 degrees and if dot product is negative then the vectors are even further away than 90 degrees between them so they are even more different or dissimilar in python we can calculate dot product by just calling np.dot(v1, v2) vector one vector two so vectors v1 and v2 that we calculated manually have dot product of 11 now these two vectors three four and minus four three have dot product of zero which means they are perpendicular so v1 and v3 you can see that the angle between them is 90 degrees so they are perpendicular so dot product will be zero and then v1 v4 so v4 is actually minus three minus four so the blue and the yellow they are very different because they point in very different directions opposite directions so their dot product will be a negative number minus 25 it's gonna be a large negative number here because they are very different subtracting vectors is very easy if you have vector four three you want to subtract vector one two you will just subtract the corresponding elements so four minus one that's gonna be your first element which is gonna be now three and the second element will be a result of three minus two which is gonna be now one so the new resulting vector is three one we can also show that visually so if we have this blue vector which is four three and we have red vector which is one two we can draw a vector from the end of this one to the end of the first one and then move it back to the origin and that's gonna be our new vector which is gonna be three one our resulting vector so you will start from the second vector that you are subtracting going to the end of first vector from which you are
subtracting that's your new vector so then you can move it to the origin to see its real coordinates and that will be it for vectors and see you in the next lesson let's talk about gradients let's say we have some surface like this so at any point of this surface you would want to know where is in which direction to move to go up in which direction to move to go down this is used in neural networks to know if for example this surface would show error so you want to make errors small so you want to go down on this surface so how do you in which direction to move to go down on this surface now i know if you've never heard of loss function and in neural networks and this is confusing like how is this showing an error but we will talk about this later but just know that if you have some surface you're gonna use gradients to know which way is you need to move to move to go up which way you need to move to go down because the gradient will show you the direction of the steepest ascent so in this direction if you want to go up fastest so that means that in the opposite direction you're gonna go down fastest for this particular for this point where you are standing so very near around that point so that's what gradients are used for for every point on the surface they show you in which direction it's ascending increasing and so in neural network we will just go in the opposite direction to decrease because the neural network it will be used to represent this surface will be used to represent our error which we want to minimize the error and gradients are closely related to derivatives you know that derivatives are showing you the slope so in gradients you will also use derivatives to see the slope but we will use partial derivatives so let's see what those are in this case we have a 2d surface so two dimensions we have x dimension and y dimension and we want to take partial derivative with respect to x and partial derivative with respect to y so what does that mean so if 
we stand at any point if we just move let's say on on x direction positive x direction we increase x one two three four do we go up or down on this surface so we want to examine if we move on x so how when we change x do we go up or down are we ascending or descending so that's going to depend on where we stand at the surface so we will take partial derivative of this function this loss function or this surface with respect to x at that particular so we will know so when we plug in our particular x coordinate in which we are currently standing into that partial derivative with respect to x we will know if for example if we and when we increase x if the function will increase or decrease so there's the partial derivative with respect to x and it's partial because there is also y so we need so we have two partial derivatives with respect to x and with respect to y so then for that same position where we calculate if we increase x does the loss increase or decrease we will also calculate for y if we increase y does the loss increase or decrease so if we find that as we increase y coordinate the loss will increase then we want to go in the opposite direction we want to move in the y decreasing direction so let's say this is y increasing we want to move in the y decreasing so the loss will also decrease as we move so that's in the case if we find that partial derivative with respect to y is such that when we increase y the loss is increasing okay and it works the other way around if we find that if we increase y the loss is decreasing then we just want to keep moving in that y direction and so the x direction and the y direction will give us um well the gradient at that spot will show us the direction of the steepest ascent steepest increase so we will just go in the direction of the steepest decrease the opposite direction so these are gradients uh we're gonna see some examples to learn more about this you can also check some youtube videos this may take some time a 
few iterations a few times of going over this to finally grasp it the gradient will just be an array of numbers and the first number in this array will be partial derivative of the function with respect to x so it's just a derivative so if we increase x what's the change in the function what's the slope so it can be one two three minus one or minus one point five or whatever the number is so we have a derivative of the function with respect to x and derivative of the function with respect to y and i should say partial derivative so that's the gradient so if we have a function x squared plus y squared here these arrows will show you the direction in which it's increasing most from that particular point where you are standing so x squared plus y squared will look like this it will look like a bowl and so here at the bottom there will not be too much of a slope so these arrows will be small the magnitude the length will be small because there is not too much slope but as you go to the sides the slope will be higher larger and so will the magnitude so each of these vectors will show you how steep it is and it will also show you the direction in which it's steep the longer the vector the longer the gradient the steeper it is and it shows you direction in which it's steep so if you want to move down then you would go in the opposite direction of this arrow if you are standing at this place you want to go down you want to go opposite this is why it's called gradient descent in neural networks you want to go opposite of the gradient because the gradient shows the steepest ascent and you want to go down the loss function this is what x squared plus y squared looks like it's like a bowl now let's see how to calculate gradient of a simple linear function 2x plus 3y we have two variables so first in our gradient the first element will be partial derivative of the function with respect to x so we will treat y as a constant if you remember in derivatives when we treat y
as a constant then this will be this whole thing will be a constant so it will become zero it doesn't multiply x because right now partial derivative with respect to x will just treat x as the variable so now we will just have two x as the derivative of this oh sorry sorry two as a derivative of this and then when we do partial derivative of the function with respect to y then we just treat y as the variable and x as a constant so we have two x that's a constant that's whole thing is a constant it becomes zero so derivative of three y will just become three so here we have partial derivative of a function with respect to x is going to be two and partial derivative of the function with respect to y is going to be three so it's just gradients and derivatives in this case we don't have any variables in the gradient so we just have constant two and three which means for this function the gradient will always be same it doesn't depend on x or y the gradient is always the same so these arrows are always going to be the same they will always this function will always rise in this way it's like a plane it's always the same always rising in this way for the same amount is like a plane but if we have maybe a curvy surface like this so this surface is x squared plus y squared we will notice that at the bottom it's a bit flat so there is no so much rising or ascending or ascend so the vectors gradient vectors will be smaller but as we go up these values will grow quadratically they will grow faster and faster so as we increase x and y values the slope will be bigger and bigger and bigger so these gradient arrows will become larger and larger as i said the gradient will show the direction of steepest ascend so if we have this function x squared plus y squared if we find first of all the gradient there is gonna be 2x comma 2y so gradient partial derivative with respect to x will be 2x partial derivative with respect to y will be 2y so our gradient for this function will be 2x 
comma 2y i can show you here the function is x squared plus y squared this is how you can write square in python the gradient will be just 2x and 2y so let's see some point like two one so x is 2 y is 1 we're gonna get 2 times x is 4 and 2 times y is 2 times 1 is 2 so our gradient at point two one will be 4 2 and 4 2 will be our arrow right here but it's not shown fully but anyways it points to the steepest ascent so we want to take the opposite arrow to point it at the descent how to go down the function now let's learn about matrices this is extremely important this is how neural networks work this is the main thing about ai neural networks so this is a matrix it's an array of arrays okay so you can look at this first row as one array so one two this is the one array and then the second array is three five so that's one matrix so each of these numbers has its index so starting from zero so this is the zeroth row so the first row actually has index zero just like in normal programming this is a row with index one okay column with index zero column with index one and so this element has index zero zero and then this has zero one one zero because this is a row so first it's row row one and then column zero so one zero and then one one so if you want to add two matrices you just add corresponding elements at the same index so one plus two is three two plus five is seven etc this is how addition works matrix multiplication is the most important operation in neural networks in machine learning in ai so you have two matrices first of all this number of columns here three must be same as the number of rows here three so these three and these three must be same so we have one two three columns one two three rows if you swap these places you will get a completely different resulting matrix so you cannot swap places of matrices and get same result that's a different operation
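To make the indexing, addition and ordering rules above concrete, here is a minimal NumPy sketch. The matrix values are assumptions chosen for illustration (they are not necessarily the ones shown on screen), and `@` is NumPy's matrix-multiplication operator.

```python
import numpy as np

# Hypothetical example matrices; the values are assumptions for illustration.
a = np.array([[1, 2],
              [3, 5]])
b = np.array([[2, 5],
              [0, 1]])

# Indexing is [row, column], and both indices start from zero.
top_right = a[0, 1]      # row 0, column 1
bottom_left = a[1, 0]    # row 1, column 0

# Addition works element by element at matching indices.
total = a + b

# Matrix multiplication: swapping the order generally changes the result.
ab = a @ b
ba = b @ a
order_matters = not np.array_equal(ab, ba)

# Shapes (2, 3) @ (3, 4) are compatible because the inner dimensions match;
# the result takes the outer dimensions, (2, 4).
left = np.ones((2, 3))
right = np.ones((3, 4))
product_shape = (left @ right).shape

print(top_right, bottom_left)
print(total)
print(ab)
print(ba)
print(order_matters, product_shape)
```

Trying `right @ left` in this sketch raises a shape error, since the inner dimensions 4 and 2 do not match.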
this is how you multiply two matrices so you say you do a dot product between this row and this column and and they must be same length so three times zero plus one times two this is the classic dot product we talked about this earlier in the video plus zero times zero okay that's the formula here two so that's that number two goes at the intersection so it's gonna be a first row and first column so first row first column here that's this dot product result and then you will go for the second one you will go so you go first column sorry first row and second column here so you go three times four plus one times two plus zero times one and result goes here so you just put the result at the intersection so first row second column first row second column and at the end you will get a matrix that's that's size of these outside dimensions so two times two because this middle dimension will cancel out and this middle dimension must be same equal and this outside dimension outside doesn't need to doesn't have to be the same so if this was two times three this matrix and this matrix was uh three rows and four columns then it's possible to multiply because three and three inner dimensions are equal and the resulting would be two times four if this was four in python in python you would define matrix as array of arrays so this is the first row this is the second row here it's being indexed as one one so here indices are starting from one maybe to show like example to for you to understand it easily but usually indices start from zero zero first row second row just like here so this is the first row this is the second row and this has two columns as well so two rows two columns and so that's how we can define matrix in python with for example numpy we can also do it with pytorch and stuff and this is example of three by two matrix so three rows one two three and two columns now these rows and columns are gonna take some time for you to understand intuitively quickly what's 
happening like two by three matrix two rows and three columns so we can print shape of the matrix so a matrix it's this one so shape here is two by two and then we can print shape of the first row and shape of the first column so let's see shape is two by two rows two and columns two it's maybe easier to understand on this example because here rows and columns are different shapes so shape is two by three rows two columns three so if you say uh print a dot shape zero that's gonna be the first number in your shape so it's gonna be this number two and shape one of one with index one will be the second number three uh maybe you want to play with this a little bit to see what happens so this is gonna show the number of columns this is gonna show number of rows and so if you have a matrix one two three four five six you can access individual elements by just calling the index so this index of this would be zero zero so if you say a of zero zero that will give you this value which is in this case the value is one if i put it like this it's gonna be 10. 
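A short sketch of the shape and indexing steps described above, assuming NumPy and the two by three matrix one two three four five six. The final assignment to 10 is an assumption about what the on-screen edit does.

```python
import numpy as np

# Two rows, three columns.
a = np.array([[1, 2, 3],
              [4, 5, 6]])

n_rows = a.shape[0]             # first number in the shape
n_cols = a.shape[1]             # second number in the shape

first_row_shape = a[0].shape    # a single row has 3 elements
first_col_shape = a[:, 0].shape # a single column has 2 elements

first = a[0, 0]                 # element at row 0, column 0

# Assumption: writing through the same index changes the stored value.
a[0, 0] = 10

print(a.shape, n_rows, n_cols)
print(first_row_shape, first_col_shape)
print(first, a[0, 0])
```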
so this value has index zero zero and that's how you can access the elements i'm actually gonna talk more about this indexing and slicing later it's more for maybe pytorch tutorial so let's go on to matrix addition i already explained this so here you would just add each element with the corresponding element at the same index so one plus five is going to be six the resulting will be six and then two plus six result will be eight so same index scalar multiplication you just multiply so if you say two times matrix a you will just multiply each of these elements with two so this will become two four six eight so let's see the result here two times a you will just multiply with two or 0.5 times a you will just multiply each element with 0.5 so 0.5 1 1.5 2 and minus one is the same just each element times minus one matrix vector multiplication is just like matrix multiplication so you first need to make sure that the inner dimension is same so two times three so this three must be same as this three three times one in this case you can matrix multiply matrix a with vector v but you cannot matrix multiply vector v with matrix a because then this inner dimension would not be same so you can first put this first and then this second and then that's how you can matrix multiply it and then the result will be two times one because the inner same dimension will cancel out and the step-by-step calculation goes like this one times two plus two times three plus three times four and then that's gonna be 20 and then four times two plus five times three plus six times four that's gonna be 47 so that's the resulting vector so you multiply this row with this row now this could be a bit confusing because i was showing you that you multiply row and the column but depending on how you do it how you code it how you imagine it this can also be done this way because this is just a vector so in matrix multiplication the way we imagine it conceptually is row times column but since this is just a
vector we can also imagine this as a column vector being two three four and then this row and this column vector so this is a bit confusing you're gonna have to do this by yourself by hand try to multiply try to figure out try to get intuition try multiplying this try multiplying others maybe get some other examples from ai from chatgpt so the matrix multiplication example here i explained it so you can use this @ sign as matmul or you can use np.dot which is a bit confusing because this implies dot product but for 2d matrices this is actually matmul and so here you have step by step on how to multiply two matrices but this is just something you would need to understand and i saw sam altman explaining something about matrices and he was showing this with his arms or hands so that's how i know that he also knows how to multiply matrices i was wondering if it was in his job description as ceo of openai to be able to multiply matrices i was always wondering that this is transposing matrix so we have one two three four five six we transpose that so we swap rows and columns so now this column became the row so one four now became the row two five now became the row and the shape is swapped now then we have identity matrix where it's one on the diagonal and others are zero and it's a square matrix it has equal number of rows and columns and when you multiply any matrix with identity matrix it will remain the same matrix so the result will be the same as this matrix that you multiplied with the identity matrix and in python you can create identity matrix by saying np.eye and then how many rows and columns so two or three or whatever the number is so this is two by two this is three by three it can be any number so identity matrix acts like the number one in multiplication so it doesn't change the thing it's multiplying here are some properties you can analyze them if you want i just want you to understand that in matrix multiplication you
cannot swap the order of matrices this is not generalizable it's not true in general there might be specific cases when it's true but in general it's not true that's gonna be it on matrices if you understand this you understand everything you need to know about matrices as your fundamentals so let's check out the last lesson as well lastly let's learn about probability this is extremely important it's used in large language models reinforcement learning computer vision natural language processing everywhere so you have a coin let's assume it's a fair coin you flip it you have 50 chance to get heads 50 chance to get tails or you roll a dice six sides so you have one over six chance for any particular side to appear to be rolled probability measures how likely an event is to occur zero means impossible zero percent chance probability one means certain hundred percent chance of probability always happens and 0.5 means 50 50. so let's calculate probability of flipping a coin and getting heads so we have two possible outcomes heads and tails and but we are calculating probability for just one of those outcome heads so we say heads outcomes can be just one of those two divided by the total outcomes which is two which is gonna give us one half or 0.5 or 50 percent so for rolling a dice it's same we just have one over six now because one side out of six possible but in this case we are assuming that each of these six sides has equal probability to happen so then you can divide uh one over six to get your probability same goes for heads if each of the heads or tails has equal probability then you can divide here if we flip the coin ten times we get seven times head three times tails so theoretical probability is 0.5 but actual probability measured i should say is 0.7 or 70 percent but as we flip more coins the measured probability will get closer and closer to theoretical so for 100 coins we get 0.39 and for 10 000 coins we get 0.5024 so this one experimental is very close 
to the theoretical so the more flips or more samples the experiment probability measure probability will get closer to the theoretical and so if we try that with dices we will get over with thousand rolls we will get close to the theoretical of 0.1667 and here the orange one shows theoretical probability of each dice face and it's always the same but the measured one is going to be a bit different but if we do a lot more measurings then this measured one will get closer and closer to the theoretical now what is the chance that if we roll two dices one dice has gets give us one and the second dice gives us six so it doesn't matter which dice gives which number we just need to get one and six and the probability there would be by multiplying those two probabilities so it's going to be a lot smaller probability to get both of them now what is the chance that in those two dices we get one or six but not both of them just either one or six so it can be one on the first dice or one on the second dice or six on the first dice or six on the second dice so there are four possible ways to satisfy this but if we get one and six that's not what we want so so we will subtract the probability of getting one and six because that's not what we're looking for and so the probability of getting one is this and probability of getting six is the same and then subtractive we get both one and six and this is the probability that we get one or six if we roll two dices here we have probability distribution so this is the task if we roll two dices what is the probability that the sum of the numbers of each dices like we get one and six so the sum is seven so what is the probability of each of these sums for example if the sum is two it means we got one and one if the sum is seven it means we got three and four and four and three and two five five five two one six six one so you can see that we can get seven in a lot more different ways than for example two because we can get two as a sum if 
only if both of the both of the dices are one but seven we can get in a lot more different ways so probability of the sum being seven is a lot higher and that's the probability distribution overall of these sums so what is the expected value of a dice roll uh we know that dice can show us numbers one two three four five six and each of these numbers has one over six chance to happen so what is the average or expected number we will get by just rolling a bunch of times you can maybe intuitively understand that the average number will be 3.5 because this is exactly in the middle of one and six but we can calculate expected value by multiplying the value we will get so one two three four times its probability one over six is the probability for each of these values and then sum up so sum up and divide by the number to make to do the to find the average or mean but this is very important this summing up of all of the values and dividing by the number is just because every value has this equal probability to happen that's why we can pull them into a bracket and then divide by this common denominator is the same for every one but if these probabilities were different not all are like saying one over six then we would not be able to like sum up the values and divide by six so summing up and dividing only works if the probabilities are same so that's like a shortcut you can do but this is more general formula you just do each value times this probability and that's but this is more specific this works if probabilities are same then we have conditional probability so probability of a happening given that b happened you can calculate this by this formula so probability of a and b divided by probability of b it's easier to understand this of an in an example so given that we roll an even number what's the probability that that number is six so first of all probability to roll an even number is 0.5 because we have three odd three even numbers on a dice and probability of 
rolling six and even so you need to think a little bit so if you roll six it will also be even so here we're just gonna use the probability of rolling six because if we roll six it will also be even so that's the probability and so the formula let's first look at intuition so we have three of these so if we roll an even number then because each of these even numbers has same probability then it's one third one third and one third to be each of these numbers because they have same probability so that's what we will get in the result one third or 0.3333 and if we check the formula it will actually come out as one third so the probability of getting six given that we got even number is one third and this is the law of large numbers so if you have theoretical probability 0.5 you're flipping the coin in the beginning where you just flip maybe 10 times maybe you will actually get seven heads and three tails but as you flip more and more the measured probability will converge towards the theoretical probability so that's the law of large numbers if you have large numbers they will converge towards theoretical probability now this is also interesting let's say you are tossing a coin 10 times what's the probability that out of those 10 times you get five heads i actually recommend you go to my channel and here in playlists you check out this become ai researcher probability because this is a playlist of going into probability in a lot more details and you can check just first video or first and second those contain most important things so random variable standard deviation variance and then continuous random variable probability density function so maybe these two first videos you can check that's gonna be it on this part on maths you can join my school to become ai researcher link below the video or you can find it on my channel here on my channel get a seven day free
trial if you join right now in this lesson we will learn about pytorch necessary for ai research after this short course you will be able to understand and code any neural network any ai research in pytorch so i presume basic understanding of python you can learn that on youtube if you never use python so here i'm gonna go into extensions i am on vs code or cursor and i'm just gonna say collab and i'm gonna install this google collab if i want to run my code on google collab from this local environment you can also open these notebooks on google collab or you can just run them locally but here i'm gonna run on google collab from my local environment but you can have whatever setup you want so i will choose select another kernel here so i can click here collab or select another kernel and collab and then i can start a new collab server i'm gonna choose cpu and just press enter for so here i'm just gonna use google's cpu and then i'm gonna choose this python kernel and let's see it should work and it works i chose cpu because i can use it for longer than gpu and if you don't have pytorch installed like you run this that says no module named torch then you would just install so you would pip install torch in your console in your cmd right here pip install torch or in the google collab or wherever so let's learn about tensors so you can create a tensor tensor is like an array it can also be array of arrays or it can be array of arrays of arrays so any a number of nested arrays within arrays so here we have a 2d tensor i'm gonna zoom in so it's like a matrix so this is the first row this is the second row and the data type is float32 here so this is from python you can self-study what this is this is from python so this is kind of a prerequisite but if you don't know you can ask chatgpt all of these things that i skip so if i go ahead and print all of this this is the tensor that we have this is its shape so we have two rows and each row has uh three columns or so one 
two three one two three so there is three columns in total so two rows three columns we can flatten this tensor which means we will just convert it into a single array and so it's gonna go like this one two three four five six so you will just put rows next to each other and it's gonna be a single array it will not be two arrays like here so the shape is now just six because there is six numbers in this one array and this flattening is done with tensor.flatten and t is our tensor remember so t is our tensor and then we can also specify this start_dim to be zero in this case this will look the same as without specifying this start_dim zero for me to explain what this start_dim does let's look at a different tensor it's a 3d tensor so it's two two two so inner arrays have two elements there is two of these and then there is two of these outside so it's two by two by two so if i say flatten but start_dim is equal to one so we will actually preserve the dimension zero and we will start flattening from dimension one so what that means is we will not have one big single array instead we will preserve these outermost two elements and then we will flatten within them so we will flatten within these outer elements so it will be one two three four because we are preserving dimension zero and we are starting from dimension one so you can play around with this and try to change it and understand what's happening here try to make bigger tensors try to put some other dimensions now let's see reshaping so we have a tensor one two three four five six seven eight and the shape here is gonna be two rows and four columns so two four and so we wanna reshape it to be four two instead so let's see what it's gonna do it's gonna group like this one two three four five six so now we're gonna have four rows and two columns and it's gonna just start from the beginning so it's gonna group
These two, these two, these two, these two. Also, if we have a tensor of shape two by three by four, I want you to understand what here is two, what here is three, and what here is four. You need to think about it: this is four, then there are three of them, and then there are two of these big ones. We want to reshape this tensor to six, four. What is that going to do? It's going to create six rows, each row having four elements, so four columns. In that case this four is preserved, and the two and three are merged into six. You see that you have to have an equal number of elements at the end, so two times three times four must be equal to six times four. You can play with this and try to understand how it works if you change it a little bit. Now there is also view, which is similar to reshape, but I'll show the difference. If we have a tensor like this, we say we want to view it as four rows and two columns instead, so this groups one two, then three four, then five six and seven eight. This is interesting: we created a new tensor, t_view. We have the original tensor t, and t_view is the new tensor, but if we change t_view it will also change the original tensor, because t_view is actually using the same memory slots as the original tensor. Keep that in mind; it can be useful in some situations. We see that the first element is one in both of them; then we set the first element of t_view to 99, and the first element of the original now also becomes 99. view requires the tensor to be contiguous in memory, which means all of the elements need to be next to each other in memory. reshape works whether or not the tensor is contiguous, and when it has to, it will create a copy.
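The difference between view and reshape can be sketched as follows. This is an illustration under assumptions: `torch.arange` is only used here as a quick way to get 24 numbers, and the transposed tensor `nc` is one convenient example of a non-contiguous tensor, not something taken from the lesson's notebook.

```python
import torch

# 2 * 3 * 4 = 24 elements can be regrouped as 6 * 4 = 24 elements.
big = torch.arange(24).reshape(2, 3, 4)
merged = big.reshape(6, 4)          # dims 0 and 1 merge, the last dim is kept

# view() returns a tensor that shares memory with the original.
t = torch.tensor([[1, 2, 3, 4], [5, 6, 7, 8]])
t_view = t.view(4, 2)
t_view[0, 0] = 99
print(t[0, 0])                      # tensor(99): the original changed too

# A transposed tensor is not contiguous, so view() refuses it,
# while reshape() falls back to making a copy.
nc = t.t()
print(nc.is_contiguous())           # False
try:
    nc.view(8)
    view_failed = False
except RuntimeError:
    view_failed = True
copied = nc.reshape(8)              # works, but no longer shares memory with t
```

Because `copied` is a separate copy, writing into it leaves `t` untouched, unlike writing into `t_view`.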
A copy of the tensor only if necessary, that is, if the elements are not next to each other in memory. Join my school to become an AI researcher; we have all of these courses that will lead you step by step, and you get a seven-day free trial if you join right now. Let's check squeezing. We have a tensor that has elements one, two, three, but there is another dimension around it, and another dimension around that. So the shape of this tensor is one, one, three, because the inner array has three elements, that array is the single element of the outer array, and that array is the single element of the outermost array, you see here. We can use this command, t.squeeze(): we create a new tensor, t_squeezed = t.squeeze(), and this removes all of the dimensions that have size one. So now we are left with just this array, or tensor; the outer dimensions that have one element are removed. That's what squeezing does. We can also specify at which dimension we want to squeeze. Look at this tensor: can you tell me what its shape is? Each number is an element within its own array, then there are three of those one-element arrays, and then there is this outside dimension. So the shape is one, three, one: there are three of these, each of them has one element inside, and there is one outside dimension. If I say I want to squeeze dimension zero, this removes the outermost dimension. Let's see what happens: it has removed the outermost dimension, so we just have one array of three small one-element arrays. You can play with this to understand how it works. I can also squeeze at dimension two, so t2.squeeze(2). Remember, t2 is our original tensor; I'm not squeezing the already squeezed result, I'm squeezing the original tensor again.
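A short sketch of squeeze, assuming the two tensors look like the ones described (shapes 1 x 1 x 3 and 1 x 3 x 1); the exact values in the notebook may differ.

```python
import torch

t = torch.tensor([[[1, 2, 3]]])          # shape (1, 1, 3)
t_squeezed = t.squeeze()                 # drops every dimension of size 1
print(t_squeezed.shape)                  # torch.Size([3])

t2 = torch.tensor([[[1], [2], [3]]])     # shape (1, 3, 1)
outer_removed = t2.squeeze(0)            # only dimension 0 is removed
print(outer_removed.shape)               # torch.Size([3, 1])

# Squeezing a dimension whose size is not 1 leaves the tensor unchanged.
unchanged = t2.squeeze(1)
print(unchanged.shape)                   # torch.Size([1, 3, 1])
```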
So, squeezing dimension two here: you see, every element was originally in its own array, but I'm going to remove that dimension two, so these elements will no longer be in their own arrays and we just have the elements in this one row. Now the shape is one, three, because we removed that last dimension. unsqueeze does the opposite: it adds a dimension of size one. If we have some tensor one, two, three, four and we say unsqueeze at dimension zero: the previous shape was four, and unsqueezing at dimension zero adds one dimension at position zero with size one, so the shape becomes one, four. That means we have another array outside of the original tensor. We can also unsqueeze at dimension one: then the shape will be four, one, because it adds the dimension after the existing one, and that means each element will be in its own brackets, in its own array of size one. This is the difference between unsqueeze at zero and unsqueeze at one; you can play with this to understand how it works and what it does. You can also unsqueeze at minus one, which adds a dimension at the end. So if you don't know the actual shape of your tensor but you just want to put every element in its own array, you can unsqueeze at dimension minus one. Even if every element was already in its own array, it will add one more array within that array, so every element just gets brackets around it. This is good if you have some array and you need to separate it to do some operations on every single element. That's how it works. We are now at the first lesson here, the first notebook, and I'm going to show you how to create different types of tensors. We can do torch.zeros(5): that just creates a tensor of five zeros.
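The squeeze at dimension two and the three unsqueeze variants described above, as a hedged sketch; the tensor values and the names `row`, `col`, `last` and `m` are chosen for this illustration.

```python
import torch

t2 = torch.tensor([[[1], [2], [3]]])     # shape (1, 3, 1)
inner_removed = t2.squeeze(2)            # the innermost size-1 dimension goes away
print(inner_removed.shape)               # torch.Size([1, 3])

v = torch.tensor([1, 2, 3, 4])           # shape (4,)
row = v.unsqueeze(0)                     # shape (1, 4): one array around the original
col = v.unsqueeze(1)                     # shape (4, 1): each element in its own array
last = v.unsqueeze(-1)                   # shape (4, 1): new dimension at the end

# With -1 you do not need to know how many dimensions the tensor has.
m = torch.zeros(2, 3)
print(m.unsqueeze(-1).shape)             # torch.Size([2, 3, 1])
```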
Five zeros; but whatever shape you put here, like three comma four, it will create a tensor of that shape with all zeros: three rows, each row with four elements, or four columns. We also have ones, and full, where we can specify the value. So torch.ones(5) creates an array of five ones, or you get a matrix of ones with the given shape, and full gives a matrix where each value is, for example, five. Let's see what it looks like. You can also convert a NumPy array to a tensor: if you have a NumPy array you just say torch.from_numpy to convert it into a tensor, and you can take a tensor and say .numpy() to convert it back to a NumPy array. If you check the classes, it's either torch.Tensor or numpy.ndarray. When you use torch.from_numpy, the tensor shares memory with the NumPy array, so when you change this new torch tensor it will also change the original NumPy array. If you clone, that creates a separate copy, so then the NumPy array and the tensor are separated. You can also create tensors with random values; this is used in neural networks when initializing weights. You just say torch.rand and then the shape, and you can also use torch.randint to get integers between zero and ten, where zero is included and ten is excluded. The first example here, torch.rand(3, 4), gives us numbers between zero and one, in three rows with four columns each, and the second, randint, gives us integers between zero and nine. We can also create random values from a normal distribution where the mean is zero and the standard deviation is one, with torch.randn. These numbers should be bunched up around zero.
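A sketch of the creation functions mentioned above. The shape passed to `torch.full`, the NumPy values and the fixed seed are assumptions made for this example.

```python
import numpy as np
import torch

zeros = torch.zeros(3, 4)            # 3 rows, 4 columns of zeros
ones = torch.ones(5)                 # five ones
fives = torch.full((2, 3), 5.0)      # every value is 5.0

# from_numpy shares memory with the NumPy array.
arr = np.array([1.0, 2.0, 3.0])
shared = torch.from_numpy(arr)
shared[0] = 100.0                    # arr[0] is now 100.0 as well

# clone() makes an independent copy.
separate = torch.from_numpy(arr).clone()
separate[1] = -1.0                   # arr[1] is still 2.0

back = shared.numpy()                # tensor back to a NumPy array

torch.manual_seed(0)                 # only so that reruns give the same numbers
u = torch.rand(3, 4)                 # uniform values in [0, 1)
k = torch.randint(0, 10, (3, 4))     # integers 0..9, the upper bound is excluded
n = torch.randn(3, 4)                # normal distribution, mean 0, std 1
```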
Close to zero, mostly: the further away you go from zero, in the positive or negative direction, the lower the probability. You can get any number here, but the probability decays very quickly as you move away from zero, so you usually won't get numbers that are too far away. But you see we do get some numbers larger than one or so. You can also use this trick: torch.rand gives you values between zero and one, but you can multiply all of those values by 10, so then they will be between zero and ten. Here we can see that the values are between zero and ten. You can also print the data type of the tensor: it's float32 if you initialize it like this, but you can also specify the data type to be int64 or float32. You can also convert: if you have an int64 tensor, calling .float() on it converts it to float32. For the next one you need to know how matrix multiplication works in mathematics, so you can watch my video, Math for AI Research, or just search YouTube for how to multiply matrices. If you know how matrices are multiplied, the dot product between this row and this column, then we can go on. This @ sign is the operator to multiply matrices, and you can also use torch.matmul, by the way. If you have this matrix and this matrix, the result will be this matrix, and of course the result is the same whether you use the @ sign or torch.matmul. One thing you need to know: when you multiply a matrix and a vector, it works the same way and you get some other vector. The rule in maths is that when you multiply a vector by a matrix, the matrix will rotate and squish or stretch the vector. That's what matrices do. This is important in neural networks if you want to do theory or research.
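A small sketch of scaling random values, converting data types, and matrix multiplication. The matrices `A`, `B`, `R` and `S` are illustrative choices, not the ones shown on screen: `R` is a 90-degree rotation and `S` stretches one axis while squishing the other.

```python
import torch

scaled = torch.rand(3, 4) * 10           # values in [0, 10)

ints = torch.tensor([1, 2, 3])           # Python ints give int64 by default
floats = ints.float()                    # converted to float32

A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[5.0, 6.0], [7.0, 8.0]])
C = A @ B                                # row-times-column dot products
same = torch.matmul(A, B)                # identical result

# A matrix acting on a vector rotates and/or stretches it.
R = torch.tensor([[0.0, -1.0], [1.0, 0.0]])   # rotates by 90 degrees
S = torch.tensor([[2.0, 0.0], [0.0, 0.5]])    # stretches x, squishes y
rotated = R @ torch.tensor([1.0, 0.0])        # points along y afterwards
stretched = S @ torch.tensor([1.0, 1.0])      # becomes (2.0, 0.5)
```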
Theory, research and so on: just know that when you multiply a vector by a matrix, it's going to stretch or squish and possibly rotate the vector; it may not rotate it. By the way, I will go to the third lesson, the third notebook here in PyTorch, so I'm going to skip some of the matrix stuff that's not so important. If we have a tensor, in this case a matrix, we can transpose it by using a.T, and I'll show you what transposing means. We can also use a.t() with brackets, but only for 2D tensors. This is what it does: it swaps rows and columns. One, four was a column and now it becomes a row; two, five was a column and becomes a row; and you see that one, two, three was a row and now it's a column. So transposing swaps rows and columns. Let's see a 3D tensor: one, two, three, four is one matrix, and there are two such matrices. We want to transpose the zeroth dimension, which is the outer dimension, and the last dimension, which has index two. That's going to be a little bit weird. Now one and five become this first row: the first elements of the two outermost blocks become elements of the innermost dimension, because we swapped those dimensions, and elements that were next to each other in the innermost dimension, like one and two, are now split across the outermost dimension. This one requires some intuition, some playing around, to understand how this transposing works. We can also rearrange all of the dimensions; in this case we are reversing the order, but you can put any order you want. Let's see what happens when we reverse the order of dimensions: the original is one, two, three, four, five, six, seven, eight, but then we get one five, three seven, and two six, four eight. It's a little weird. This is something you just need to look at for a while.
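The transposes described above can be sketched like this. The lesson does not name the function used to rearrange dimensions; `permute` is assumed here because it is the standard PyTorch call for that.

```python
import torch

a = torch.tensor([[1, 2, 3], [4, 5, 6]])     # shape (2, 3)
a_T = a.T                                    # shape (3, 2): rows become columns
print(a_T)                                   # rows [1, 4], [2, 5], [3, 6]
a_t = a.t()                                  # same result, 2D tensors only

b = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # shape (2, 2, 2)
swapped = b.transpose(0, 2)                  # swap the outermost and innermost dims
print(swapped)                               # [[1, 5], [3, 7]] and [[2, 6], [4, 8]]

# Reversing all three dimensions moves the same two axes, so here it
# gives the same tensor as transpose(0, 2).
reversed_dims = b.permute(2, 1, 0)
```

The rule behind the printed numbers is `swapped[i, j, k] == b[k, j, i]`, which is easy to check by hand for a few positions.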
Kind of look at it and get the intuition for which numbers swapped and why. So, indexing and slicing. If we have a tensor one, two, three, four, five, six, seven, eight, nine, then we can index it. This element has indices zero, zero, because it's in row zero and column zero; indexing starts from zero, not from one. Then this element has one, one: this is the zeroth row, first row, second row, and column zero, one, two. And this one is one, two, because it's row one and column two. We see that t[0, 1] is two: zeroth row, then column one, and that's two. You can also access an entire row, so this gives you the entire row: one, two, three. You can also get an entire column, but you need to put this colon sign here, which means: get all of the rows, but only in the zeroth column. So one, four, seven is our new tensor. And here, all of the rows in the column with index one: two, five, eight. Now let's see slicing. If you say t and then zero-colon-two, this gives you the rows starting from row zero, inclusive, until row two, exclusive. So it will just give you these two: one, two, three, four and five, six, seven, eight. Let's see: one, two, three, four, five, six, seven, eight. I'm going to go a bit quicker now. This one goes from row one, inclusive, until the end, so it gives you these two rows; you can check below. This one goes from the beginning until two, excluding two, so it will give this one and
this one, and it will not give you row two: this is index zero, one, two, and it goes from the beginning until two but not including two. Okay. Then you can also select columns. Because you have a comma here, the part before the comma is where you select rows, and this colon means all rows, from the beginning to the end; in this case the end is also included. Then, after the comma, from zero to two in columns, where two is not included. So we get all of the rows and columns zero and one, and it should give you one two, five six, nine ten. Let's see: one two, five six, nine ten. Okay. We can also get all of the rows, and columns from two until the end; that gives, let me see, three four, seven eight, and so on. Yeah, that's it: columns from two, including two, to the end. And you can also pick like this: rows from zero to two, excluding two, and columns from one to three, excluding three. So this gives you the middle block of columns for the first two rows. You can also create a boolean mask; let me show you what that is. If you have a tensor one, two, three, four, and so on, you can say mask = t > 5. What that gives you is a tensor of the same shape with True wherever the value is greater than five and False everywhere else. So one, two, three, four, five are False, and six, seven, eight, nine are True. If I put the number seven here instead of the number one that was there originally, that position would also be True, because that element would also be greater than five. Then you can use the mask that you just created and index the original tensor with it: t[mask].
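A sketch of the indexing, slicing and masking steps. The 3 x 3 and 3 x 4 tensors are reconstructed from the numbers read out in the lesson, so treat their exact contents as an assumption.

```python
import torch

t = torch.tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(t[0, 1])            # tensor(2): row 0, column 1
print(t[0])               # entire row 0: [1, 2, 3]
print(t[:, 0])            # all rows, column 0: [1, 4, 7]
print(t[:, 1])            # all rows, column 1: [2, 5, 8]

s = torch.arange(1, 13).reshape(3, 4)   # rows [1..4], [5..8], [9..12]
first_two_rows = s[0:2]   # rows 0 and 1, row 2 is excluded
from_row_one = s[1:]      # row 1 to the end
left_cols = s[:, 0:2]     # all rows, columns 0 and 1
right_cols = s[:, 2:]     # all rows, columns 2 and 3
middle = s[0:2, 1:3]      # rows 0..1, columns 1..2

mask = t > 5              # boolean tensor, same shape as t
selected = t[mask]        # 1D tensor of the elements where mask is True
```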
It will give you all of the elements that have True assigned to them, like so: six, seven, eight, nine. You can also do this conditional indexing directly: t where the value is greater than five, or t where the number is even, so the value divided by two gives remainder zero. These give us the numbers greater than five and the even numbers. You can also modify values like this: t where the value is greater than 50, set to zero. So everywhere in this tensor t where the value is greater than 50, the value is replaced with zero. Let's say we have values like 100, 200, 300, 400 here: they all get replaced with zero. Now let's learn about concatenating tensors. If you have two tensors, one, two, three, four and five, six, seven, eight, you can concatenate them, or join them, in dimension zero or in dimension one. Let's see what both of those do. Concatenating in dimension zero just appends the rows: one two, three four, then it continues with five six and seven eight. So the new shape is four, two; the first dimension, dimension zero, got concatenated. If we concatenate on dimension one, the second dimension, it goes like this: one, two, five, six, so the first row here and the first row there are joined, and then three, four, seven, eight, and now the shape is two, four. This also requires some playing around and testing; you need to see what happens and try to understand it yourself. You can also concatenate three tensors; it's done with torch.cat, I forgot to say. With three tensors it does the same thing, one, two, three, four, five, six, and so on, and now the shape is six, two. You can also use torch.stack: if you have two matrices, two by two and two by two, you can kind of
just put them into another tensor: the first matrix will be the first element and the second matrix will be the second element. So now the shape is two, two, two, because you joined these two matrices into another, outer array. That's what you can do with torch.stack on tensor one and tensor two along dimension zero. Now I'm in the last, seventh lesson; let's see some more tricks. We have the identity matrix: if you say torch.eye(3), this gives you the identity matrix, with ones on the diagonal and zeros elsewhere. If you check my math course, you will see that multiplying any matrix by the identity matrix gives you that same matrix; it's like multiplying by one. torch.arange(10) gives you the integers from 0 to 9, because the end is excluded. Let's check here: 0 to 9. torch.arange(2, 8) gives you the integers 2 to 7, because the last value is excluded: 2 to 7.
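A sketch of conditional assignment, torch.cat, torch.stack, torch.eye and torch.arange as described above. The values in `vals` are invented for this example; `a` and `b` follow the numbers from the lesson.

```python
import torch

# Conditional assignment: every value greater than 50 becomes 0.
vals = torch.tensor([10, 60, 100, 200, 30])
vals[vals > 50] = 0                      # now [10, 0, 0, 0, 30]

a = torch.tensor([[1, 2], [3, 4]])
b = torch.tensor([[5, 6], [7, 8]])
rows_joined = torch.cat([a, b], dim=0)   # shape (4, 2): b's rows appended below a
cols_joined = torch.cat([a, b], dim=1)   # shape (2, 4): rows [1, 2, 5, 6], [3, 4, 7, 8]
stacked = torch.stack([a, b], dim=0)     # shape (2, 2, 2): a new outer dimension

identity = torch.eye(3)                  # ones on the diagonal, zeros elsewhere
m = torch.arange(9, dtype=torch.float32).reshape(3, 3)
unchanged = m @ identity                 # same values as m

first_ten = torch.arange(10)             # 0..9, the end is excluded
two_to_seven = torch.arange(2, 8)        # 2..7
```

Note that cat joins along an existing dimension while stack creates a new one, which is why the stacked shape has three dimensions.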
You can also say arange from 0 to 10 with step 2; this goes 0, 2, 4, 6, 8, and 10, the last one, is excluded. You can say step 0.5, so it goes 0, 0.5, 1, 1.5, until 4.5. You can also set the data type to float32, and then it gives you values like 0., 1. and so on; these are floats. What you will often do is arange and then reshape: this gives you the numbers 0 to 11, so twelve numbers, and then you reshape to three, four, so 3 rows and 4 columns, and this is what it looks like. You're going to do this often. linspace gives you the following: if you say 0 to 1 and 5 values, it gives you five evenly spaced values including both endpoints, so 0, 0.25, 0.5 and so on up to 1. You can also say from 0 to 10 with 10 evenly spaced numbers, and you see that in linspace this 10 is also included. Previously, with arange, the end was excluded, but here it's included; you just need to memorize that that's how it works. From minus 5 to 5 with 11 numbers: in this case you just get the integers. And this is interesting: if you say torch.empty and then the shape, this creates a tensor in memory but does not set any values; it just allocates the space. So whatever values were already in that memory, the tensor will have those values. You see we have some garbage values here, because we took the memory but didn't change anything. That's going to be it for this video. Check out the GitHub, check out the previous course on math, join my school, and see you in the next video. Let's learn about neural networks. After this lesson you will be able to code your own neural network from scratch; you will understand how it works and you will be able to build it further into projects, AI research, and so on. Make sure to have PyTorch installed, and NumPy and Matplotlib as well. You can find this repository on the GitHub below this
video. So this is a neural network, but let's first focus on a single neuron. Let's look at just this first blue circle, the first white circle and this output, and imagine none of the others exist. The input would be some number, and it's multiplied by some weight. This weight is learned by the neural network, so the weight is also some number, and when the input is multiplied by this weight, it gives some output. So the neural network transforms this input into an output; it predicts something. The weight that multiplies the input and transforms it into the output is learned during training: what is the best weight that gives us the correct output for any input? That's how a neural network learns and then generates or predicts. Let's check a code example. Let's say we have inputs 2 and 3, and these would be the neural network weights, like 0.5 and 0.3, that are already learned; we'll see later how they are learned. We could also have a bias: it's just a single number that is added, one bias for the entire set of weights of a neuron. You see here the output, the weighted sum: it's this weight times this input, plus this weight times this input, and at the end we add the bias. The bias is also learned. If you know functions, the bias lets the function shift left and right, and the weights determine the slope; this is in my math course for machine learning that I published earlier. In this case this is just a dot product between inputs and weights, and the weighted sum is two. Because this is a linear function, if we want to make it curved we need to pass it through some activation function, for example sigmoid. Now, when we pass this weighted sum through some
function like sigmoid, the formula looks kind of weird, but let's ignore why it's like this and just imagine that sigmoid makes this function curved. If you have some data points here and some data points there, you want to separate them with a curved function; a straight line will not separate messy data points like these. So you need to be able to curve the function to separate the data points, and that's what the activation function does: it takes this weighted sum, which is a line, and makes it curved. At the end, for these specific numbers, we get this activation, or output. So this here is the sigmoid function. We also have other activation functions like ReLU, and their purpose in neural networks is to curve the lines, because what we get from just multiplying by weights and adding a bias are straight lines. Let's say we want to separate this red data set: we cannot do it with a straight line, because the line just crosses over it. We need a curved line that goes around it, like this. That's why we have activation functions, and that's how the neural network is able to learn to separate these data sets. But let's see some more examples. Join my school community to become an AI researcher; link below, or you can find the link on my channel in the about section. Usually you would first initialize a neuron. The weights are initialized as random numbers from a normal distribution, which I explained in the previous PyTorch lesson on my channel, and then made very small: the raw values are mostly within about one of zero, sometimes a bit more, and we want to make them all very small, so they end up around 0.1 in size. The bias starts at zero, because the bias is added, so in the beginning we add zero. But you don't want to initialize the weights at zero, because weights multiply, so they would just turn everything to zero, and that's not good. You want to initialize each weight with some nonzero number.
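A minimal sketch of a single neuron as described so far, not the notebook's actual code. The class and method names are assumed for illustration, and the bias of 0.1 is an assumption chosen so that the weighted sum comes out to the 2 quoted in the lesson (0.5 * 2 + 0.3 * 3 + 0.1).

```python
import torch

class Neuron:
    """One neuron: weighted sum of the inputs plus a bias, then sigmoid."""

    def __init__(self, n_inputs):
        # Small random weights; the bias starts at zero.
        self.weights = torch.randn(n_inputs) * 0.1
        self.bias = torch.tensor(0.0)

    def weighted_sum(self, inputs):
        # Dot product between weights and inputs, plus the bias.
        return torch.dot(self.weights, inputs) + self.bias

    def activate(self, inputs):
        # Sigmoid squashes the weighted sum into the range (0, 1).
        return torch.sigmoid(self.weighted_sum(inputs))

neuron = Neuron(n_inputs=2)
# Overwrite the random values with the example numbers (bias assumed).
neuron.weights = torch.tensor([0.5, 0.3])
neuron.bias = torch.tensor(0.1)

inputs = torch.tensor([2.0, 3.0])
z = neuron.weighted_sum(inputs)      # about 2.0
out = neuron.activate(inputs)        # sigmoid(2.0), about 0.88
```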
Here we would instantiate this Neuron class, and let's say the number of inputs is equal to two, so this creates two weights and one bias; the number of inputs just determines the number of weights. In this example the number of inputs is two, and these are the random weights that we generated; they're really very small because we multiplied by 0.1. Then, to our Neuron class, we want to add this weighted_sum function. It takes inputs, the actual input numbers, which is different from the number of inputs, and it does the dot product between weights and inputs. It says dot here, which is a bit confusingly named, but for these vectors it is the same as matrix multiplication between weights and inputs, and then we add the bias at the end. Then we add this part to our code: we have already seen the initialization part, so after it I set some inputs and call neuron.weighted_sum, our new function. The weighted sum here is two, which also includes the bias, by the way. Now we add a third function to our Neuron class, the sigmoid function. It makes the neuron non-linear, so it curves our line: weights times inputs plus bias is a straight line, but sigmoid, or any activation function, makes it curved. We use sigmoid because it has a nice curvy shape that the neural network can use well. This is the formula for the sigmoid function: we pass in a number x, and the formula is one over one plus e to the minus x. Then we pass the output of the weighted sum through sigmoid and get some activation value. The output of sigmoid is between zero and one, like this, and it's curved. So if the neural
network needs to use, say, this left part of the curve, it will learn weights and biases such that the weighted sum is around minus two or minus one, and after the activation it gets this left bend, maybe to encircle some data. If it needs the right part of the curve, it will learn weights and biases that give a number around one or two, so it gets this bend here to encircle some other data. The activation function itself never changes; it's just the weights and biases. The neural network learns to adjust the weights and biases so that when the output is passed through the activation function, it properly encircles the data, or is able to predict or generate data, so it has a proper understanding of the data. Now we can test this with different inputs: zero, zero; one, zero; two, three; and you will see that the output is always some number between zero and one. We can also try different activation functions: for this input sigmoid gives about 0.88, ReLU gives two, and tanh gives 0.96. Depending on your data set, you need to find an activation function that works well. For example, ReLU is very easy and fast to compute, though it simply zeroes out negative values, so it's a reasonable pick when you need something quick and easy. Now let's see how to combine multiple neurons, not just one, into a neural network layer. Again we have inputs two, three, but now we have two neurons: the first neuron has weights 0.5, 0.3 and the second neuron has 0.2, 0.4. You see that each neuron has a weight for each of the inputs, and we have two biases, one for each neuron. The forward pass is the matrix multiplication between inputs and the weights transposed. I explain matrix multiplication in the math and PyTorch fundamentals lessons, but you need to transpose the weights here. Some people also write it as weights matrix-multiplied by inputs, or by inputs transposed. This just depends on how
you store your weights in memory. Matrix multiplication has some rules, but for now let's just take it as is: inputs @ weights transposed, plus biases. So we have the input tensor times the transposed weights, and at the end we get this output: 2 and 1.8. If we pass it through the sigmoid activation, each of them becomes some number between zero and one, depending on the input. Here we define a Layer class, and it has a number of inputs and a number of neurons. These are like hidden neurons, and it can be a different number: you see in this picture we have three inputs but five hidden neurons, and then one output at the end. In the linear layer that we initialize, we have the number of inputs and the number of neurons, and the weights have shape number of neurons by number of inputs. This is one of those things where you just need to get the intuition by trying it. Number of neurons is three means we have three rows, one, two, three, and number of inputs is two, which is the second dimension here, so we have two columns: three rows, two columns. That's also why we need to transpose: we have two inputs, and matrix multiplication multiplies a row of the first matrix with a column of the second, so we need to transpose, or flip, the matrix of weights. I recommend you try to understand this yourself: play around, see what happens, try to draw it, and understand how matrix multiplication works first. You need to practice this. These are the example weights that are initialized, and the input has two elements. When you multiply the input matrix and the weights matrix you need to transpose the weights, because you are multiplying a row of the input with a column of the weights. You want to multiply the input with this row of weights, but you cannot multiply a row with a row, so you need to transpose this row
to become a column, so now it's just two, two, two like this. Then, after initialization, we add this weighted_sum, and here we just do inputs matrix-multiplied by the transposed weights, as I explained, plus biases. At the end you get something like this: three numbers, because the number of hidden neurons is three. We are multiplying a one-by-two matrix of inputs by a two-by-three matrix of transposed weights, so the result is one by three. That's matrix multiplication, and these three numbers are the neurons you get as a result. Again, I recommend you try this out yourself, change some numbers, and try to understand how this matrix multiplication works. Then it's very easy to add the activation function: torch.sigmoid, and pass in the neurons. Before the activation the neuron values are these, and after the activation it's this tensor, or vector. After we have this sigmoid, we add it to the forward function: in forward we first do inputs times weights plus biases, and then pass the result through the sigmoid activation, and that's the output, the neuron values that we get. So we can say output = layer.forward(inputs). This is how you would usually do it in neural networks, large language models, reinforcement learning, everywhere. Here we can take a look at some activation functions. I already explained sigmoid: it maps any value of x to between zero and one, so the output is always between zero and one. We also have ReLU, which is very easy and quick to compute: if the input is zero or lower, the output just becomes zero, and if the input is larger than zero, it remains that same number. So R(z), where z is our number, is the maximum of zero and z. That's the ReLU function: you just pick zero or z, whichever is bigger. This is very fast to compute.
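A hedged sketch of the layer described above, not the notebook's actual code. The names `Layer`, `weighted_sum` and `forward` mirror the words used in the lesson; the biases 0.1 and 0.2 in the second example are assumptions chosen so the result matches the quoted output of 2 and 1.8.

```python
import torch

class Layer:
    """A layer of sigmoid neurons; weights have shape (n_neurons, n_inputs)."""

    def __init__(self, n_inputs, n_neurons):
        self.weights = torch.randn(n_neurons, n_inputs) * 0.1
        self.biases = torch.zeros(n_neurons)

    def weighted_sum(self, inputs):
        # (batch, n_inputs) @ (n_inputs, n_neurons) -> (batch, n_neurons)
        return inputs @ self.weights.T + self.biases

    def forward(self, inputs):
        return torch.sigmoid(self.weighted_sum(inputs))

inputs = torch.tensor([[2.0, 3.0]])          # one row with two inputs: shape (1, 2)

layer = Layer(n_inputs=2, n_neurons=3)
output = layer.forward(inputs)               # shape (1, 3), values in (0, 1)

# The two-neuron example from the lesson, with assumed biases.
small = Layer(n_inputs=2, n_neurons=2)
small.weights = torch.tensor([[0.5, 0.3], [0.2, 0.4]])
small.biases = torch.tensor([0.1, 0.2])
z = small.weighted_sum(inputs)               # about [[2.0, 1.8]]
```

Storing the weights as (n_neurons, n_inputs) is one common convention; it is the reason for the `.T` in `weighted_sum`.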
use this if you need some quick computation with that's not so maybe powerful or precise but you just need some uh easy task for the neural network and maybe it can do it with this activation function and you see here it would just if you have a line it will it will maybe make that line like this it's not going to be curved it's going to be like this just like cut or broken like this but it will be a continuous line it will not be like separated but for some tasks it's not good if you make outputs that are lower than zero exactly zero always so you kind of erase all of that information so there is a leaky relu so outputs that are lower than zero it will just make them a lot closer to zero but not zero so it will just squish them towards zero but they will still not be zero so they will be 0.1 times x if they are lower than zero or they will just remain x if they are higher than zero so this doesn't erase completely information like the relu function while it still provides non-linearity this bending so the neural network can use this bending to encircle some data here we have some examples you see this is the zero so relu is just zero and then gelu is actually uh more curved around zero here so it preserves information if the number is around minus two so and then we have elu that's kind of going down even more so we have these different activations you would test which ones work best for your data here is another example sigmoid relu silu swish elu they're all kind of similar but they do a bit different things but they will all curve your line of weights and outputs and neurons you see with just multiplying weights and inputs and biases you will get a straight line but then passing that through activation function you would get a curvy line this is a good example as well you want to fit this data with a straight line it's going to be tough you're going to have to leave out a bunch of this red data on the other side which is wrong but with a curvy line maybe this 
is a sigmoid function here you can fit data properly now to understand how neural networks learn i have three good videos so you can watch this one back propagation from scratch torch.backward and this one where i train a neural network i'm gonna leave all these links in the description or you can just search for them i also put these three videos here in the third lesson after the neural network and layer lessons so you can just copy these urls that's gonna be it for this video and see you in the next one welcome to attention mechanism and self-attention in transformers in large language models so let's say we have a sentence the cat sat on the mat let's say we have just the words before mat and we're trying to predict the next word which is mat so how do we know what's the next word well we're gonna take some information from the previous words so we have two challenges here how do we know from which words to take information for example here cat and sat would give us the most clues that this could be a mat so how to select previous words and how to actually take information what does that mean how is that done so in this example we see that cat is selected to have 60 percent relevance sat 20 percent the 10 percent and the others have smaller ones okay and so you can ignore this last one imagine it doesn't exist so i'm explaining how to generate this in this case attention weights will sum up to 100 percent so do they sum up i think they do okay but if we didn't have this last one with three percent then this three percent would actually be spread within the others as well so these would sum up to 100 percent if we didn't have this last word that we are trying to predict so the model learns which words to focus on the way an llm works is if you are predicting this word mat it's gonna take the previous word the and add all of the relevant information into this word so now this word will have not only this word
but all of the context before it as well and then after processing processing processing processing uh move this this is done like many times over and over again it will just convert the final vector that's juiced up with all of the previous information into this next word so how do we juice up this vector with all of the previous words all of the previous information so every of the tokens including this one will have will create three things query key and value this is confusing but bear with me so each of these words or tokens will be like a vector embedding it's a vector you can check my previous lessons math pytorch and neural networks to understand this so it's a vector and so that vector will transform into three vectors for each of the token vectors so each of the tokens here each of that will get three query key value and usually this query key and value are a lot smaller five ten times smaller than this big token embedding so this token vector token embedding represents the meaning of these words so just these numbers represent meaning so for example large language model learns what each of the numbers in the vector each of the dimensions represents for example maybe the first number in the vector represents if something is green so the higher the number the more this token has this green characteristics so for example if you have like some green apple the first dimension will have a lot higher value than a word heart that's associated with red for example so the first number first dimension in the vector of word heart will have a lot lower value so neural network will learn what these features are and in this vector this vector is same size for every token it represents the token same size and the position the dimension for a second it's always the same so zero dimension always represents the same thing for every token like how this how green is this so each token so each token will have query key and value a query says which information am i looking for 
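A small sketch of the idea that every token embedding is turned into three smaller vectors. The sizes (`embed_dim = 64`, `head_dim = 16`) and the layer names `to_query`, `to_key`, `to_value` are made up for illustration; the lesson only says that query, key and value are usually several times smaller than the embedding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 64   # assumed size of each token embedding
head_dim = 16    # assumed (smaller) size of query, key and value

# Three separate learned projections, one per role.
to_query = nn.Linear(embed_dim, head_dim, bias=False)
to_key = nn.Linear(embed_dim, head_dim, bias=False)
to_value = nn.Linear(embed_dim, head_dim, bias=False)

# Six random embeddings standing in for "the cat sat on the mat".
tokens = torch.randn(6, embed_dim)

q = to_query(tokens)   # one query vector per token
k = to_key(tokens)     # one key vector per token
v = to_value(tokens)   # one value vector per token
print(q.shape, k.shape, v.shape)
```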
the key says which information i have and the value is the information that it gives now keep in mind there is a difference between describing which information you have and that information itself so that's an important difference so if we are on this word the its query will dot product with all of the keys individually okay so if this token's query dot product this other token's key as i said the key is explaining what information this token holds if the dot product is a high number we call that high affinity then you will just multiply the value of this token the value which is the actual information it gives with that number with that dot product of the query and key and add it to the value of this token okay in the end we are just adding values but the query and key decide how much to weight the value so how much do you want to reduce it so if you multiply the value of this with a small number then when you add that to this value it will not influence it a lot because this value will be multiplied with like 0.01 so when you add that it's a very small number in each dimension so it will not change this too much and if that 0.01 is the weight from the dot product between this query and this key it means like i don't care about this information too much there is no affinity there but if that weight is high it can be close to one for example let's say 0.9 then when you multiply this value with 0.9 and add it this value will have a lot stronger influence because now it's a lot larger number than when you multiply it with like 0.01 so basically to repeat if you are predicting this token then you look at this token and you add all of the values to the value of this token now remember i said we're not adding the vector embeddings of these tokens we are adding their values we are transforming that vector embedding using a small neural network one layer neural network into a value
vector and value vector is a lot smaller than the big vector embedding so we don't actually need to have so much processing we don't need to work with the huge vectors we can just transform it to small vectors for faster processing and it's sufficient to do this so we just add values of each of them to this last known token add those vectors but how much of the vectors we add will be decided by the query times key so the query is what i'm looking for and the key is what i give if that number is low then it means like okay i don't really want what you give or i just want a little bit of what you give so then the value vector will be multiplied with that small number so you will just get a little bit of influence here when you add that weighted value to this value in the end all of the values will be added to this value and maybe this cat value will have a high multiplication number a high weight and so it will give a lot of context and influence this value and then sat can also give a lot of context so be a large number as well so this is the self-attention formula in this case so this is where we start so query this is a query vector of this current token actually in this formula this is meant for the matrix of all of the queries of all of the tokens but not to complicate let's just focus on a single token so query of the current token times all of the keys of every previous token so instead of calculating keys one by one so query times key query times key query times key we put all of this into a matrix and we multiply this vector with this key matrix and transpose is just because that's how you do matrix multiplication so that's why it's here it's to indicate it's a matrix multiplication you need to take care which dimensions what's transposed what's not but basically you multiply so this is going to be your query this is going to be your keys for example or you can have a query
like this and keys like this it's just a matrix multiplication and then this is the division because here as i said you just have one query so the result will also be just one vector and you want to kind of normalize that vector if it has some large numbers neural networks don't like large numbers it's a bit tough to explain but multiplication of large numbers behaves differently and it can influence training so not to over complicate it we'll explain that later a neural network learns better if the numbers are small between zero and one or maybe a bit bigger than one but not too much bigger so that's why we just divide by the square root of the number of dimensions of the key or query so for example if the query has 500 dimensions it's a vector 500 numbers long you divide by the square root of 500 and key and query will have the same dimension then comes this softmax so remember the result of this is just an array and it has some numbers so maybe the biggest number would be here these are just dot products so they don't yet add up to one or to 100 percent they are just like a big number here a bit smaller number here smaller so you need to make them add up to one or 100 percent that's why you pass them through softmax so they all add up to one or to 100 percent while retaining their relative order so this guy that had the biggest dot product will still get the highest percentage let's see this on an example so these could be our values so these are the value vectors for the cat sat and these are our weights so it means like to generate the next word this one is 10 percent important this one 30 percent important and this one 60 percent important and so we multiply 0.1 times this first vector 0.3 times the second and 0.6 times the third and then we add up all of the vectors so we add the first dimension here after the multiplication then the second dimension so we add the vectors element wise and we get just one vector so four values
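A worked sketch of the two steps just described: turning raw dot products into weights with scaling and softmax, and then using weights of 0.1, 0.3 and 0.6 to combine three four-dimensional value vectors. The numbers inside `scores` and `values` are hypothetical, since the actual numbers on the slide are not in the transcript.

```python
import torch

# Step 1: raw query-key dot products (hypothetical numbers) become weights.
scores = torch.tensor([1.0, 2.0, 3.0])
d_k = 4                                   # assumed query/key dimension
probs = torch.softmax(scores / d_k ** 0.5, dim=0)
print(probs, probs.sum())                 # the weights add up to 1, the order is kept

# Step 2: weighted sum of value vectors with the weights from the lesson.
values = torch.tensor([
    [1.0, 0.0, 2.0, 1.0],   # value vector for "the" (hypothetical)
    [0.0, 2.0, 1.0, 3.0],   # value vector for "cat" (hypothetical)
    [2.0, 1.0, 0.0, 1.0],   # value vector for "sat" (hypothetical)
])
weights = torch.tensor([0.1, 0.3, 0.6])

# Scale each value vector by its weight, then add them element-wise.
weighted = weights.unsqueeze(1) * values  # shape (3, 4)
context = weighted.sum(dim=0)             # shape (4,)
print(context)                            # approximately [1.3, 1.2, 0.5, 1.6]
```

The same result comes out of `weights @ values`, which is how the weighted sum is usually written once all tokens are handled at the same time.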
the weighted value vector this is the example so multiply each vector so you get these numbers and then you add them this guy goes here this guy goes here and this guy so you add all of the third elements into this and all the fourth elements into this so this is the weighted value vector that you get in the end and now this output will be kind of dominated by this third vector because you see these values are very small because our weight is 0.1 and these values are gonna be biggest because our weight is 0.6 so the output will be dominated by this because it has a large weight here i explain query key value again with an example so the query is what i'm looking for if processing the word sat the query might encode i need to find the subject of this action so then it will fire when it sees cat so when you have the query of the word sat and it sees the word cat it will have a high dot product between the query of the word sat and the key of the word cat because this query learned to adjust its weights to look for nouns or subjects the key for example of the word cat might say i am a noun an animal a potential subject so as i said it's gonna have this key vector and the key vector will just have a high dot product with this query vector that's how the neural network just learns those weights to get a high dot product between sat and cat and the value is the actual vector that gets added up the actual information it gives so it's different from describing what it has the key describes what the value has or what the value will give so if you go down in my school here we have learn to code and build llms from scratch you can check a few of these i'm actually explaining all of this over and over again in all of these so it will just require a lot of practice and explanations i'm gonna add more exercises and videos so that's gonna be it for this video and see you in the next lesson welcome to the self-attention lesson here we will focus more on code so you can check
the first lesson attention mechanism explained and here it's the code implementation so key query and value which i explained are all linear layers i mean they are created with linear layers you pass in the embedding dimension which is the vector dimension the model dimension of the llm and in this case we will have the same dimension for the query and key and value okay sometimes you can have smaller dimensions here so this will just go from embedding dimension to embedding dimension but it will have some weights and transform this vector embedding into its query so this query will say what information is this token looking for i explained this in the last video key is describing the information and value is the actual information so key is describing what value is and all of them are linear layers that generate this key query and value from the same input if we scroll down a bit we can see a classic example in a large language model where you have sequence length which is just your conversation your words so your context window when you talk to chatgpt okay and batch means there are multiple context windows multiple independent conversations okay and every token in that sequence is represented with a vector so that's why we have the vector dimension the sequence length and the batch which can be these 32 conversations i changed it to conversations each conversation has 100 tokens and 512 dimensions for example for each token so we pass x this entire thing and we generate the queries for every token okay so there are 32 times 100 tokens we generate a query for each a key for each and a value for each that's how we code it and so we have the query matrix with all queries and the same for key and value so the scores are calculated with queries matrix multiply keys transposed where we transpose the last and second to last dimensions because that's just how matrix multiplication works this is for the reasons of matrix multiplication you need to have this transposed and
that's it so the shape will be batch size sequence length sequence length so so here every token will have score with every other token so there will be score between every single pair of tokens so it's like a square that's why they say attention grows quadratically for every token you need to add all of the previous tokens to the score list so the scores again you can imagine a similarity between these two tokens or i should say affinity or how much this token cares for this other token how much of its information it wants to take its context and then we will just scale them back for stability reasons so this this this means like square root so this is the square root of the embedding dimension of the of the dimension of the queries or keys and then we apply softmax to scores to put them between like add up to 100 percent so we have like and then those are weights so now for every token i explained this in the previous lesson but here we wanna for every token we take its value and add all of the values of other tokens but weighted so if for this token like this other token is very important then this attention weights will actually multiply its value with a high number and then we add just like add element wise this vector but if some value is not so important it will get weighted by a smaller multiplied by a smaller number smaller attention weight and now that this condensed or squished numbers that are around maybe zero will all get added to this value so it will not influence too much this value that it's added to so it didn't take too much context from this token i explained this in previous lesson you can check also so then you would be able to just define self-attention with the dimension you want for keys queries values and then you would just pass in some x through the attention and you should get the output immediately which is a list of values for each token that's juiced up with the context from other tokens so here we don't have causal self-attention 
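The steps of this lesson can be put together in a minimal, non-causal self-attention sketch. It follows the description (three linear layers, scores from queries times keys transposed, scaling by the square root of the dimension, softmax, weighted sum of values) with the example sizes of 32 conversations, 100 tokens and 512 dimensions. The class and attribute names are assumptions and may differ from the code shown on screen.

```python
import math

import torch
import torch.nn as nn


class SelfAttention(nn.Module):
    """Single-head self-attention without a causal mask."""

    def __init__(self, embed_dim):
        super().__init__()
        # Query, key and value are all linear layers applied to the same input.
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        # One score per pair of tokens: (batch, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1)
        # Scale by the square root of the query/key dimension for stability.
        scores = scores / math.sqrt(q.size(-1))
        # Each row of weights adds up to 1.
        weights = torch.softmax(scores, dim=-1)
        # Weighted sum of the values for every token.
        return weights @ v, weights


torch.manual_seed(0)
attention = SelfAttention(embed_dim=512)
x = torch.randn(32, 100, 512)   # 32 conversations, 100 tokens, 512 dimensions
out, weights = attention(x)
print(out.shape)       # torch.Size([32, 100, 512])
print(weights.shape)   # torch.Size([32, 100, 100])
```

The (100, 100) block of weights per conversation is the square that makes attention grow quadratically with sequence length.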
so here every token is taking attention from all of the previous and following tokens but when we're training a neural network it shouldn't be able to see the following tokens because then it would be easy to predict the following tokens if we see them so it's not implemented here but i'm telling you we will learn this so we will actually for each token take context or do multiplications with just the previous tokens so that's going to be it for this short lesson on implementing attention in code let's learn to build and code a gpt style llm so this is like chatgpt i recommend you watch these two previous lessons on attention mechanism explained and self-attention from scratch so if you've seen the transformer architecture there are two parts and this is good for sequence to sequence when you have encoder and decoder so the classic transformer architecture is good when you have some sequence of words in one language and you want to translate that into a different language but gpt has only this part only the decoder and that's good when you just want to keep generating text and as it turns out maybe this decoder only transformer is also better for translation and sequence to sequence so it probably also works better in practice even though the idea here is cool to have two parts for two languages but anyways in our gpt architecture we just have the decoder and we'll explain more what that is we know that a large language model will predict the next word so if we have once upon a time very far away there it is going to predict lived as the next likely word so the four parts of the transformer we have causal masked multi-head self-attention which is actually explained here in self-attention i will explain what multiple heads mean and then add and normalization so when you have input you pass the input token or tokens through self-attention and after that's processed you actually want to take the same input and add these tokens processed with self-attention to the input that's not
processed so add them together and when adding the vectors you give the neural network both the processed information and the raw information so it has both and it will use both if you remove this residual connection that just adds unprocessed information then the neural network or language model will learn worse and slower because these initial tokens also contain important information so we don't just want to process and just have that but also the initial so you want to add those vectors and this normalization usually happens before self-attention and it's just to keep the numbers in the vectors in a nice range around zero and one or minus one because big numbers or very small numbers can cause problems like vanishing gradients and exploding gradients because when you multiply numbers near one the result is also gonna stay near one but if you multiply numbers like thousands and millions the next numbers the results will explode and maybe such big numbers cannot be represented in computers so we want to keep our numbers around one or zero the feedforward network it's just a single multi-layer perceptron so it takes as input this processed and normalized vector so usually you would also put normalization before feedforward and before this masked attention in the original transformer it was after but now we would put it before so at the output of self-attention and the residual connection when we add those two so this add and norm so i guess norm would go before but add would go after so with that added vector result we can normalize that again with this norm that we would put before feedforward and then put that into feedforward so that's the input now we expand into the middle layer the hidden layer it's going to be like four times bigger usually than this input and so in this middle layer of our feedforward network there are a
lot of facts like about basketball players like all the knowledge that the neural network learned it's like a data storage and it also transforms these tokens in different ways so it has multiple purposes and then from the hidden layer that's four times bigger than the token size in the input it shrinks back to the token size at the output and that's our one pass through the transformer now that can go back into attention into the next pass the next layer and then we can repeat this cycle 10 20 30 maybe 60 times in big large language models so we'll explain all of this again don't worry so these would be example hyperparameters here so the number of heads in self-attention so we will just divide each token vector in self-attention into different independent parts and so we will then do attention on those independent parts which are like independent key query value all coming from the same token but we are dividing them into multiple heads because now each head will learn different things maybe some head will learn to do mathematics some head will learn about medicine etc this is the dimension of the feed forward hidden layer that i was explaining that was four times bigger and here it is four times bigger because d model is the token embedding size and the feed forward hidden dimension is four times bigger dropout is just to prevent overfitting vocab size usually this is like 50 000 to 200 000 on big models and this is the maximum context window the sequence length a maximum of 512 tokens that you can have in your language model and number of blocks can also be called number of layers so how many times do we repeat this process so self-attention norm feed forward norm so that's one okay then the output here goes back in and then back in again so three times in this case three blocks or three layers and these earlier layers will learn something maybe to find out some basic features in text and then later layers will
learn maybe more complex things to generate next tokens or to give the information necessary to generate next tokens then we have rotary positional embeddings this is how we encode which token comes after which token this is the order of tokens so is the cat chasing the dog or the dog chasing the cat we use rope to encode the position of the words of the tokens and let's go all the way down this is where it happens so we have our vector embedding for the token and we split it into two so we have x1 is the first half of the token x2 is the second half of the token and then we apply this rotation matrix so rotation matrix is a term in mathematics when you have a vector you can rotate the vector by multiplying it with this rotation matrix so the idea here is that it's a bit more complex but i'll explain you want to rotate early tokens in the sentence less and later tokens in the sentence more so the later the token the more you want to rotate its vector that's how by this rotation the neural network will learn the order of the tokens it just sees how much every token is rotated to understand their order and it's also good because it can also compare tokens relatively so third token and fifth token it can see that they are two positions apart relative to each other so we don't need only absolute positions like this is absolutely the first second third there is also a relative position besides the absolute so this rotation matrix the matrix that rotates the tokens can be expressed with sine and cosine and this is one way to do it so we multiply the first half with cosine minus the second half with sine that's going to be our rotated x1 this is our first half now to get the new second half we're going to multiply the original first half with sine plus the second half with cosine so that's going to be our new second half and then we concatenate
the first and second half and that's our new vector the token embedding that now encodes positional information and this code above is just how we generate this sine and cosine and how we decide how much we should rotate for each position what the sine and cosine should be for each position but this is a bit complex so we'll leave this for another time this is not so important now this is a very good example here of the code so this is in andrej karpathy's nanochat in apply rotary embeddings we just get x our token embedding and cosine and sine and we will split all of the x's into x1 and x2 so in this case x is not just one vector it's all of the vectors so we will split every vector in halves and then we will apply the same rotation that i just explained and then concatenate this is the dimension because now we don't just have one vector we have a whole batch of sequences of vectors but the logic is the same as i explained then let's look at some code implementing causal self-attention with rope we have a class causal self-attention that inherits from nn.Module and we need to make sure that the token embedding dimension is divisible by the number of heads because every head should have an equal number of dimensions and then head dimension will be just the token dimension divided by the number of heads that's how many numbers or elements each head has and then we have query key value so in this case we do all the query key value in one go so we go from d model to d model times three because d model is the token dimension and our query key value will also be the token dimension same size and you can learn more about query key value in the previous lessons about self-attention so we want to generate three of them query key and value and each has the size of the token so we need three times the size of the token
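The rotary embedding step described earlier (split each vector into halves x1 and x2, combine them with cosine and sine, concatenate) can be sketched as below. The lesson skips how the sine and cosine are generated, so `rope_angles` uses the commonly seen schedule with base 10000 purely as an assumption; both function names are hypothetical, and this is not the nanochat code.

```python
import torch


def rope_angles(seq_len, head_dim, base=10000.0):
    # Assumed schedule: one rotation frequency per pair of dimensions,
    # and later positions get larger angles (they are rotated more).
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half).float() / half))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)   # (seq_len, half)
    return angles.cos(), angles.sin()


def apply_rope(x, cos, sin):
    # x: (..., seq_len, head_dim); split every vector into two halves.
    half = x.shape[-1] // 2
    x1, x2 = x[..., :half], x[..., half:]
    rotated_x1 = x1 * cos - x2 * sin   # new first half
    rotated_x2 = x1 * sin + x2 * cos   # new second half
    return torch.cat([rotated_x1, rotated_x2], dim=-1)


torch.manual_seed(0)
x = torch.randn(2, 5, 8)   # (batch, seq_len, head_dim)
cos, sin = rope_angles(seq_len=5, head_dim=8)
y = apply_rope(x, cos, sin)
print(y.shape)
```

Because this is a pure rotation, the length of every vector stays the same and the token at position zero is left unchanged; only the direction encodes the position.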
we have a linear layer it's like some learned network weights that's going to generate our key query and value and we input the token and we output three times the token size which is now going to be different numbers and so later we will just split this into three take the first third as query the second third as key and the last third as value we also have the output layer so after we do attention with the tokens this will just convert because at the end of attention we just added values from each token to the last one we just added values now we need to convert from this mixed up value vector to the actual token embedding to exit the attention mechanism and keep in mind that in this case key query value are the same dimension as the token but it doesn't have to be actually in big llms key query value have a smaller dimension so at the end of the attention mechanism where we added all of the weighted values we need to convert that mixed up value vector into the token embedding vector and do that for every token if we are training or just for one token if we are doing inference but we'll talk about that more later and we initialize rope here as well passing self dot head dimension in this case this is how we defined it so we need to pass in the dimension of the head and then in this case we want to get the batch size how many independent conversations and the sequence length what is the maximum number of tokens in the sequence and i guess we don't need the token dimension here and so we generate key query values by just passing x through our linear layer that generates key query values so this will get us one big vector three times the model dimension and then we split it into three parts for query key value and then we need to split query key and value into heads as i said so each head will learn different things like maybe one will be about math or calculation or nouns or
verbs the other will do something else so we will use query dot view with batch size and sequence length that already exist but where previously there was one dimension here the model dimension now we split it into two dimensions so number of heads and head dimension for each head so now we do that for all of these and then we can apply rope to query and key we don't need to apply rope to value just query and key so next in the attention calculation we want to do heads independently so we will transpose this to put heads first this is the batch size and heads are independent dimensions and then for every head we will do the whole sequence length of the key query value so here the heads were separated inside each token but now they're grouped up by head and so we do this whole sequence length of the query key value together for this particular head and then for the next head etc and then this is the formula for attention and we talked about this formula in the previous lessons this is how you do it and then we want to apply the mask so that every token in a sequence can only look at the previous tokens and not the future tokens so it can only pay attention to those this is during the training by the way so in inference we don't need masks because we just have a sequence and we just do the next token once we don't need to calculate every token with every other previous token in inference when we are generating okay but when we are training we do need to calculate every token with every token to train better to train on the whole thing so this mask again will just make sure that the query of a token cannot look at the keys of the following tokens because it is learning to predict those so the query can only look at the keys of the previous tokens for every token and then we calculate softmax
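A minimal sketch of the causal mask described here, on a single head with four tokens. The scores are random placeholders; masked positions are set to minus infinity so that softmax gives them a weight of exactly zero. Variable names are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
seq_len = 4
scores = torch.randn(seq_len, seq_len)   # hypothetical query-key scores

# Lower-triangular matrix: row i (a query) may see columns 0..i (keys).
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Hide the following tokens, then turn each row into weights that add up to 1.
masked_scores = scores.masked_fill(~mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
print(weights)
```

The first row ends up as all weight on the first token, since that token has nothing before it to look at.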
before softmax these scores are just numbers logits they don't add up to one they don't add up to 100 percent so they can be like 2 1 0.5 whatever and we need to make them add up to 100 percent because let's take this token as an example it needs to know how much to take of the information the context of every other previous token so it's a lot easier to do that if you just have them add up to 100 percent so it knows i should take 20 percent of this token's value 10 percent of this value 5 percent of this value so that's what softmax will do for every token it will make all of the scores with every other token add up to one or 100 percent so it knows how much to take and then we can apply dropout to ignore some of the scores to stabilize training to make it not so dependent on some particular scores or some particular positions so this part output is where we perform the weighted sum of the values so now this is where it takes context from every previous token based on the score the importance i explained here v is the matrix of values for every token in the sequence for example v of the v of cat v of sat so when we multiply attention weights and values this is what it looks like for example the output for the word sat it's gonna do 0.1 or 10 percent times value of the plus 0.6 times value of cat plus 0.3 times value of sat and usually you will see that tokens take most of their own value so maybe i should put 0.6 here and 0.3 there so that's how we will get the weighted sum of the values of every previous token into each token so that's how we do the context and then after we do that we need to now go back by swapping heads and sequence length so now we need to put sequence length after batch and so we have sequence length and then we will combine number of heads and head dimension to get d model because now we want to transform from the key query value space into the token space so we will use this weight so this output
layer. The output is still in our value space, and we pass it through this linear layer to get the actual token space, the token embedding. But before we pass our values through, we need to convert them to the proper shape, so we combine the heads into one single vector. Then we have the transformer block, where we combine all of that. Here we want to initialize feed forward, which we didn't make yet, by the way. It is very simple: it goes from the token embedding into a hidden layer four times bigger and then back to the token embedding. In this four-times-bigger hidden layer you have all of the facts living there, all the knowledge, and some ways to transform tokens as well. Imagine attention as just collecting information into the last token in order to predict the next token; then this feed forward will process that information, mix it up, do something with it, create new insights. I also have dropout between the expansion and contraction layers. So this is the whole formula: we pass our token into the first linear layer, then use an activation like ReLU, although this could maybe be SiLU or SwiGLU, but let's use ReLU for now, it is simple and fast to compute, then dropout to prevent overfitting, and then we scale back down to the token size. In our transformer block we will initialize causal self-attention, feed forward, and then two normalizations. In this case it's layer norm, but I should actually change this to RMS norm; RMS norm is just used more these days. And the dropout as well. And this is the forward of the transformer block. We pass first through the attention; we get the input and the mask here, although we could probably generate the mask within attention as well. We pass through the attention and then normalize, although we could also put normalization before attention, but let's keep it this way for now. Then feed forward: pass through feed forward, normalize, and that's it, that's our transformer block. And then we have multiple of those blocks and
the output head and the token embedding, so now we need to combine everything into one. Here we need the entire vocabulary of tokens: the token embedding will hold all of the tokens, vocab size, and each of the tokens has its own embedding, so vocab size times d model. Then we also define this list of transformer blocks that I was just explaining. The number of blocks is going to be three in our case, if you remember from the beginning of the video. This is, I guess, the final normalization, and this is the output head that generates the next token. It comes after all of the blocks, after everything, and it has d model times vocab size. Basically, for every token in the vocabulary it will generate the probability that this is the next token, so it converts our vector embedding from the transformer into a set of probabilities: 50,000 probabilities if we have 50,000 tokens. It's just a linear layer from the token embedding into a set of probabilities over the entire vocabulary. This can actually be huge: it can be one third of the entire language model for small models. But there are some tricks, for example Apple's Cut Cross-Entropy, or maybe writing some kernels or GPU code. Let's see the forward pass through our GPT-style decoder-only transformer. We get our batch size and sequence length from the shapes, and we get a device, which will be GPU. We can create masks here or we can do it in the attention. This is how we create the mask; as I said, I talked more about unsqueeze and torch.ones in my PyTorch fundamentals course in this school. We just want to make sure we have a triangular matrix, a mask such that when we apply it to the attention, every single query has only the previous keys unmasked and the following keys masked. So this mask will, for example, be: for the first token it
will have all of the following tokens masked and just one position unmasked, itself, so it pays attention to itself, the query and key of itself. For the second token it will have unmasked itself and the previous token, but masked all of the next tokens. The third token will have the first three unmasked, so you see how you have one, two, three, four: it's like a triangle. This triangle here is masked and this one is unmasked, so that's why we need this triangular matrix. In the future maybe we will talk more about this as well. Then we pass all of our tokens through our token embedding layer to replace all of the tokens with their embeddings, so this gives us a list of embeddings instead of a list of tokens. And this is good for stabilization: dividing these embeddings by the square root of the model dimension. As I said, you want to keep your numbers around one, around zero, so that's what this will do, push them towards one. Then we pass our x through our blocks: for block in self.blocks. We have three blocks, so we pass x through every transformer block, and each block includes attention and feed forward, so we do that three times in this loop. Then we normalize and generate the probability over all of the possible tokens to be the next token. But these are logits, not probabilities; they are just numbers, and we can later turn them into probabilities with softmax. Basically a higher logit means a higher probability later for this token to be chosen. And then this is how we use and train this GPT. We initialize it and select the device, which is going to be cuda, meaning an NVIDIA GPU, if it's available. You better not train on CPU; that's not gonna be fun, it's gonna take so long. Then we put the model on the device, on the GPU: because you initialize it here, it can be initialized on CPU, and you want to move it to GPU. And then we
can have some dummy data. This source is just going to be random tokens, example tokens. Let's say we have some text here, but it's randomly generated tokens and sequences. For the model training we need cross entropy loss to check the model's prediction against the real next token, and we will use Adam to update the weights, to make the model learn based on this loss and get a better loss next time. Loss in large language models should be around one to three. If your loss is 0.001 or 10, then there is some bug. I had 0.001 loss because I had a bug, and then I discovered it. So loss should be around one, two, three. Then model.train: this puts the model into training mode, so for example dropout is active, and it works because we are inheriting train and the rest from PyTorch; this is done by PyTorch. So this is how you train a large language model. You have a sequence: your inputs are going to be the entire sequence except the last token, and your target is going to be the entire sequence except the first token, and I'll explain why. You see here we take all of the batches, all of the sequences I should say, and inputs is everything except the last token and target is everything except the first token. Token zero is not in the target because the inputs start with just one token and we want to predict the next token. The target cannot be the first token because we cannot start from nothing; we are not training the model to start from nothing, we are training it to start from at least one token. That's why the target cannot be that first token. And inputs go up to the last token but do not include it, because during training the longest possible input is all of the tokens in the sequence except the last, when we are predicting the last. So all of these combinations are possible: you can have half of the tokens given and you need to
predict the next half, or, okay, you wouldn't predict the next half, you will just predict the next token. So let me simplify. All of the possible combinations when training a large language model are: you can have one token and predict the next, you can have two tokens and predict the next, you can have three tokens, and so on until you have almost all and predict the last. That's why inputs run from index zero up to but not including the last token, and targets run from the second token until the last. So all of the possible combinations, from having at least one token to predict the next up to having all except the last to predict the last. As I said, we get logits, just numbers, when we pass this through the model. In this case it looks like we're not using softmax to turn them into percentages, and that's okay too: cross entropy loss in PyTorch takes the raw logits and applies the softmax internally, so we can calculate the loss by comparing the logits directly to the target tokens. The highest logit should be at the target, the correct token, and the further it is from that, the higher the loss. Then we do the optimization step: we empty out the gradients from the last backpropagation, calculate all of the gradients with the chain rule, and then update the weights to reduce the loss. So that's gonna be it for our GPT-style transformer. I'm gonna put this video here, and see you in the next lesson.
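
For readers following along without the code on screen, the triangular mask described in this lesson can be sketched in plain Python. This is only an illustration with nested lists, not the PyTorch tensors used in the video, and the function name `causal_mask` is a hypothetical helper chosen for this example.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len lower-triangular mask as nested lists.

    mask[q][k] is 1 when query position q may attend to key position k
    (k <= q), and 0 when the key lies in the future (k > q).
    """
    return [[1 if k <= q else 0 for k in range(seq_len)]
            for q in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Row one has a single unmasked position (the token itself), row two has two, and so on, which is the triangle the lesson describes. In PyTorch a matrix of the same shape is commonly built with `torch.tril(torch.ones(seq_len, seq_len))`.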
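
The softmax and weighted-sum steps from the attention part of this lesson can also be worked through by hand for a single query. The sketch below is an assumption-laden toy: the scores, the two-dimensional value vectors, and the helper names `masked_softmax` and `weighted_sum` are all made up for illustration, and it handles one head and one query rather than the batched tensors in the video.

```python
import math


def masked_softmax(scores, allowed):
    """Softmax over `scores`, giving zero weight where `allowed` is 0."""
    # Masked positions become -inf, so exp(-inf) = 0 and they get no weight.
    masked = [s if a else float("-inf") for s, a in zip(scores, allowed)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, values):
    """Combine value vectors using one attention weight per vector."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Hypothetical raw scores of the query for "sat" against the keys of
# "the", "cat", "sat" and a following token "on".
scores = [2.0, 1.0, 0.5, 3.0]
allowed = [1, 1, 1, 0]  # "on" comes after "sat", so it is masked out

weights = masked_softmax(scores, allowed)

# Hypothetical 2-dimensional value vectors, one per token.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
output = weighted_sum(weights, values)

print([round(w, 3) for w in weights])  # weights sum to 1, last one is 0
print([round(x, 3) for x in output])   # context vector for "sat"
```

Even though the masked token has the largest raw score and the largest value vector, it contributes nothing to the output, which is the point of applying the mask before softmax.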
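
The input and target shift, and the way the loss is computed straight from logits, can be sketched the same way. This is a minimal plain-Python illustration under stated assumptions: the token ids and logits are invented, and `shift_for_next_token` and `cross_entropy` are hypothetical helpers written for this example, not the PyTorch functions used in the video.

```python
import math


def shift_for_next_token(sequence):
    """Split one token sequence into model inputs and next-token targets."""
    inputs = sequence[:-1]   # everything except the last token
    targets = sequence[1:]   # everything except the first token
    return inputs, targets


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # log-sum-exp with the max subtracted for stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


tokens = [5, 2, 7, 1]  # hypothetical token ids
inputs, targets = shift_for_next_token(tokens)
print(inputs, targets)  # [5, 2, 7] [2, 7, 1]

# Hypothetical logits over a 3-token vocabulary for a single position.
confident = cross_entropy([0.0, 5.0, 0.0], target=1)  # highest logit on target
uniform = cross_entropy([0.0, 0.0, 0.0], target=1)    # no preference at all
print(confident < uniform)  # True
```

Position i of the inputs lines up with position i of the targets, so the model sees tokens 0 through i and is scored on token i + 1. On a batch tensor the same shift is usually written with slicing along the sequence dimension, for example `data[:, :-1]` and `data[:, 1:]`.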