The Elegant Math Behind Machine Learning

[00:00:00] Anil Ananthaswamy: If you think about us humans, nobody has sat around labeling the data for us. Our brains over evolutionary time have learned about patterns that exist in the natural world. So given that that's how nature has done it, there's no reason to expect that the machines that we build are also not going to be powerful just because of that technique. I honestly, sincerely believe that we can't leave the building of these AI systems to just the practitioners. We need more people in our society, whether they are science communicators, journalists, policy makers, or just really interested users of the technology, who have some math background or who are just willing to persist and learn enough of the math to make sense of why machines learn. It's only when we understand the math that we can point out that, hang on, these things are not reasoning in the way we think we are reasoning. It's because the math clearly shows that what's happening right now is that these machines are just doing very sophisticated pattern matching. [00:01:02] Speaker 2: Welcome back to MLST. We are interviewing the author of this book, Why Machines Learn, by Anil Ananthaswamy. Anil was flying through the UK on July the 17th on his way to India. He had a stopover for about 12 hours and I invited him to come and do an MLST interview. Unfortunately, there was a schedule clash. I think I thought he was going to be here the day before, maybe the day after. And I had to get my good friend Marcus to pick him up from the airport, take him over to the studio and ask the questions on my behalf. So I'm going to re-record the questions. I mean, you know, unfortunately stuff like that happens. But I'm very, very pleased that I managed to get the main man in the studio, even if I wasn't there. So yeah, Why Machines Learn? What is this all about? It's a really interesting kind of pedagogical history of the field. 
But going into some of the underlying mathematics behind many of the approaches in machine learning. Anil is a veteran science writer. You should look at some of the other books he's written. He's really, really good. The book was written beautifully. I enjoyed reading it. Oh, by the way, he signed it as well. Pretty cool. I hope you enjoy the conversation with Anil. Can you introduce yourself? [00:02:19] Anil Ananthaswamy: My name is Anil Ananthaswamy. I'm a freelance journalist. I trained as a computer and electronics engineer. I did my bachelor's in India and my master's at the University of Washington in Seattle. I worked as a software engineer for a few years before I started feeling the itch to become a writer. And at some point I figured out that the two things I love, science and writing, could be combined. And that I could actually become a science journalist or a science writer. So I went back to school, studied science journalism. Came to London to do an internship with New Scientist magazine. I was with them for six months doing the internship and that eventually led to a staff position. I was staff writer in London, became physics news editor, then became deputy news editor and wrote for New Scientist for a long time. And while I was doing that, I also started working on my books. And the first one was called The Edge of Physics. It's a travelogue-based book on cosmology and astroparticle physics. And each chapter is essentially a piece of travel writing, where I go to some really extreme locations on Earth, like the Atacama Desert in Chile, to Lake Baikal in Siberia in peak winter, to places like Antarctica, all the way to the South Pole. And so that book explores essentially extreme physics. And the second book is called The Man Who Wasn't There, and that's an exploration of the human sense of self. 
So when you ask the question, who am I, you kind of get answers from theology and philosophy. And in this book, I tried to answer that question from the perspective of neuroscience and neuropsychology. The third book was Through Two Doors at Once. It's essentially the story of one single experiment called the double-slit experiment, which is an extremely mysterious experiment to explain with our standard way of understanding the world. And yet it's very illustrative of what's happening at the quantum mechanical level. So it's really a story about quantum mechanics and quantum foundations, but told through the lens of one experiment, all the variations of that experiment done over 200 years. And finally, my last book is the, you know, the book on machine learning. It's called Why Machines Learn, and it's about the mathematics that underpins modern artificial intelligence. [00:05:02] Speaker 2: What inspired you to write about the elegant mathematics of machine learning? And can you give an example that you find extremely exquisite? [00:05:13] Anil Ananthaswamy: When I was writing about particle physics or cosmology or neuroscience, I never felt like that was something I could do, you know, personally. It was more about understanding the science and writing about it. But over the last few years, I found myself writing more and more about machine learning. And given my software background, given that I used to, you know, be a software engineer, every time I would write stories about machine learning, I think the software engineer part of me woke up. I would look at those stories and get this desire to actually get back into doing a little bit of coding, to actually understand this technology from the ground up. So about five years ago, I did a fellowship at MIT called the Knight Science Journalism Fellowship. And as part of that fellowship, I decided to teach myself coding all over again. 
So 20 years after I had stopped doing any programming, I literally went back to, you know, the computer science 101 kind of classes, sat with teenagers and taught myself Python programming and PyTorch and started building some very rudimentary machine learning systems, well, one or two small things that I learned how to do. And as part of that exploration of trying to build a deep learning system, a deep neural network based system, I got more and more interested in understanding the mathematical underpinnings, the basic theory behind machine learning. And towards the end of my fellowship, COVID happened, we were all stuck in our apartments, and I spent a good six, seven months, basically stuck in an apartment by myself, both in Boston and in Berkeley, California, listening to all these machine learning lectures over and over again, teaching myself essentially. And at some point, I started realizing that the mathematics that underlies machine learning is quite beautiful. And I think then the writer in me woke up saying, oh, I really need to communicate these ideas to my readers. So that's how the idea for this book came about, you know, Why Machines Learn, which is essentially really about some of the conceptual mathematical principles that underlie modern artificial intelligence. Yeah. And regarding, you know, what is elegant about the mathematics of machine learning, a lot of people will say, oh, you know, machine learning is, you know, mainly about knowing calculus and linear algebra and probability and statistics. What's particularly elegant about that? And I'm not talking about those subfields of mathematics. For me, the beauty and elegance that I found when I was learning about machine learning had to do with some of the theorems and proofs that I encountered. Like, for instance, if you go back to 1959, when the first artificial neural networks were being designed, there is a result called the perceptron convergence theorem, and its proof. 
And it's a very, very simple proof just based on, you know, linear algebra. And it was while listening to a professor explaining that to his students in Cornell that I kind of, I think, fell in love with the subject. I really felt like, okay, this is something I really need to tell readers that there is something wonderful, you know, in this whole subject. So the perceptron convergence proof is an example of what's, you know, really lovely and elegant about the mathematics of machine learning. With a caveat that, you know, things like elegance are always subjective. What I might find beautiful and elegant may not be somebody else's cup of tea, but, you know, that's how it goes. There's also, for instance, a technique called kernel methods, which is this very, very interesting idea where you take data that exists in low dimensions and project it into high dimensions into, you know, much, much higher dimensional space, possibly even infinite dimensional space. And the entire method, you know, these kernel methods, what they do is they rely on the mathematics that needs to happen in the high dimensional space. But the computations that are done are always in the low dimensional space. So there is a function or a kernel function that kind of projects this data into high dimensional space. And all of your, you know, algorithm is functioning in the high dimensional space, but the actual computation happens in the low dimensional space. And that whole process of taking low dimensional data, pushing it into high dimensions, doing what you want, you know, in those high dimensional spaces, but actually not really doing any computation in the high dimensional space. It's really lovely when you look at it. It's quite beautiful and very powerful. So there were a lot of ideas like these that I found as I was doing my research that almost made it very easy to come up with a list of things about which to write. 
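As an editorial aside, the kernel idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an example from the book or the interview: the degree-2 polynomial kernel, the explicit map `phi`, and the two sample points are all chosen here for demonstration. The sketch shows that a dot product between points projected into a higher-dimensional space (3-D here) can be obtained from a computation done entirely in the original 2-D space.

```python
import math

def dot(a, b):
    """Ordinary dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def phi(v):
    """Explicit projection from 2-D to 3-D (assumed for illustration):
    (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Degree-2 polynomial kernel, computed only in the original 2-D space."""
    return dot(a, b) ** 2

# Two hypothetical low-dimensional data points.
x = (1.0, 2.0)
y = (3.0, -1.0)

high_dim = dot(phi(x), phi(y))   # project first, then take the dot product
low_dim = poly_kernel(x, y)      # never leaves the 2-D space

print(high_dim, low_dim)         # the two values agree up to rounding
```

For kernels whose implied space is very high or infinite dimensional, only the second route is practical, which is the point made in the conversation.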
[00:10:46] Speaker 2: What basic mathematical disciplines do you find essential for machine learning? [00:10:50] Anil Ananthaswamy: So for me, when I wrote this book, I was thinking of, you know, people who have maybe a high school level, you know, or first year undergraduate level mathematical education, and now want to learn something about the basics of machine learning. So we're not talking of people who are going to become practitioners, but it's basically people who need to understand machine learning at more depth than is possible if you were just to read magazine articles. So for that kind of audience, I think the disciplines that you really need to come to grips with are basic calculus, some trigonometry, linear algebra, the basics of probability and statistics, and a little bit of optimization theory. It's not a whole lot. But when these pieces all come together, you kind of get a very good sense of why machines learn, you know, why they do the things they do. [00:11:49] Speaker 2: Many of the recent AI advances seem quite empirical. How much of the mathematical foundations do you think are important to grasp machine learning? [00:11:56] Anil Ananthaswamy: I think it's true that in modern AI, or modern machine learning, which is essentially based on deep learning and deep neural networks, there is a lot of empirical stuff that is happening. People are just building things and finding out that they work this way or that way without really understanding why these algorithms work the way they do. And in order to really understand why these systems are powerful or what their limitations are, I think the answers to those questions actually will come from figuring out the mathematical foundations of these algorithms. Right now, the way the field is, I think there's a lot more empirical evidence about, you know, the workings of these machines. 
And we're still struggling to figure out the exact mathematical formulation that can explain why these things work as well as they do, or for that matter, what their limitations are. Because until we know, you know, all the pros and cons of these machines, from the perspective of the mathematics, it's going to be hard to put upper and lower bounds on what these machines can or cannot do. [00:13:14] Speaker 2: How does your book showcase the rich history of the field, you know, of machine learning beyond just deep learning? [00:13:21] Anil Ananthaswamy: I mean, if you ask anybody today, you know, about what AI is, you know, people on the street, they will probably say, oh, it's ChatGPT. And yes, you know, these large language models have made a big splash. They use a form of technology called deep neural networks and deep learning. But the entire history of machine learning goes back a long way. And there's a lot of other stuff that has happened that is not about deep learning. You know, I mentioned earlier that the early history of artificial neural networks begins sometime in the late 1950s, early 1960s. And those were what were called single layer neural networks, essentially one layer of artificial neurons. And the algorithms that were designed were enough to train those single layer neural networks to do some task. But it became clear very soon that if you had more than, you know, one layer, with a layer sandwiched between the input and the output (that sandwiched layer is called a hidden layer), and if you had more than one hidden layer in your network, you could not use the algorithms that you had to train them. And these single layer neural networks, even though you could train them, couldn't really do a whole lot. So, you know, by the end of the 1960s, people had kind of given up on neural networks, thinking that these things are not going to be very useful. 
But machine learning research didn't stop. There were a whole range of other things that were happening that were non-neural network based ideas. So for instance, also in the 1960s, a very powerful algorithm was analyzed mathematically, and it's called the k-nearest neighbor algorithm. That was really popular. There were techniques that had to do with using, you know, Bayes' theorem and other statistical ideas to develop, you know, algorithms that were really powerful. Probably my favorite non-neural network based machine learning algorithm is the support vector machine. Support vector machines came about in the early 90s and kind of dominated the pre-neural network era for a long time. And these algorithms try to find an optimal solution to some classification problem. And they also incorporate, as part of the algorithm, the kernel methods that I just talked about, you know, this idea of taking lower dimensional data and projecting it to higher dimensions, finding optimal margins in the higher dimensions, but doing your computations only in the lower dimensions. So the combination of optimal margin classifiers and kernel methods made these support vector machines really powerful. So there's a whole range of stuff that one can talk about that happened between sort of the late 1950s and early 1960s, when the first neural networks came about, and, you know, the last decade or so, when deep neural networks have come back in full force, right? And the book does deal with the intervening history also, because I think the mathematical concepts that underlie those other algorithms are really crucial to understanding what is happening inside these machines in terms of how they represent data, how they see the world, you know, what they do in terms of manipulating the data. [00:17:21] Speaker 2: Which criteria did you use to select the algorithms and concepts that you spoke about in your book? 
[00:17:26] Anil Ananthaswamy: I had two hats on when I was trying to think of what kinds of things to put in the book. The first, probably the most important criterion was that the algorithms were useful for demonstrating some very key mathematical idea. Like for instance, the k-nearest neighbor algorithm is very, very important for understanding how data, you know, is turned into vectors and how these vectors, you know, are mapped onto some high dimensional space. And the relationship between vectors is what determines how this algorithm does its job. So I was, you know, using the k-nearest neighbor algorithm to give the reader an in-depth understanding of how data gets converted into vectors and then gets embedded in these high dimensional spaces, right? So a lot of times I was focused on making sure that every algorithm that I selected was highlighting some key aspect of something mathematical that was crucial for developing an overall picture of what the machines are doing. Again, this is subjective. Some other person, some other, you know, writer could have chosen a slightly different set and you could still make the case that that other set could also be illustrative of the mathematical concepts. So after figuring out that I needed to address a particular set of mathematical concerns, I also had my writer's hat on, right? And the writer's hat is basically making me choose algorithms which have some sort of story behind them, so as to make the story engaging for the reader. So it was not enough that there was very good math underlying these algorithms, but that the development of the algorithms themselves had a story to tell, you know, I could tell a story about them. And I honestly very strongly believe that we understand things better when whatever we are understanding is anchored in stories. 
And so it was a dual task of finding algorithms that had key mathematical elements to them, but also had, you know, substantial stories underpinning them. [00:19:54] Speaker 2: What are some of the basic mathematical disciplines that need to be grappled with in order to get under the hood of machine learning? [00:20:04] Anil Ananthaswamy: I would say, calculus, absolutely. Basic calculus, nothing very fancy. Linear algebra, again, depending on whether you're going to be someone who's going to be building these systems versus someone who's just going to be using this math to understand what's happening and not necessarily, you know, doing research or going ahead and building them. If you're using the math to just get a sense for why these machines are doing what they're doing, then even linear algebra, you don't really need a whole lot of it. You need to understand the, you know, concept of vectors and matrices and how you manipulate these vectors and matrices and, you know, it's not very complicated stuff. You just also need something about the basics of probability and statistics. You need to understand Bayes' theorem, for instance. And again, these are not terribly difficult. And a little bit of optimization theory. Again, that sounds like a fancy word, optimization theory, but there are some very basic techniques that we need to understand to figure out how these machines are essentially learning. You know, they are using certain techniques for optimizing their parameter space. And so, yeah, it's not a whole lot of complicated math, at least for people who want to understand or peek under the hood, so to say, as you put it. Of course, if you want to build these systems and if you want to start doing research, then your mathematical chops have to get much more sophisticated. [00:21:42] Speaker 2: Can you explain the bias-variance trade-off in machine learning? 
[00:21:46] Anil Ananthaswamy: Yeah, the bias-variance trade-off is a very classic trade-off. And the basic idea here is that when you're training a machine learning model to learn patterns that exist in the data that you've shown it, the model can be too simple, you know, and let's say we are categorizing the simplicity or the complexity of the model in terms of the number of tunable parameters it has. Things that you can, you know, different knobs that you can turn to figure out what the model does. So if the model has too few parameters, then when it's being fed data and it's being asked to figure out the patterns or correlations that exist in the data, if the model doesn't have enough parameters, then it's going to underfit the data. It won't do a good job of basically figuring out what those patterns are. And so, such simple models that are underfitting the data are said to have high bias. But then you can start making the model more complex. And again, by complexity here, I'm just using, maybe as a proxy, the number of parameters that the model has. And as you keep increasing the number of parameters, there comes a point where the model starts overfitting the data. If the data has a lot of noise in it, for instance, it's actually going to fit all the noise. It's as if, you know, a simple model might have drawn a straight line through the data that you have, but a very complex model is going to basically draw a very squiggly curve, you know, basically touching every data point that you have. Some of it could be just noise. So you essentially end up overfitting the data. And when you have a complex model that overfits the data, you are in the high variance regime, right? So suppose you are now testing how the model is doing on training data, how much error it makes when it's given the training data and you're asking it to fit the training data. When you're on the high bias side, the risk of training error is pretty high. 
It's making a fair amount of error even on the training data. But as the complexity of the model keeps increasing and you're moving towards high variance, the model starts fitting the data really well until it overfits it. So on the high variance side, you basically now have zero error that you're making on the training data. But what's interesting here is that there is a certain amount of data that you hold out from the machine. You don't show the machine a certain amount of data. Let's call it the test data. And when you test the machine that is being trained on this held out test data, then in the beginning, on the high bias side, you will still make a lot of error on the test data. And then as the model gets more and more complex, the error that you're making on the test data starts falling. But then at some point, when the model is starting to overfit the training data, the error that you're making on the test data starts to rise again. So it's almost like there's one curve that is just going, you know, asymptotically down to zero, which is the risk of training error. But there's another curve which is kind of bowl-shaped. It kind of comes down to a minimum and then starts rising again. And that's essentially the bias-variance curve. You want your models to be in that Goldilocks zone where you're making a low enough error on the training data, but also your error on the test data is at the minimum. And that's the trade-off. You don't want to overfit the data and you don't want your model to be too simple. [00:25:44] Speaker 2: What is the role of overparameterization in deep learning models? And can you explain the last chapter in your book, which was Terra Incognita? [00:25:54] Anil Ananthaswamy: So this bias-variance curve that I just talked about, you know, as you're making the model more and more complex, it's getting more and more parameterized in the sense that the number of parameters in the model is increasing. 
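As an editorial aside, the two ends of the bias-variance curve can be reproduced with a small Python sketch. Everything in it is an assumption made for illustration: the noisy sine-wave data, the alternating train/test split, and the use of k-nearest-neighbor regression as the adjustable model, where the neighborhood size `k` stands in for the parameter count used as the complexity proxy in the conversation (small `k` is the flexible, high-variance end; `k` equal to the training-set size is the rigid, high-bias end).

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_point(x):
    # Hypothetical ground truth: a sine wave plus Gaussian noise.
    return (x, math.sin(x) + random.gauss(0.0, 0.3))

points = [make_point(i * 0.1) for i in range(60)]
train, test = points[0::2], points[1::2]  # 30 training points, 30 held out

def knn_predict(x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(data, k):
    """Mean squared error of the k-NN regressor on a data set."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in data) / len(data)

# k=1 memorizes the training data; k=len(train) predicts one global average.
errors = {k: (mse(train, k), mse(test, k)) for k in (1, 5, len(train))}
for k, (train_err, test_err) in errors.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With `k=1` the training error is exactly zero because every training point is its own nearest neighbor, noise included, while the held-out error stays above zero. With `k=len(train)` the model outputs a single average and makes a sizeable error even on the training data, which is the underfitting end of the curve.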
And as it happens in deep neural networks, what has been noticed is that the number of parameters that the model has far outstrips the instances of training data. And standard machine learning theory, which is what this bias-variance curve that we just talked about is based on, says that as you overparameterize, as your number of, you know, model parameters becomes much, much larger than the instances of training data, you should essentially overfit the training data, you should be in that regime where you're overfitting. And so the loss that you make on your test data should, you know, keep rising. And it turns out that that's sort of not what happens in deep learning. We don't have a good theory for why that's the case. Deep learning systems, deep neural networks seem to be flouting some of the accepted norms of standard machine learning theory. So even though they're heavily overparameterized, they do well on the held out test data. And this is called, you know, an ability to generalize, or the generalization error that they have is actually low. So they are showing a capacity to generalize despite being overparameterized. And the honest answer is, we don't know why that's the case. And the reason why in my book I call this aspect of deep learning systems terra incognita, it's not a term I came up with, it was something that one of the researchers that I was talking to said. I just mentioned the bias-variance curve: the standard machine learning systems kind of live in that region of the standard bias-variance curve. With deep learning systems, as it happens, your training error keeps falling and goes to zero and your test error, you know, reaches its maximum at the point where the training error reaches zero. At that point the machine learning system is said to have interpolated the training data. But then what they notice is that if you keep training, the test error starts falling again. 
And there is a portion of that curve now, which is kind of unknown territory, we don't really know why the machine learning system behaves or in this particular case, why the deep learning system or the deep neural network behaves in that manner. And that part of the bias-variance curve, it's also called double descent, is terra incognita, basically because we don't know why it's doing that. [00:29:00] Speaker 2: How does your book address the apparent contradiction between the statistical principles underlying traditional machine learning versus this crazy world that we live in now with these over-parameterized deep learning models? [00:29:19] Anil Ananthaswamy: I don't think we have a mathematical understanding of the apparent success of deep neural networks, even though they're heavily over-parameterized, right? The empirical data certainly requires more mathematical theory to explain why that's happening. We don't know the answer to that. So I don't think my book reconciles the two. It basically points out that there is standard machine learning theory, which, you know, tells you that this is how machines should work, you know, machines that learn should work. But we also know, just from the empirical results that we have about deep neural networks, that they are not behaving the same way. So the last chapter of my book essentially sets this up as a mystery, you know, not a profound mystery. I think people have some clues as to what's happening. But really, the formal mathematics is still lacking about why that's the case. So I wouldn't say that the book reconciles them, it just hopefully does a good job of explaining what the situation is, and telling the reader that we have literally entered unknown territory with these deep neural networks. [00:30:37] Speaker 2: What are your thoughts on self-supervised learning? 
So for example, ChatGPT, where we just train a model on the data itself, using the data as a label? [00:30:46] Anil Ananthaswamy: I think self-supervised learning was a really big breakthrough in machine learning. Because until then, we used, you know, the other type of learning, which is supervised learning, where humans had to annotate the data and tell the machine what that data meant. And then, you know, supervised learning is limited by the fact that we need human input to annotate all the data. And that's very, very expensive. So your ability to have extremely large data sets that the machine can, you know, analyze is restricted purely because of cost. And also, when humans annotate data and give labels to the data or categorize the data, the kind of learning machines do by looking at the data and then trying to match, you know, patterns that exist in the data to human-supplied labels is a very restrictive kind of learning. It's learning something very particular, right? So for instance, if you had a bunch of images of cows and a bunch of images of dogs that humans had labeled as cows or dogs, and the machine learning system was trying to figure out, oh, you know, this is an image of a cow. And this is an image of a dog. It might just pick up the fact that most of the cows are always in fields. So it might completely ignore the fact that there's a cow there, as long as it sees some grass, it says, oh, that's the image of a cow. And dogs maybe mostly are indoors or whatever. And so the kinds of things it might pick up in order to match the patterns that exist in the data to human-supplied labels might be very counterproductive. It might be doing exactly, you know, the wrong thing, or it might be doing things that are not particularly useful. Self-supervised learning was a very interesting breakthrough because essentially the entire technique relies on this idea that you can take a piece of data. 
Humans don't have to label it as anything. Humans are not involved in the mix. All you do is you take, let's say you take an image and you mask a, you know, portion of the image, let's say 50% of the image you mask. You feed the masked image to the machine learning system and ask it to predict the entire image, the unmasked image. You implicitly know what that unmasked image should be because you had it on the input side. But when you're asking the machine to complete the entire image by filling up the masked portion, in the beginning it's going to make errors. It's going to come up with some nonsense. But you know what the right solution is, because you always had that actual input in the first place. So you can, you can tell the machine that, oh, you've made an error and this is how much error you've made. Go and tune your parameters so that you're a little bit closer in your prediction the next time around. And you do this iteratively over and over again until the machine figures out how to take some masked image and generate the, you know, full image. And in doing so it learns features about the image that maybe wouldn't have been possible with supervised learning. Because here there's no label that is trying to match. It's actually trying to understand the structure, the statistical structure of the image itself. And something similar happens with language, you know, the kinds of things that ChatGPT is doing, right? You take, you take a sentence and you mask the last word of that sentence and ask it to predict the last word. It's going to make an error in the beginning, but you know what the last word is because you had that sentence in the first place. And, you know, you take the amount of error it makes, tune the parameters of the model in such a way that if you give it the same sentence again, ask it to predict the same missing word again, it, it will make a, it'll make an error again, but you know, it'll get slightly better. 
And you do this over and over again for that sentence until it gets it right. Now imagine doing this for every sentence on the internet. And before you know it, it has learned the statistical structure of human written language. And so then after that, no matter what sentence you give it and mask, you know, a word, it knows how to predict the next word, right? So the amazing part about self-supervised learning is that it can be easily automated, like there's almost no human intervention here. And the machine is really learning some very sophisticated statistical structures that are inherent in the data. Speaker 2: Do you think the future is supervised or unsupervised? Anil Ananthaswamy: So these are not my words. These are words that come from Alexei Efros at UC Berkeley. And he has very authoritatively said that the revolution will not be supervised. So basically implying, well, not even implying, explicitly saying that the revolution in AI will be unsupervised. Again, one obvious reason is that supervised learning requires human intervention in the sense that humans have to label the data, they have to annotate the data. And that's just not going to be possible at scale. You can do it for small data sets, even reasonably large data sets, but really to keep scaling up is going to be impossible. But also the kinds of things that a self-supervised system learns are very different from what a supervised system learns. So there's a richness to the learning that's happening in self-supervised systems. But for me, probably the biggest philosophical reason to think that the revolution is going to be self-supervised is that, you know, if you think about us humans, you know, nobody has sat around labeling the data for us. Our brains over evolutionary time have learned about patterns that exist in the natural world and have figured out how to help, you know, the body do its thing, move towards food, away from, you know, predators, towards prey, you know, find a mate, find food. 
All these things are, have happened in an unsupervised manner. And yes, of course, over the course of the developmental stages of a child, you know, parents do supervise their kids and we do some form of supervised learning. But that's a very small part of what humans learn. Much of what we have learned over evolutionary time and much of what we learn, even as we grow, is self-supervised. Or unsupervised. So given that that's how nature has done it, there's no reason to expect that the machines that we build are also not going to be powerful just because of that technique. [00:38:03] Speaker 2: Why does stochastic gradient descent work so well, given the complexity of the optimization problem? [00:38:09] Anil Ananthaswamy: Well, again, this is one of those, one of those things where we have empirical evidence that stochastic gradient descent works. Exactly why it works so effectively in optimizing deep neural networks is still an open question. There has been some work that suggests that the reason why stochastic gradient descent works is because it acts as an implicit regularizer. I can never say that word properly, regularizer. So, and the reason why it might be working is because it is automatically, or as part of the optimization process, it's pruning the number of parameters, making the model simpler so that it doesn't overfit and hence finds the, you know, necessary optimum. But there has also been work that has shown that deep neural networks will still find the optimal solution or near optimal solution even without stochastic gradient descent. So it doesn't seem like there is something particular about a regularization that has to do with the stochastic gradient descent that is responsible for its efficacy. So again, the honest answer here is that it's an open question and we know it works. We know it works amazingly well, even when it shouldn't, it seems like it's such an ad hoc thing to be doing, and yet it works beautifully. 
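For readers who want to see what distinguishes stochastic gradient descent from plain gradient descent, here is a minimal sketch on a made-up one-parameter regression problem. The data, learning rates, and function names (`full_batch`, `stochastic`) are assumptions chosen for illustration. It does not explain why the method works on deep networks, which the answer above says is still open; it only shows that updating on one random example at a time can reach a similar answer with far fewer gradient evaluations.

```python
import random

random.seed(0)

# Toy data: y is roughly 3 * x, so both methods should recover a slope near 3.
xs = [random.uniform(-1, 1) for _ in range(1000)]
data = [(x, 3 * x + random.gauss(0, 0.1)) for x in xs]

def gradient(w, x, y):
    # Derivative of the squared error (w * x - y) ** 2 with respect to w.
    return 2 * (w * x - y) * x

def full_batch(steps=100, lr=0.5):
    # Plain gradient descent: every step averages the gradient over all the data.
    w, evaluations = 0.0, 0
    for _ in range(steps):
        w -= lr * sum(gradient(w, x, y) for x, y in data) / len(data)
        evaluations += len(data)
    return w, evaluations

def stochastic(steps=1000, lr=0.1):
    # Stochastic gradient descent: every step uses one randomly chosen example.
    w, evaluations = 0.0, 0
    for _ in range(steps):
        x, y = random.choice(data)
        w -= lr * gradient(w, x, y)
        evaluations += 1
    return w, evaluations

w_full, cost_full = full_batch()
w_sgd, cost_sgd = stochastic()
print(f"full batch: slope {w_full:.3f} after {cost_full} gradient evaluations")
print(f"stochastic: slope {w_sgd:.3f} after {cost_sgd} gradient evaluations")
```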
It's of course very efficient. It's much faster than using pure gradient descent. But the exact reasons behind its efficacy are still not clear. Can you explain the curse of dimensionality? So when you think of something like the k-nearest neighbor algorithm, what that algorithm does is it turns data into vectors and plots them in some high dimensional space. So let's say we have a thousand images of cats and a thousand images of dogs, each 10 by 10 pixels. And you can imagine that each pixel, if it's grayscale, has a value between zero and 255. So each image can be turned into a vector that is a hundred numbers long. And that vector can be plotted in hundred dimensional space, one pixel per axis. And what will happen more or less is that all vectors representing cats will end up in one region of that high dimensional space. And all vectors representing dogs will end up in a different part of that high dimensional space. And then when you have a new image that you don't know whether it's a cat or a dog, you turn that image into a vector and then you plot it in that same high dimensional space and see, oh, is it closer to dogs or is it closer to cats? If it's closer to dogs, you call this new image a dog. If it's closer to cats, you call the new image a cat, right? This procedure depends on this central idea that vectors representing similar things are near each other in this high dimensional space. So the new image, if let's say it's a dog, when you plot it in that high dimensional space should be closest to other dogs in that space.
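The nearest-neighbor procedure described above can be sketched as follows. The "images" here are synthetic: each is a list of 100 noisy pixel values around a class-specific brightness, which is purely an assumption so that the example runs without real photos, and the sketch uses 200 images per class rather than a thousand to keep it quick. The helper names `fake_image` and `classify` are hypothetical.

```python
import math
import random
from collections import Counter

random.seed(0)
PIXELS = 100  # a 10 by 10 grayscale image flattened into a 100-number vector

def fake_image(mean_brightness):
    # Stand-in for a real photo: noisy pixel values clipped to the 0..255 range.
    return [min(255.0, max(0.0, random.gauss(mean_brightness, 40)))
            for _ in range(PIXELS)]

# Assumption for illustration: "cats" are darker on average than "dogs".
training = ([(fake_image(90), "cat") for _ in range(200)]
            + [(fake_image(160), "dog") for _ in range(200)])

def classify(image, k=5):
    # Sort the stored vectors by Euclidean distance to the new vector,
    # then take a majority vote among the k closest ones.
    nearest = sorted(training, key=lambda item: math.dist(item[0], image))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

new_dog = fake_image(160)
new_cat = fake_image(90)
print("new dog-like image classified as:", classify(new_dog))
print("new cat-like image classified as:", classify(new_cat))
```

Everything hinges on `math.dist`: the method has no notion of cat or dog beyond which stored vectors are closest.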
Now one funny thing that happens when you move into higher and higher dimensions is that, let's say the image was, I don't know, a million pixels. So now you're operating with a vector which has a million elements and so you are in a million dimensional space. It turns out that the idea that similar things are closer in these high dimensional spaces than things that are not similar, that whole idea falls apart as you start moving into higher and higher dimensions. And that is the curse of dimensionality. The very metric that you use in order to compare vectors starts falling apart because in these high dimensional spaces, everything is just as far away from everything else. So the notion of similarity, that two things are similar because they're close to each other, doesn't work anymore. And that in a sense is the curse of dimensionality. And as your data starts becoming more and more high dimensional, you cannot use some of these algorithms that rely on a notion of similarity by just using some distance metric between the vectors. [00:43:18] Speaker 2: Can you explain the context of emergence in language models and why do you think it's a little bit of a slippery concept and challenging to explain? [00:43:26] Anil Ananthaswamy: Emergent behavior has probably garnered more attention than it deserves. I mean, the term seems to suggest something mysterious and magical that's happening. And it refers to this idea that, as large language models like ChatGPT started getting bigger and bigger, they started demonstrating behaviors that weren't observed in smaller models. And in essence, that's all emergence is.
It's basically saying that if, if there's a certain kind of task that you asked a smaller model like GPT-2 to, to, to perform and it failed, but then you built a larger model like GPT-3 or 3.5 or GPT-4, nothing fundamentally changed in the underlying mathematics or in the underlying architecture of these large language models. There's nothing different about the way they are trained. Uh, everything is the same. All that has happened is these models have been scaled up. They've become bigger. They've seen more data, but the fundamental sort of mathematics underlying their training, the fundamental architecture, uh, you know, that underpins these neural networks that hasn't changed. And yet when these things get bigger, you take the same problem that you gave to GPT-2, it could not solve it. And you give that problem now to GPT-3.5 or GPT-4 and it solves it. And that behavior is being called emergent behavior is emerging simply because you're making something bigger. Uh, it's certainly not magical. You know, of course, these systems have, uh, become bigger. They've seen more data. So they're able to do much, much more sophisticated pattern matching. They are, they're able to learn much more sophisticated correlations that exist in the data. So it's not surprising that, uh, that they're going to do things that the smaller models couldn't. Uh, but it's not like some kind of behavior that cannot be explained. The term emergence seems to suggest something mysterious and it's not. Depending on how you use the word emergence, you know, either you just define it simply as saying that, okay, all it is, is behavior that a smaller model couldn't, uh, do. And now that behavior is being observed in a larger model and it seems to do it correctly. 
If emergence is simply the fact that certain capabilities arise as you make the model bigger, mainly because it has seen more data and it just has a larger number of parameters and hence is able to process the data in ways that the smaller model couldn't, if you just look at it that way, then there's nothing to be skeptical about. It just makes sense that that would be the case. But if you want to use the term to imply something that is absolutely not understood, I mean, yes, there are aspects of why this happens that are still being worked out mathematically. But if you put a sheen of mystery around it, then I think I would be skeptical. It's not like that. It's not a sudden appearance of some ability in a large language model. It is a very gradual ability that emerges. I mean, also one of the things to note is that we build GPT-2, which has a certain number of parameters, and then we build GPT-3, which has an order of magnitude more. And when we test GPT-3, we see some behavior which wasn't present in GPT-2. And we think that that's a certain transition, that something just happened between these two things. But the fact is that GPT-3 has, let's say, 10 times more parameters than GPT-2. We didn't build models that were twice as big, thrice as big. We just went from something that had one set of parameters to something that has 10 times more. But if you had built the intermediate stages also and checked their behavior, you probably would have seen a gradual increase in ability, not the sudden step change that seems to come about. So in that sense, again, it's not emergence in any magical sense that it just appears suddenly. It is a very gradual process. How do deep learning models compare with human cognition? I think we have to be really careful comparing deep learning models to human cognition or human cognitive abilities.
There are models that people have started developing that model, for instance, the human visual system or the human auditory system, even the olfactory system. And they are the best models we have to date about what might be happening in the brain. But they are not exact models. They're not telling us exactly what's happening in the brain. They recapitulate some of the behaviors that we see in our biological systems, whether it's a human brain or other primate brains, but are they replicating the exact mechanisms that are there in our nervous system, in our brains? Absolutely not. I mean, for instance, most of these deep learning models are what are called feed forward. You have input coming in on one side and the information just flows from the input to the output. There is no recurrence. So for instance, if you have neurons in the 10th layer, the outputs of those neurons in the 10th layer don't feed back to the 10th layer or to layers 9, 8, 7 and earlier. So the output of the 10th layer has to move forward. It has to go on to the 11th and 12th and so on. Our brains are not like that. In fact, the recurrent connections probably outnumber the feed forward connections in the brain. So there are a lot of feedback loops in the brain. And the current models we have do not have this kind of recurrence. So however close these deep learning models seem to be to what might be happening in our brains, they lack some very obvious architectural details. So they can't be telling us exactly what's happening. That said, they're the best we have right now. And they're definitely shedding light on how our brains might be processing information. [00:50:24] Speaker 2: How do inductive priors work in machine learning models? So things like symmetry invariance and permutation invariance and stuff like that.
[00:50:33] Anil Ananthaswamy: So inductive priors are essentially information that we can somehow incorporate into the architecture of the deep neural network based on ideas we have about, uh, how certain kinds of information need to be processed. For example, if you take things like convolutional neural networks, they were inspired by what we understand about the human visual system or the primate visual system. Um, and we know that, uh, that there's a certain hierarchy, uh, involved in the way our visual system processes information that's coming in. You know, there's a, there's a certain, uh, amount of processing that happens that has to do with identifying low level features of images. So for instance, if I'm looking at a, you know, a cup, uh, the visual system is identifying the edges, the curves, the shapes, the texture before it puts it all together and says, oh, this is a cup. Um, and, but this is happening in, you know, in stages. There's also invariance built into the human visual system. So for instance, if there is an edge detector in our visual system, that edge can, you know, be anywhere in the visual field and it should still be, you know, the visual system should still be capable of detecting that, uh, or the edge can be tilted and it should be able to, you know, still be able to detect that it's an edge. So there's rotational invariance, translational invariance, and we've taken these ideas that we learned from observing the, you know, uh, the animal visual system and incorporated those things into designs of deep neural networks. So, so that's how the first convolutional neural networks came about. So these were the inductive priors, so to say. So we, we had prior information about what these networks should be doing that were baked into the architecture of the system. 
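The translational part of this idea can be illustrated with a small sketch, assuming a one-dimensional "image" and a hand-written two-number edge detector (both invented for this example). Because the same detector is slid across every position, its response simply moves along with the edge, and taking the maximum response over all positions, a crude stand-in for pooling, gives the same answer wherever the edge sits. Rotation is not covered by this sketch.

```python
def slide(signal, kernel):
    # Apply the same small detector at every position of the signal.
    n = len(kernel)
    return [sum(k * s for k, s in zip(kernel, signal[i:i + n]))
            for i in range(len(signal) - n + 1)]

def step_edge(position, length=12):
    # A dark-to-bright edge: zeros up to `position`, ones afterwards.
    return [0] * position + [1] * (length - position)

edge_detector = [-1, 1]  # responds where brightness jumps from one pixel to the next

early = slide(step_edge(3), edge_detector)
late = slide(step_edge(7), edge_detector)

print("edge at 3, responses:", early)
print("edge at 7, responses:", late)
print("strongest response, edge at 3:", max(early), "at index", early.index(max(early)))
print("strongest response, edge at 7:", max(late), "at index", late.index(max(late)))
```

The design choice being illustrated is weight sharing: one detector reused everywhere, rather than a separate detector learned for each location.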
So, uh, there are other, uh, other examples of this, uh, where we, we already build in prior knowledge about what we think we need in order to make more sense of the data into the architecture of the system. Can you explain the backprop, uh, algorithm and its history? The backpropagation algorithm is probably one of those, uh, uh, algorithms that I particularly, personally found quite elegant and is a significant part of my book. Uh, and it's also a very significant part of, uh, why, um, you know, deep learning and deep neural networks have succeeded so brilliantly. The basic idea behind backpropagation is very straightforward. Again, if you go back to the late 1950s, early 1960s, we just had single layer neural networks. So you provided the neural network an input, it produced an output, and then you, you figure out whether the network made an error by looking at the output and the expected output and what it does. You, you calculate an error, and based on that error, you just modify the strengths of the connections of the neurons, the weights of the neurons. Um, and those algorithms worked as long as it was just a single layer. The moment you put another layer between the output and the input, the so-called hidden layer, the algorithm, the algorithm couldn't work anymore. And the reason was that what you had to do was every time the network made an error, you calculated the loss that it made, uh, on its prediction. And you had to then figure out, there's this problem of credit assignment. You have to figure out how much of that error that the network has made, uh, should be apportioned to each of the weights of the network, right? Uh, if it was just a single layer, then it's easy to take that loss and apportion it to the weights of the single layer. 
But the moment you have a hidden layer, it was very hard to figure out how to backpropagate, or move backwards from the output stage back to the input stage, and allocate to each weight what its responsibility was for the error that the network made. And this was something that Frank Rosenblatt, who came up with the perceptron algorithm in 1959, was aware of. So in his book in 1961, Principles of Neurodynamics, he had identified this problem: the moment we have a multi-layer neural network, you're going to have this problem of having to backpropagate your errors from the output side all the way back to the input side so that every weight in your network is adjusted accordingly. He just didn't know how to do it. He had identified the problem. Also in the 1960s, there were aeronautical and electronics engineers who were building control systems for controlling the trajectory of rockets. Henry Kelley and, I forget, Arthur Bryson, I think. So the algorithm is called Kelley-Bryson. They had some form of this backpropagation algorithm, even though it wasn't called that, to be able to design systems that could help control the trajectory of rockets as they're going in space. I think in 1962 Stuart Dreyfus came up with a use of the chain rule in calculus to actually make the Kelley-Bryson algorithm better. So these elements were slowly falling into place. Then sometime, I think, in 1967, there was a Japanese researcher, Shun'ichi Amari, who also figured out some aspects of the backpropagation algorithm. Again, none of these were very well fleshed out, but the bits and pieces were falling into place.
And there's a whole history of this topic on Jürgen Schmidhuber's website that one can go look up, where he also mentions, for instance, Seppo Linnainmaa, who comes up, I think it would have been 1970, where he creates the code necessary for efficient backpropagation. In 1974, Paul Werbos, who was doing his PhD at Harvard, develops what can be called the closest version of the modern backpropagation algorithm for his PhD thesis, which had more to do with behavioral sciences. It wasn't really addressing neural networks. So all of this stuff was happening, but the real breakthrough happens in 1986, when Rumelhart, Hinton, and Williams published their paper, just a three or four page paper in Nature, about the backpropagation algorithm. So now, finally, this algorithm was being talked of specifically for training neural networks, and for neural networks with hidden layers. And not only did they formalize the algorithm, but they also pointed out that if you use this algorithm to train multi-layer neural networks, they learn certain kinds of things about the data. So they identified what they call feature learning, or representation learning. They could identify what kinds of things the neural networks are learning because you use this backpropagation algorithm. So finally, in 1986, I think people woke up to the fact that, okay, there's this formal thing. And rightfully or wrongfully, a lot of the credit is given to, say in this case, Geoff Hinton, because he is currently regarded as one of the main people behind the backpropagation algorithm. But even he would say that, look, if Rumelhart had been alive, he would be the guy getting all the credit.
And not just that, Hinton also acknowledges that there is a long history to this algorithm, that they were just the people who put it all together and made it palatable to the neural network community, but the ideas predate them by decades. [00:59:16] Speaker 2: Do machine learning models reason? And if they do reason, why do you think they reason? And how do you think their reasoning is different to ours? [00:59:22] Anil Ananthaswamy: Not really. If you think of reasoning as what we do as humans, we have this ability to learn something about how to solve a problem in a particular domain. Not only do we learn how to solve the task, we are capable of abstracting the principles involved in solving the task. And then we are able to transfer those principles using symbolic language like mathematics or just language, to then reason about or solve problems in some other entirely different domain. And that kind of symbolic thinking is not what machine learning models are doing, right? Machine learning models are essentially very, very, very sophisticated pattern matching machines. So they can detect patterns in data that humans might even miss. So they're very good at that. And it's true that there's a large class of problems that can be solved if you are a very good pattern matching machine, right? If you can identify correlations between inputs and outputs, and sophisticated statistical correlations at that, that might be sufficient for solving a large class of problems. And that's currently what's happening with these machines. So depending on what questions you ask them, if these questions are the type that only require the machine to dig deep into its understanding of the statistical correlations that exist in the data and it can solve the problem, it will seem like reasoning when you look at the answer.
But it's not reasoning in the way we think of as human reasoning. Nonetheless, depending on where you set the bar as to what constitutes reasoning, you could say machines are reasoning, but only in a very limited sense, right? These machines right now, machine learning systems, are essentially very, very sophisticated correlation machines. [01:01:39] Speaker 2: What do you think that readers will take back home from your exploration of mathematical foundations in machine learning? [01:01:47] Anil Ananthaswamy: I would hope that readers of Why Machines Learn are going to be appreciative of what I think is fairly elegant math that underlies or underpins machine learning, that these machines learn because the math says that it's possible. So I would like them to be able to gain an appreciation for all these goings on under the hood, so to speak, the math that makes it possible. And the math helps us kind of visualize and conceptualize how machines are quote unquote thinking. I mean, they're not really thinking, but you know what I mean. So by understanding the math, we really do get a glimpse into how machines might be processing information. The other, I think more important part for me is that I honestly, very sincerely believe that we can't leave the building of these AI systems to just the practitioners, to just the people who are building them today.
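The credit assignment problem, and the chain-rule idea that eventually resolved it, can be made concrete with a small sketch. It assumes a deliberately tiny network (two inputs, two sigmoid hidden units, one linear output, no bias terms) with arbitrary made-up weights, and is not a reconstruction of any of the historical formulations mentioned here. The error at the output is passed backwards through the output weights and the slope of each hidden unit to get each hidden weight's share of the blame, and the result is compared against a brute-force numerical estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w_hidden, w_out, x):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sum(w * h for w, h in zip(w_out, hidden))
    return hidden, output

def loss(w_hidden, w_out, x, target):
    _, output = forward(w_hidden, w_out, x)
    return 0.5 * (output - target) ** 2

def backprop(w_hidden, w_out, x, target):
    hidden, output = forward(w_hidden, w_out, x)
    d_output = output - target                 # how the loss changes with the output
    grad_out = [d_output * h for h in hidden]  # blame for each output-layer weight
    # Chain rule: send the error back through each output weight and the
    # sigmoid slope h * (1 - h) to reach the hidden layer.
    d_hidden = [d_output * w * h * (1 - h) for w, h in zip(w_out, hidden)]
    grad_hidden = [[d * xi for xi in x] for d in d_hidden]
    return grad_hidden, grad_out

# Arbitrary illustrative weights, input, and target.
w_hidden = [[0.5, -0.3], [0.8, 0.2]]
w_out = [1.0, -1.5]
x, target = [1.0, 2.0], 0.7

grad_hidden, grad_out = backprop(w_hidden, w_out, x, target)

# Sanity check: nudge one hidden weight and measure the change in loss directly.
eps = 1e-5
bumped_up = [row[:] for row in w_hidden]
bumped_down = [row[:] for row in w_hidden]
bumped_up[0][1] += eps
bumped_down[0][1] -= eps
numeric = (loss(bumped_up, w_out, x, target)
           - loss(bumped_down, w_out, x, target)) / (2 * eps)

print("backprop gradient for hidden weight [0][1]:", grad_hidden[0][1])
print("numerical estimate for the same weight:    ", numeric)
```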
We need, uh, more people in our society, whether they are science communicators, journalists, policy makers, just really interested users of the technology, but who have some math, uh, background or, or people who are just willing to persist and learn enough of the math to make sense of why machines learn, uh, in order to be able to appreciate, you know, we, we are best, you know, we are making these machines quite powerful and, and the power comes, uh, from the algorithms we design and the math that makes the algorithms work. So understanding the math is going to tell us about, uh, you know, how powerful these things are going to get, but it's also going to tell us about the limitations, right? So, so it's only when we understand the math that we can point out that, hang on, these things are not reasoning in the way we think we are reasoning. It's, it's because the math clearly shows that what's happening right now is that these machines are, you know, just doing very sophisticated pattern matching. [01:04:07] Speaker 2: So ChatGPT, it, it hallucinates. Sometimes it gets the answer right. Sometimes it gets the answer wrong. Do you think that affects their reliability and utility in real world situations? And I guess as an extension of that, do they understand, right? And, and what would it mean for them to understand? [01:04:27] Anil Ananthaswamy: Yeah, it's true that, uh, LLMs are always hallucinating. I think the term hallucination has often colloquially been used only when LLMs get things wrong. But if you look at the way LLMs function, everything that they're doing is essentially hallucinating. And, and, and I think that word really loses its meaning if you realize that that's just how they work. They are essentially, you know, given a piece of text, they are producing the next most likely word to follow that text. 
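The next-most-likely-word loop being described can be sketched with a toy lookup table. Everything in the table is invented for illustration, and the sketch makes two simplifying assumptions that real language models do not: it looks only at the last word rather than the whole preceding text, and it always takes the single most probable word rather than sampling. The point is only that the same mechanical loop runs whether or not the resulting sentence happens to be true.

```python
# Hypothetical probabilities of the next word given the previous word.
NEXT_WORD_PROBS = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"<end>": 0.8, "down": 0.2},
    "ran": {"<end>": 0.9, "away": 0.1},
}

def generate(prompt, max_words=10):
    text = list(prompt)
    for _ in range(max_words):
        # Unknown words fall through to the end token so the loop always stops.
        probs = NEXT_WORD_PROBS.get(text[-1], {"<end>": 1.0})
        word = max(probs, key=probs.get)  # pick the most likely continuation
        if word == "<end>":
            break
        text.append(word)  # append it and predict again from the longer text
    return text

print(" ".join(generate(["<start>"])))
print(" ".join(generate(["a"])))
```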
They append that word to the original piece of text and they predict the next most likely word, and then the next most likely word and so on until they produce like an end of token or end of text token and the whole thing stops. So at each stage, it's essentially a probabilistic statement about what is the most likely word to follow the text that you've already given. It doesn't matter whether the answer is right or wrong. The process is always the same. It just so happens that when the LLM is big enough, these probabilities that it is internally generating in order to make its best guess about what should come next, they get better and better. So the answers can start looking like the LLM is reasoning or the LLM is thinking, etc. But the process, whether it's getting it wrong or whether it's getting it right, is always the same. So given that they are using the same process, so-called hallucination to whether, you know, to come up with answers that are either right or wrong, it's really hard to know when the answers they're producing is correct and when it's wrong. It almost requires a human expert to be able to look over what an LLM is producing in order to ensure that it's producing the correct output. Now, there will always be certain tasks that an LLM can be asked to do where most of what it does, even if it gets things a little bit wrong, is still pretty amazing. For instance, you know, when you're doing Python coding, these LLMs can be extremely good assistants. They can, you know, generate so much code, so much, you know, and so fast that a lot of your basic coding is already done. And if you have enough expertise, you can look it over very quickly and make sure it's doing what it's supposed to do. So they can be very good assistants as long as the human who's using them has enough expertise to be able to tell right from wrong. But are they actually understanding what they're producing? This is a matter of huge debate. 
It really depends on what you define as understanding, where you set the bar for what constitutes semantic understanding of language. And depending on where you set the bar, LLMs either clear it handsomely, they're very good at it, or they fail miserably. And it's really up to, you know, it's really definitional. If you define understanding in a way that, you know, only humans will ever be able to answer those questions, LLMs will probably fail them miserably. But there are certain things that LLMs do that are just as good as what humans can do. Because the notion of understanding is set at that level. So this is a question of semantics. [01:08:06] Speaker 2: And I would say the debate is still playing out. What is your definition of intelligence? And do you [01:08:14] Anil Ananthaswamy: think that deep learning models are intelligent? Well, intelligence is a really, really difficult term to define. I don't think I even try defining it in my book. Most people who write about AI try not to define it. But I think the reason why it's hard to define is because intelligence means different things in different contexts, right? The kind of intelligence that a dog needs to have to function in its environment is very different from the kind of intelligence, you know, an elephant might need, or a whale might need, or for that matter, humans, right? So our intelligences, each particular type of intelligence is the outcome of having a particular kind of body that has to navigate its environment and function in its, you know, cultural context, or social context, or whatever it might be. And as long as the nervous system and the brain and the body all taken together are capable of helping the body function in its environment to peak capacity, you would say that that system is intelligent for that purpose. And so it's hard to come up with just some sort of, you know, abstract notion of intelligence that applies across the board. 
So if you think of intelligence like that, are AI systems intelligent? Again, it's a matter of what you're defining the task. There are certain tasks, you know, if intelligence is playing chess, without really knowing how the machine is doing it, let's say all you're doing is trying to play chess with a machine, and you're defining the ability to win at that game of chess as a kind of intelligence that is necessary to play chess. Yes, machines are intelligent. They can, they can beat us hands down now, pretty much anyone, when it comes to playing chess or so many other games, right? This is not about what's happening under the hood. It's just about looking at the behavior and saying, is the behavior manifesting a kind of intelligence that, you know, is required to achieve the goal. So yeah, I think to me, this is a slippery slope, you can define it however you want it. And in some cases, the machines will be termed intelligent. In other cases, absolutely not, right? So yeah, we have to be very careful about how we use this term. There is certainly no such thing as a completely general intelligence that somehow abstracts away all notions of intelligence and makes it decoupled from the bodies in which we function. So [01:11:05] Speaker 2: may be possible at some point, but I don't think we're there yet. Do large language models have agency and what does that mean? Agency from the perspective of humans is [01:11:17] Anil Ananthaswamy: this feeling we have of being agents of our actions, right? So if I were to pick up a mug of coffee, I have an implicit feeling that I will with that action into existence and that I am the agent of that action. And there is an internal sensation of being someone who is directing this body's actions in the world and also being the recipient of the experiences, right? So there is, we just have that feeling of being agents. Now our AI systems at this point, do they have a sense of agency? 
So we can certainly build, you know, robotic systems that model themselves as agents in the world. So that's very different from saying that the robot has a sense of agency, that it feels the way about itself, the way we do about ourselves. I would say that at this point, we can certainly build robotic systems that act as agents in the world. But I don't think anyone would really claim at this point that they have an internal sense of agency. Those are two separate things and we're a long way from having robots that can claim to internally feel that they are agents. [01:12:44] Speaker 2: Who was responsible for the deep learning revolution? We talked about how, you know, [01:12:51] Anil Ananthaswamy: the backpropagation algorithm in the mid 1980s became a big deal because that's what allowed us to train deep neural networks, neural networks that had more than one hidden layer. But it wasn't enough. Even though we could, we had the mathematics now to train deep neural networks, we couldn't do anything particularly effective with them. Because at that time in the, in the mid to late 1980s and even through the 1990s, the amount of data that we had that we needed to train these neural networks was very small. We just did not have enough data. And that had to change. And that did change by somewhere around 2007, 2008 onwards, you know, one of the first big data sets that came about was the ImageNet data set, which was, I forget, millions of images and, you know, all annotated by humans, you know, lots and lots of different categories of images. So we finally had a very, very large data set on which to train the neural neural networks. But so we had backpropagation algorithm in place, we had large, a large data set in place. The one other thing that was missing was, you know, these, the training of a neural network is computationally extremely expensive. It takes an awful long time to train these things. 
And people around 2010 started noticing that instead of training these neural networks using CPUs, central processing units, there's a much better way to train them. And that is to use these graphical processing units, which were actually designed for gaming. They had, they were not built and designed for training neural networks, but people realized that they could co-opt GPUs to train these systems much faster. So it was a combination of, you know, the backpropagation algorithm, which was, you know, fairly old by then, then the advent of really large amounts of training data, and the ability to use graphical processing units for training them. All these things came together, and I think it was 2011 or so when the first deep neural network named AlexNet kind of finally broke through and, and showed how it could do image recognition better than anything else that existed [01:15:20] Speaker 2: before. Anil, your book is a tiny bit connectionist leaning, although not entirely. But what do you think about some of the other methods in AI, like, for example, you know, symbolic methods, [01:15:31] Anil Ananthaswamy: evolutionary methods, biomimetic methods, etc? So, so my book is, I'm not sure I would say it's connectionist centric. It's a machine learning, it's a book on machine learning. So there are, you know, the history starts with the history of connectionism with the perceptron algorithm and also the Widrow-Hoff-Least-Mean-Squares algorithm, which are both algorithms that are used for training single layer neural networks. But then there's the whole intervening history of machine learning, which has nothing to do with connectionism. So ideas from, you know, whether it's a naive base classifier, the optimal base classifier, the k-nearest neighbor algorithms, the support vector machines, all of these are our principal component analysis and which is a statistical method that then can be used for, you know, unsupervised learning, etc. 
All these are, you know, very important and are non-connectionist. But yes, it's true that the latter half of the book kind of focuses on the recent developments. By recent, I mean in the last two decades where the focus shifted back to neural networks. Is it Hinton-centric? Hinton is a character in one of the chapters. I mean, the backpropagation algorithm is really about the Rumelhart-Hinton-Williams paper. So in that sense, you know, Hinton is front and center in that chapter. And then he reappears in the chapter on convolutional neural networks because of AlexNet, which was his team's breakthrough. Those were kind of unavoidable milestones. I don't think there's anything more about Hinton in the book. Because the book is really about machine learning, it really doesn't deal with symbolic AI. By symbolic AI, I'm assuming you're talking about the kind of AI that preceded machine learning, which these days is kind of called good old-fashioned AI. And, you know, the problem with symbolic AI, while it was very good at what it did, was that it couldn't learn about patterns that exist in data by just simply examining the data. So it required a lot of human effort to make it work. It was very brittle. But, you know, the ideas from symbolic AI are really going to be very important if we are going to get machines to reason. And I do think that the things that are coming now are going to combine the abilities of deep learning systems to learn about patterns that exist in data. And on the back end, we might have symbolic architectures that allow us to reason about those patterns in ways that we humans seem to be capable of doing. So I don't think they should be thought of as either-or systems. They are going to be put together in ways that we don't quite know yet how to do fully.
There are already ongoing attempts and, you know, the entire field is called neurosymbolic AI, where you're taking the connectionist approach and the symbolic approach and putting them together. So I'm for that. I think if it helps achieve, you know, systems that can actually do the kind of abstract reasoning that humans can, why not? Biomimetic evolutionary algorithms, you know, searching over the space of possibilities, which is what evolutionary algorithms do so well, will also be a part of, you know, searching for architectures of deep neural networks that work better than others. Biomimicry is already in place. I mean, you know, the inductive biases that go into building convolutional neural networks are already inspired by, you know, what we think of our visual system, the human visual system. And even in artificial neural networks, the artificial neuron is very, very loosely inspired by what a biological neuron does. So biomimicry is already an integral part of how things happen. That's only going to get more and more important. For instance, we need to figure out why our brains are so much more energy efficient than artificial neural networks. Artificial neural networks, deep neural networks of today, consume ridiculous amounts of energy to do something that is still way less than what our brains are capable of. And our brains are doing this with some 20 watts of power. And one of the reasons, not the entire reason, but one of the reasons is that our neurons are not firing all the time. They are what are called spiking neurons. So, you know, inputs come into the neuron, the neuron does some computation, and every, you know, every now and then it will send out a voltage spike. That's a very different kind of functioning than what is actually happening in artificial neural networks today.
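Editor's sketch: the contrast drawn above between a spiking neuron and a conventional artificial neuron can be illustrated with a leaky integrate-and-fire model. This is a standard textbook simplification, not something taken from the interview or the book, and the function name `simulate_lif` and the `threshold`, `leak` and `reset` values are illustrative assumptions.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire neuron: returns one 0/1 spike flag per time step.

    All parameter values are illustrative assumptions, not biological constants.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        # Part of the stored potential leaks away, then the new input is added.
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)    # threshold crossed: emit a spike
            potential = reset   # and reset the membrane potential
        else:
            spikes.append(0)    # stay silent on this step
    return spikes


def relu_neuron(input_current):
    """A conventional artificial neuron (ReLU) emits a value on every step."""
    return [max(0.0, x) for x in input_current]


steady_input = [0.3] * 20
spikes = simulate_lif(steady_input)
dense_outputs = relu_neuron(steady_input)

print("spikes:", spikes)
print("spike count:", sum(spikes), "of", len(spikes), "steps")
print("non-zero ReLU outputs:", sum(1 for v in dense_outputs if v > 0))
```

With this constant input the spiking unit fires on only 5 of the 20 steps, while the ReLU unit produces a non-zero output on all 20. That sparsity is the intuition behind the energy argument, though real savings depend on hardware that can exploit it.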
So if we get inspired by these spiking neurons in biological systems and learn how to build them in hardware, let's say we build spiking neurons in hardware, and we figure out how to train them and how to, you know, do inference with them in hardware, well, that will be a huge leap in terms of energy efficiency. And that would very much be a, you know, a biomimicry idea. [01:21:12] Speaker 2: It's a big responsibility to write a history of the field in a book. And of course, many different folks have wildly different histories of the field, like, for example, Jürgen Schmidhuber, although I do appreciate that you did get some input from Jürgen in writing this book. What are your reflections on that? [01:21:31] Anil Ananthaswamy: First off, I agree that, you know, we have to be, as writers, responsible to the history of the field. And we have to do our best to capture it as accurately as possible. That said, my intent in this book was, first and foremost, to capture the mathematical ideas. And those are not that different, you know, across different ways of looking at the history. So, once I identified what the math was that I needed to explain, then finding the stories to anchor those mathematical ideas was important. And, you know, I chose a certain set of people to interview and help underpin the narrative. But, for instance, I agree that Jürgen Schmidhuber has, you know, contributed enormously to the field. It would be impossible to do an exhaustive narrative of all the different things that all the different people in machine learning have done over the past decades. Already, my book, for instance, is about 450 pages. And so the way I approached it was to tell the story of certain developments through the lens of a few people, but then try very hard to make sure that the others get acknowledged too, right?
So, for instance, Schmidhuber is acknowledged in the book as someone who has contributed to LSTMs, these recurrent neural networks. It's just that I don't talk about recurrent neural networks in my book. So, I don't, you know, delve into that deeply, but I do mention Schmidhuber's contribution. Even with convolutional neural networks, the use of GPUs is often attributed to Hinton and others as having made it de rigueur to use GPUs. And, you know, AlexNet was the one that used GPUs and made it very popular. But Schmidhuber had done that earlier, too. He may not have done it at scale, but certainly the ideas were there in his paper, and I made sure I acknowledged that. Or if you take the, you know, backpropagation algorithm, again, Schmidhuber's pointing out that Seppo Linnainmaa had come up with the ideas for coding efficient backpropagation. I tell the reader that, okay, you know, there are these resources, you should go look it up. So, that was my approach, to try and make sure that anytime there was an alternate viewpoint that warranted mention, I at least mentioned it. But then, in service of the book, which is about the conceptual aspects of math, I still had to find a narrative that would be one way of telling the story. [01:24:30] Speaker 2: What are your thoughts on scaling laws with respect to how we continue to improve AI systems? I mean, do you think that we will hit any theoretical or mathematical limitations as we continue to scale this technology? [01:24:47] Anil Ananthaswamy: So, the scaling laws that we have right now about the behaviour of deep neural networks, these are empirical scaling laws, in the sense that we have observed the behaviour of these systems and we have figured out that their behaviour kind of follows a particular set of laws. There's no underlying deep mathematical understanding of why these laws are what they are.
Given that, it's really hard to say whether these scaling laws will keep holding as we make these systems bigger and bigger. You know, if there was a real hardcore mathematical result that says that yes, absolutely, then yes, you would expect things to continue. But right now, these are empirical results and it could very well be that we'll find out in a year or two that if we keep making these systems bigger, their performance may not scale the same way as it has so far. Things might saturate and, you know, oftentimes when we have such scaling laws in other systems, we eventually notice saturation. Things improve according to some power law up to a point and then at some point they stop. You know, there's a law of diminishing returns. So, given the lack of exact mathematical sort of results, it's very hard to say, okay, this trend is going to continue forever and ever. [01:26:19] Speaker 2: Are there any clear computational limitations to the deep learning paradigm? [01:26:23] Anil Ananthaswamy: I think it depends again on what you want your deep learning system to do. If, for instance, we are asking the question, are deep learning systems going to be capable of a certain kind of reasoning? You know, let's say the kind of reasoning that humans can do, which is to take a complex task and break it up into small subtasks and then apply these subtasks in clever ways to achieve a perfect result. This is something called compositionality. And will deep learning systems get there just by using the techniques we have so far for training them using, say, even self-supervised learning? Probably not, because there are already some mathematical results that are showing that there might be an inherent mathematical backstop to how much compositionality can be done by, for instance, these transformer-based architectures. So there might be mathematical limitations.
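Editor's sketch: the earlier point in this answer, that an empirical power law can look solid and still saturate, can be made concrete with a toy calculation. Everything here is hypothetical: the loss curve `a * n ** (-b) + floor`, the parameter values and the model sizes are made up for illustration and do not describe any real system.

```python
import math


def toy_loss(n_params, a=10.0, b=0.3, floor=0.0):
    """Hypothetical loss curve: a power law plus an irreducible floor."""
    return a * n_params ** (-b) + floor


def fit_power_law(sizes, losses):
    """Least-squares line through (log size, log loss); returns (a, b)
    for the pure power law loss = a * size ** (-b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# "Observed" losses at small scale, generated with a hidden floor of 0.05.
small_sizes = [1e3, 1e4, 1e5, 1e6]
observed = [toy_loss(n, floor=0.05) for n in small_sizes]

# Fit a pure power law to those points and extrapolate far beyond them.
a_hat, b_hat = fit_power_law(small_sizes, observed)
big_size = 1e12
predicted = a_hat * big_size ** (-b_hat)
actual = toy_loss(big_size, floor=0.05)

print(f"fitted exponent: {b_hat:.3f}")
print(f"extrapolated loss at 1e12: {predicted:.4f}")
print(f"loss with the hidden floor: {actual:.4f}")
```

In this toy setup the fitted curve tracks the small-scale points reasonably well, yet its extrapolation falls well below the floor that the generating curve can never cross. Without a theory that says which functional form is right, the small-scale data alone cannot settle the question.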
Again, without a complete understanding of why these neural networks are doing what they're doing, it's always hard to make unequivocal claims about what they might or might not be able to do. And I think we have to remain a little bit open-minded about it. I mean, for me, the thing that I keep coming back to in my mind is nature has evolved biological neural networks, us, our brains. And even if we have very, very sophisticated forms of reasoning, all that is an outcome of evolution. No one has sat around wiring our brains up in a certain way. Evolution has discovered it. Evolution has discovered these solutions. Is the architecture of our biological neural networks the same as that in these artificial ones? Absolutely not. There are so many more complications in biological systems and we are nowhere close to approaching that complexity in artificial systems. But our brains are a proof of principle. It's been done once. It's been done by nature, not by us. It's been done over evolutionary time, you know, but yet it's been done. So is there any reason in principle to expect that deep neural networks won't get there? Not an in principle reason. Will it be possible as an engineering thing? Probably not. I don't know. It will require breakthroughs and we don't know what those breakthroughs are yet. [01:29:23] Speaker 2: You recently did a talk, ChatGPT and its ilk, about the theory of mind experiment with Alice and Bob. What does it tell us about the capabilities of ChatGPT? [01:29:32] Anil Ananthaswamy: Yeah, you know, I have played around with ChatGPT asking it theory of mind questions. And even though I know that it's simply doing next word prediction, some of these questions can be posed in very complex ways. And the output it generates seems to suggest it has the ability to model the minds of others, right? 
I mean, but because you know what it's doing, you know, behind the scenes, under the hood, you realize that it couldn't possibly be doing anything more than sophisticated pattern matching, right? But if you just look at the output, there is no denying that if all you had was the output to go by, you would be hard pressed to say that it hasn't got the ability to reason, that it is showing glimmers of being able to reason. So I think that's the problem. If you only look at the behavior and you don't know anything about what's behind, you know, the curtain or under the hood, I don't know how you're going to say it's not reasoning. But once you peek under the hood, once you know what it's doing, you become much more skeptical. And also it's very easy to break the systems, right? You can ask them some very simple reasoning questions and they fail miserably. So it's very clear that they don't have sophisticated reasoning abilities. It's just that sometimes they seem to have that and it takes us aback. [01:31:15] Speaker 2: You spoke about the potential risks of AI, including job disruptions and the entrenchment of societal biases. What steps do you think need to be taken to mitigate these risks? And what are the societal effects of AI? [01:31:29] Anil Ananthaswamy: I think that there are some near term societal effects that we really need to be concerned about. You know, remember that machine learning systems are essentially learning about patterns that exist in the data that we provide. So if the data that we provide has biases built in, you know, let's say you're trying to build a system that analyzes resumes or CVs and, you know, traditional hiring patterns in companies have always been sexist and racist and all of the other concerns that we traditionally have to fight in society. If we teach machine learning systems with data that is inherently biased, they will exemplify those biases.
There is no mystery there, right? And also there's always an assumption in machine learning that, you know, the data that you have trained the system on is drawn from the same underlying distribution as the data you're going to test it on. And if those two distributions are different, you know, let's say your training data was drawn from a certain data distribution, but your test data, the one that you're testing your system on in real life, in the wild, is being drawn from some other distribution, then all bets are off as to what that machine learning system will do. So there are a lot of assumptions that are baked in. So biases that are in the data might get baked into the machine learning systems. Then the problem: it's one thing for humans to make biased decisions. And because we have the ability to question ourselves as humans, we have checks and balances, hopefully, in place where if a human being makes a decision that is seemingly sexist or racist or anything else like that, we have, hopefully, ways in which we can mitigate that. The problem with machine learning systems, and it's not often obvious to people who are using them, is that there is implicit uncertainty or explicit uncertainty in the way these algorithms are functioning, except that when they produce the output, the output is always seen as being certain and the right answer, or, you know, as if there's just only one answer to be had. And under the hood, that's not what's happening. And this lack of uncertainty, or rather putting it differently, this seeming certainty about the answers that machine learning systems provide can be a problem.
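Editor's sketch: the assumption described above, that training and test data come from the same distribution, can be shown failing in a few lines. The data is synthetic and one-dimensional, the classifier is a deliberately crude threshold, and names such as `make_data` and `fit_threshold` are hypothetical helpers for this illustration only.

```python
import random


def make_data(rng, n, mean_pos, mean_neg):
    """Synthetic 1-D data: class 1 ~ N(mean_pos, 1), class 0 ~ N(mean_neg, 1)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        mean = mean_pos if label == 1 else mean_neg
        data.append((rng.gauss(mean, 1.0), label))
    return data


def fit_threshold(data):
    """Deliberately simple classifier: a cut halfway between the class means."""
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(data, threshold):
    """Fraction of points where 'x above threshold' agrees with the label."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(rng, 2000, mean_pos=2.0, mean_neg=-2.0)
same_dist_test = make_data(rng, 2000, mean_pos=2.0, mean_neg=-2.0)
# "In the wild": both classes have drifted by -3 relative to training.
shifted_test = make_data(rng, 2000, mean_pos=-1.0, mean_neg=-5.0)

threshold = fit_threshold(train)
acc_same = accuracy(same_dist_test, threshold)
acc_shifted = accuracy(shifted_test, threshold)

print(f"learned threshold: {threshold:.2f}")
print(f"accuracy, same distribution: {acc_same:.3f}")
print(f"accuracy, shifted distribution: {acc_shifted:.3f}")
```

The classifier is unchanged between the two evaluations; only the data moved. It still returns a single confident label for every shifted point, which is the "seeming certainty" problem in miniature.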
Like, for instance, if you take something like ChatGPT, there are a couple of researchers from UC Berkeley, Celeste Kidd, who's a psychologist, and her colleague, who made the point that when humans interact with large language models, and when they're asking large language models questions, it is in the nature of human psychology that we are at our most vulnerable when we are asking questions and we are receptive to answers. So if you have a large language model that gives you wrong answers, but does so with extreme confidence, which is the nature of its output, then because humans who are asking it questions at that point in time are psychologically receptive to that answer, they will very likely get influenced by these confident seeming answers, right? But once those answers are incorporated into our psychological makeup, we become less able to change our views. It's almost like there was a window of opportunity where we were pliable and willing to take different kinds of answers. And if you have a large language model that's giving you an answer, and it's wrong, and we have no way of telling, we will get influenced because we are receptive at that point in time. So these are all issues that we need to be worried about. [01:35:18] Speaker 2: You have compared the number of connections in a neural network to the number of connections in a human brain. Do you think that this comparison is meaningful? [01:35:28] Anil Ananthaswamy: So the number of connections in the largest large language models today is probably about a trillion. I mean, anywhere from half a trillion to a trillion, or maybe even more now. Compare that to the human brain, which, in a very simplistic account of the number of synapses in the human brain, stands at about 100 trillion. So a large language model, even the largest one, is about two orders of magnitude less in terms of the number of connections that we think are there in the human brain. That's a big number.
But when we talk of the connections in the human brain, we don't take into account a whole bunch of other complexities that exist in the brain. For instance, we don't talk of neurotransmitters, neuromodulators. We don't talk of the fact that there's a whole bunch of computation happening in the dendrites, which are feeding input to the neurons. We don't fully understand what kinds of computations are happening within a single neuron. So there's probably orders of magnitude more complexity in the human brain than we can just infer from looking at the number of connections. So in that sense, large language models are far, far away from being able to capture the complexity of the human brain. But there's a reverse way to look at it, which is that even though the large language models are orders of magnitude away from the complexity of the human brain, they're already able to do some pretty amazing things, right? Now, you imagine a situation where we are able to scale up these artificial systems to the level of complexity of biological systems. Not only do we scale them up, but we somehow make them energy efficient, which right now is proving really difficult. But let's say we're able to make them energy efficient, so that even at scale, they're not consuming, you know, inordinate amounts of power. So we have artificial systems that are approaching the complexity of the human brain, but are also getting more energy efficient. Then couple that to the fact that these artificial systems have access to almost any information that we can feed them. Our human brains are not capable of that. You and I have limited access to information, right? So you take the power of silicon, you take the amount of memory that we can give to these machines, you scale them up to the complexity of human brains. That's what makes me pause and think that we are only just beginning with AI. Speaker 2: Can you tell us about your work in the science of the self?
Anil Ananthaswamy: The second book that I wrote, The Man Who Wasn't There, that book was an exploration of the human sense of self. And essentially, in that book, I look at eight different neuropsychological, neurological conditions. Each of these conditions kind of disturbs our sense of self in a different way. And the entire thesis of the book is that by looking at the different ways in which the self comes apart, and by self, I just mean the way we internally feel about ourselves, the way our body feels to us, the way our stories feel to us, the way we think of us as being here and now, or existing over time from our earliest memories to imagined future, all of that goes into this idea, you know, of being an identity, of being a person, of being this thing that exists in space and time. So the thesis of the book was that, okay, let's look at the ways in which we come apart, not entirely, but, you know, parts of it come apart. And then, can that tell us something about the way this complex thing called our self is put together in the first place by the brain and body? So that was the, you know, impetus for writing that book. It was an exploration of the human self. [01:39:48] Speaker 2: You discussed various neuropsychological conditions that provide insights into the nature of self. Which condition do you find the most intriguing and why? [01:39:58] Anil Ananthaswamy: Well, I had eight different conditions in the book. And honestly, each one of them, because it affects a very different aspect of our sense of self, is both important and intriguing in its own right. Like, so it's really hard to say that any one condition was the most intriguing, but maybe in terms of how otherworldly it was, probably Cotard's syndrome was the most intriguing because, you know, René Descartes, the French philosopher said, "I think, therefore I am." And in Cotard's syndrome, you can almost legitimately make the claim that they can say, "I think, therefore I'm not."
And the reason for saying that is people with Cotard's syndrome actually are convinced that they don't exist. And this is such a deeply felt delusion, which is completely immune to any kind of rationalization. You can't talk them out of it until it resolves. So while it lasts, the delusion is almost unshakable to the point that they will actually start planning their own funeral, right? And we know a little bit about why that might be the case now. Not the funeral planning part, but the fact that they actually think that they don't exist. So there is some neurological evidence to suggest that, you know, there are certain key brain areas that are being affected because of which they feel like that. But to me, the reason why it's intriguing is you can be an "I," the subject of an experience. You can be a self that says, "I exist," but you can also be a self that says, "I don't exist." And it raises the fundamental question, who or what is that "I" that is making that statement? In one case, it's making the statement, "I exist," like Descartes would have said. And in another, you know, in another situation with Cotard, the same "I" is making the statement, "I don't exist," and is equally convinced of not existing, as the former is convinced of existing. [01:42:09] Speaker 2: You spoke about Alzheimer's disease and its effect on our narrative self, which was the terminology you used. How does this inform our understanding of identity and personhood? [01:42:20] Anil Ananthaswamy: I think Alzheimer's disease is probably the most poignant and devastating of these conditions because, you know, if I were to ask you, "Who are you?" You're very likely going to give me a story about yourself. You're going to tell me who you are in the form of a story. And these are stories that we tell ourselves and others about who we are. And these stories change depending on the context. 
You might be a different story with your parent, and it might be a different story with a certain set of your friends, but nonetheless, you know, we are stories. And what Alzheimer's is telling us is that even when these stories disappear, which is what happens in Alzheimer's, because in Alzheimer's, you have short-term memory loss, you don't form short-term memories. So as a consequence, if you just had an experience and that experience never entered short-term memory, the consequence of that is it doesn't enter long-term memory. It doesn't become an episode in your story. So your story kind of stops forming as Alzheimer's sets in. And eventually, Alzheimer's basically destroys your story. You're unable to, you know, you're unable to be your story, whether that story is just cognitive or a story that's in your body, right? Like, for instance, if you're a conductor of an orchestra, you may lose a certain amount of cognitive skills because of Alzheimer's, but there is an aspect of yourself that is embodied, that if you were standing in front of your orchestra, you could potentially just conduct the orchestra without being able to cognitively say anything about it. So there's a lot of self-hood that is embodied, but all of that goes away. And one of the important philosophical arguments for a long time was that the reason why we feel like we are an I, like capital I, the reason why we feel like we are the subject of different experiences is because that sense of the I comes about from these narratives. It's almost like the brain is creating these swirling narratives and we are at the center. But the center is nebulous. It's not there. It only appears to be so because of the narratives. There was this philosopher, the late philosopher Daniel Dennett, who had a beautiful phrase to talk about this. He called the self, the experiencing self, the center of narrative gravity. And it's analogous to the idea that physical systems have a center of gravity. 
Like any physical object has a center of gravity. But if you go looking for the molecule or atom that represents that center of gravity, you won't find anything. It's just a property of the entire system. And so for Dennett, our self was also a property of all these narratives that are swirling around, you know, created by the brain and body. And if you took away the narrative, there would be no I. And it turns out Alzheimer's actually challenges that because in Alzheimer's you do end up losing your entire narrative. But you would be hard-pressed to say that even in end-stage Alzheimer's there isn't somebody still existing who is experiencing just, you know, bodily sensations. Because in Alzheimer's the sensory and motor systems of the brain are still intact. The cerebellum is mostly intact. So even though they can't cognitively recall their stories, even though their bodily selfhood has kind of gotten damaged, it's very likely that there is still somebody out there experiencing just being some minimal aspect of their body, and that I hasn't gone away. So yeah, I mean, by just looking at how the narrative self comes apart, we are understanding that the self is more than just the narrative self. [01:46:19] Speaker 2: You discussed the concept of body ownership in respect of this condition xenomelia. How does that affect our understanding of embodiment and ourselves? [01:46:28] Anil Ananthaswamy: I mean, like all the other conditions in the book, you know, xenomelia, or what it used to be called before, body integrity identity disorder, is telling us that something we take as implicit is actually something that the brain has to construct moment by moment. So if you were to just, you know, look at your arm, you would have no doubt in your mind that this is your arm. There's an implicit sense of ownership of your arm. It's even a silly question to be asking, you know, is this your arm? Of course, it's my arm, right?
I don't think anyone in their right mind would question that feeling. But it so happens that in xenomelia, or BIID, people feel like some part of their body is not theirs. And we now again have some neurological evidence as to why that might be the case. But the point is that in order for us to feel like this arm is mine, the brain has to be constantly doing what it's supposed to be doing, which is imbuing our entire bodily self with a sense of mineness or ownership. And sometimes it fails. Sometimes it fails to do that for the whole body. Sometimes it fails to do that for parts of our body. And when that happens, it can become extremely debilitating because it's almost like some foreign object is attached to your body and you can't bear to have it there. It's like, you know, if you were somebody who was afraid of spiders, and if a spider was sitting on your arm, you would want to take that off. And your entire attention would be focused on that foreign thing sitting on your arm. Now imagine your arm itself was feeling foreign, but there's nothing you can do because it's your arm, it's functional, everything else about it is fine, except that it doesn't feel like your own. It's a very difficult condition to live with. But what it tells you about the self is that things that we take for granted, like the sense of body ownership, are actually something that the brain has to construct, that there is nothing fundamentally real about it. It's just a kind of information processing that's happening in the brain. Sometimes it goes wrong. So you can be someone, you can be an I, the subject of an experience, who experiences an arm as their own, or you can be an I who experiences an arm as not belonging to them. So, again, it comes back to this idea that we still need to explain what the I is. Speaker 2: What is your definition of agency? Anil Ananthaswamy: So in the context of the exploration of the sense of self, agency turns out to also be a construction.
So, you know, we talked about this earlier, where, you know, if you pick up something, you have an implicit sense that you are the agent of that action, and you willed that action into existence, right? It's a feeling we don't question. It turns out that there are brain mechanisms, things that make this feeling come about. It's not something that can be taken for granted. If you're, for instance, performing some action, the brain is sending motor commands to your arm to perform that action. But at the same time, the brain is sending a copy of those commands to other parts of the brain that are now predicting the sensory consequences of the action that you're about to take. And if the sensory consequences that have been predicted match up with what you actually feel, then that whole action is implicitly tagged as being done by you. So the sense of agency is, in this way of thinking, a computation that matches the prediction against what actually happens. And if those two match, you are the agent. If, for some reason, there was a mismatch, that action that you perform will not feel like you did it. This might seem strange, but this is exactly what happens in people with schizophrenia. So they might do the same action, but they won't necessarily feel like they are the agent of that action. So there is a disruption in this mechanism. It's called the comparator mechanism, the mechanism that compares the prediction against what actually happens. And if those two match, that action is tagged as being yours. And hence you have a feeling of being the agent of that action. Schizophrenia shows that that doesn't have to be the case. You can be someone who feels like they're the agent of the action, or you can be someone who feels like they're not the agent of the action that they just performed. So even the sense of agency is a construction. Speaker 2: In this way of thinking, can AI models be agents?
Anil Ananthaswamy: Yeah, if we computationally build this mechanism into AI agents, then we are essentially defining agency as this process. And if we build the necessary computational structure in place, then yes, we endow them with a sense of agency. Though sense of agency still involves this idea that we have a subjective experience of that, that there's an inner conscious experience. And I don't think anyone at this point would claim that AI models, even if you got the computational aspects of it sorted out, are feeling like they have a sense of agency. I don't know where that's going to come from, or how that's going to happen. Because whether or not that happens really depends on your definition of what consciousness is. And that's a different rabbit hole, and a difficult one to get into. [01:52:15] Speaker 2: Anil, it's been a pleasure and an honour having you on MLST. I'm very sorry that I wasn't there on the day. But I hope our paths will cross again, and we can do the interview as a tête-à-tête in the same room. Anyway, I hope you enjoyed the show, folks. Cheers. By the way, now is an amazing time to tell you that we have a Patreon at patreon.com/mlst. It's pretty cool over there. We have a private Discord. We release early access versions of the shows. Many of the best shows that you've just been watching on our channel were there on the Patreon months ago. We have bi-weekly meetings with myself and Keith, and we talk about all of the stuff that we're doing. Of course, you can influence us on interesting guests to invite, etc. So please give us some support. Head over to patreon.com/mlst. Cheers.
