About this transcript: This is a full AI-generated transcript of This is why Deep Learning is really weird. from Machine Learning Street Talk, published June 10, 2026. The transcript contains 20,356 words with timestamps and was generated using Whisper AI.
[00:00:00] Speaker 1: Now, this book is called Understanding Deep Learning to distinguish it from other more practical books, which focus on things like coding. This book is all about the ideas which underlie deep learning. After reading this book, you will be able to apply deep learning to novel situations where there is no existing recipe for success.
[00:00:21] Speaker 2: The book starts straight away by describing deep neural networks, and it takes you through the training and testing pipeline and how we improve their performance. Then it starts talking about different architectures: convolutional networks, residual networks, graph neural networks, transformers. There's a long section on generative models (normalizing flows, VAEs, GANs, diffusion models) and a short section on reinforcement learning. At the end, there are two chapters that I think are really interesting. There's a chapter called Why Do Deep Neural Networks Work, where I try and interrogate a bit why we need this particular kind of architecture, why it's easy to train, why it generalizes. We don't really have answers to those things, but I present some of the evidence there is. And the final chapter is a chapter on ethics. I think the book will be useful if you know nothing about deep learning at all. It will take you from scratch to somewhere close-ish to the cutting edge. If you're teaching deep learning, it'll be an incredibly useful resource. It has 275 figures, most of which are new and represent things in different ways. It also has a whole bunch of Python notebooks. If you're part of the rank and file of machine learning practitioners or researchers, it'll fill in the gaps in your knowledge and maybe make you think about things in a different way. I think even my initial description of deep neural networks is a bit different to how they're usually described and I think you'll learn a lot from it because I am a member of the rank and file of deep learning practitioners and I learned a lot writing it. So I expect you will learn something too. If your name is Geoff Hinton or Jürgen Schmidhuber, I can see it might not be that useful to you.
[00:02:11] Speaker 1: Well, you never know.
[00:02:12] Speaker 2: Oh, you never know.
[00:02:15] Speaker 1: So the title is a little bit ironic because at the time of writing, nobody understands how deep learning models work. Literally nobody. Now, deep learning models, they learn piecewise linear functions. And as you'll know from our episode on the spline theory of deep learning, they chop up the input space into many, many little regions. In fact, most models have more regions than there are atoms in the universe. And frankly, it's a mystery. It's a goddamn mystery. How do these models generalize and how do they learn these functions? Nobody knows. So why does deep learning work? It's remarkable that the fitting algorithm doesn't get trapped in local minima or stuck near saddle points and that it can efficiently recruit spare model capacity to fit unexplained training data wherever they lie. Perhaps this success is less surprising when there are far more parameters than training data. However, it's debatable whether this is generally the case. AlexNet had 60 million parameters and was trained with 1 million data points. However, to complicate matters, each training example was augmented with 2,048 transformations. GPT-3 had 175 billion parameters and was trained on 300 billion tokens. There's not a clear-cut case that either model was over-parameterized and yet they were successfully trained. In short, it's surprising that we can fit deep networks reliably and efficiently. Either the data, the models, the training algorithms or some combination of all three must have special properties which make this possible. The efficient fitting of deep learning models is startling and their generalization is dumbfounding. First, it's not obvious, a priori, that typical data sets are sufficient to characterize the input-output mapping. Second, deep networks describe very complicated functions. And third, generalization gets better with more parameters.
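As a rough sketch of the arithmetic behind these figures, the quoted counts can be turned into parameter-to-data ratios. Treating each augmented image or each token as one "data point" is an assumption made only for illustration, and the variable names are hypothetical.

```python
# Counts as quoted in the transcript.
alexnet_params = 60e6        # AlexNet parameters
alexnet_images = 1e6         # training images
augmentations = 2048         # transformations per training example
gpt3_params = 175e9          # GPT-3 parameters
gpt3_tokens = 300e9          # training tokens

# Parameters per data point under three ways of counting the data.
# A ratio above 1 means more parameters than data points.
ratios = {
    "AlexNet, raw images": alexnet_params / alexnet_images,
    "AlexNet, augmented images": alexnet_params / (alexnet_images * augmentations),
    "GPT-3, tokens": gpt3_params / gpt3_tokens,
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3g} parameters per data point")
```

Counted against raw images, AlexNet has 60 parameters per example; counted against augmented images or tokens, both models have fewer parameters than data points, which is why the over-parameterization case is not clear-cut.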
This surfeit of parameters gives the model latitude to do almost anything between the training data and yet it behaves sensibly. It's neither obvious that we should be able to fit deep networks nor that they should generalize. A priori, deep learning should not work, and yet it does. The success of deep learning is surprising. In his book, Professor Prince discussed the challenges of optimizing high dimensional loss functions and argued that the over-parameterization and choice of activation function are the two most important factors that make this tractable in deep networks. He showed that, during training, the parameters move through a low dimensional subspace to a family of connected global minima and that local minima are not apparent. So, as we over-parameterize these models, generalization increases, but it's also related to other things like the flatness of the minimum and inductive priors. It appears that both a large number of parameters and multiple network layers are required for good generalization, although we do not yet know why. Many questions remain unanswered. We do not currently have any prescriptive theory that will allow us to predict the circumstances in which training and generalization will succeed or fail. We do not know the limits of learning in deep networks or whether much more efficient models are possible. We do not know whether there are parameters that would generalize better within the same model. The study of deep learning is often driven by empirical demonstrations. And Simon concedes that these are undeniably impressive, but they are not yet matched by our understanding of deep learning mechanisms. On ethics, Simon said that it would be irresponsible to write this book without discussing the ethical implications of artificial intelligence. This potent technology will change the world in ways arguably not too dissimilar to electricity or the internal combustion engine, the transistor or the internet.
The potential benefits in healthcare, design, entertainment, transport, education, and almost every area of commerce are enormous. However, scientists and engineers are often unrealistically optimistic about the outcomes of their work and the potential for harm is just as great. Simon argues that everyone studying or researching or even writing books about artificial intelligence should contemplate to what degree scientists are accountable for the uses of their technology. He said that we should consider that capitalism primarily drives the development of AI and that legal advances and deployment for social good are likely to lag significantly behind. We should reflect on whether it's possible as scientists and engineers to control progress in this field and to reduce the potential for harm. We should consider what kind of organizations we are prepared to work for. How serious are they in their commitment to reducing the potential harms of AI? Are they simply ethics washing to reduce reputational risk? Or do they actually implement mechanisms to halt ethically suspect projects? Simon invites readers of his book to investigate these issues further. It's undeniable that artificial intelligence will radically change society for better or worse. However, optimistic visions of a future utopian society driven by AI should be met with caution and a healthy dose of critical reflection. Many of the touted benefits of AI are beneficial only in certain contexts and only to a subset of society. Now, the book cites Green 2019, who highlighted that one project developed using AI to enhance police accountability and alternatives to incarceration and another developed to increase security through predictive policing are both advertised as AI for social good in big air quotes. Assigning this label is a value judgment that lacks any grounding principles. One community's good is another's harm. 
Ethical AI is a collective action problem and the chapter concludes with an appeal to scientists to consider the moral and ethical implications of their work. Every ethical issue is not within the control of every individual computer scientist. However, this does not imply that researchers have no responsibility whatsoever to consider and mitigate where they can the potential misuse of the systems they create. We scuba dive, you know, we actually do stuff and again, maybe I'm just being a human chauvinist, but I'm completely on board with it becoming embedded in our cognitive ecosystem.
[00:09:53] Speaker 2: Well, also GPT does stuff, not very well, but it does it.
[00:09:57] Speaker 1: Well, does it though? I don't know. Does it do something? Yeah, like in the way a cat flap does stuff, you know, you can make it do things that can execute code and so on. But I wouldn't say it has agency. Hello, everyone. It's Tim here from MLST, your go to channel and podcast for all things machine learning, AI and philosophy. Today I'm reaching out with a special request. As you know, creating content for MLST takes a considerable amount of time and resources. It's a labor of love that I do purely for the fun and passion of the subject. But to keep bringing you the high quality content that you've come to expect, I need your support. Please consider supporting us on Patreon. Every bit helps, although please only donate what is an insignificant amount of money depending on your situation. If you can't afford it, just let me know and I will give you free access to the Patreon's benefits, no questions asked. Anyway, thank you so much for your time and I can't wait to welcome you to our Patreon family. Signing off for now. Cheers. Simon, it's an absolute honor to meet you. Welcome to Machine Learning Street Talk.
[00:11:07] Speaker 2: I'm very glad to be here.
[00:11:09] Speaker 1: So tell us about yourself.
[00:11:11] Speaker 2: Well, I actually started my career in psychology. My PhD is in psychology and then I've been wandering through various parts of science. I worked in neuroscience for a while. I did some early work in augmented reality. I dabbled in medical imaging. In the noughties, I was faculty at UCL and I worked on computer vision. And I'm probably best known for a book that I wrote at that time. And in the last decade, I've mainly been working in industry, in finance and computer graphics. And I'm currently a professor at the University of Bath, where I have been working on a new book, Understanding Deep Learning, which is going to be published by MIT Press. Sorry, I should say, which is published.
[00:11:58] Speaker 1: Well, Simon, we were joking before that on your last book, you were Ubered. And you were writing a book about computer vision and probabilistic graphical models and so on. And then that was that Sutskever guy. We'll come back to him later. Well, it wasn't him, actually. It was Krizhevsky. But they were all basically Hinton's guys. They released AlexNet, didn't they? And then that was computer vision completely solved.
[00:12:23] Speaker 2: Right. So my last book was a really ambitious attempt to basically remold the whole of computer vision as I saw it by formulating what was quite an ad hoc selection of methods in terms of probabilistic graphical models, which wasn't at the time necessarily how everybody thought about it. And in 2010, I went on sabbatical to the University of Toronto. And I worked on this book, probably with Alex Krizhevsky in the next room to me, sharing an office with Geoff Hinton's postdoc, doggedly worked on this book. I released it in 2012, a few months before AlexNet came out, and the entire field took a right-angle turn, leaving my book in the dust. Although I still think it's quite useful, the geometry and stuff is still definitely all valid, and it has a lot of stuff on Bayesian probability. But this book is less ambitious. It's a more sort of straightforward description of where we are with deep learning. It's supposed to be the spiritual successor to Goodfellow, Bengio, and Courville, which was published in 2016. So obviously a lot of stuff has happened since then. It takes a sort of pragmatic middle ground between very theoretical stuff with lots of proofs and very practical stuff with lots of code. There are no proofs. There is no code. It's about the ideas that drive deep learning.
[00:13:52] Speaker 1: OK, so Simon, when you started writing this book, what was the main idea that you had in your mind?
[00:13:59] Speaker 2: Well, I think the history of deep learning is that experimentalists have run far ahead of the theory, and we now have this explosion of papers where there is literally an exponential increase in the number of papers being published. And when I say literally, I mean literally, there is a plot in a paper that came out last year where on semi-log Y, it is a straight line with 4,000 papers being pushed to arXiv every month. Yes. And presumably there is more than that now. Obviously it can't increase exponentially forever, there is a finite number of humans on the planet, they can't all do machine learning research. So there is a staggering quantity of information out there. If you are a new person coming to machine learning or you want to learn something new, it's almost impossible to find good resources. People are learning things from blogs that are written hastily by people who don't always know what they are talking about. And so it seemed to me a really useful thing I could do for the community where I could write out everything pretty much important that's happened in the last 10 years connected to deep learning with the same notation illustrated in a modern way without regard for history. I don't start at the perceptron, I jump right in to deep neural networks and you know, 20 pages into the book, not 160 pages, just to save collectively the community a giant amount of time.
[00:15:29] Speaker 1: Yes. And do you think that deep learning is alchemy?
[00:15:34] Speaker 2: No, it's not alchemy at all. I think in the future, what we'll think of it as is the science of modeling functions and probability distributions in very high dimensions. I mean, I think it'll be recast as that in terms of science. At the moment, the way we organize our whole community is about results. So we don't really talk about it that way. But I think in 40 years' time, they'll look back and say, well, in the 2010s, they studied how to build functions and how to model probability distributions in dimensions that are higher than, say, 50.
[00:16:09] Speaker 1: Yeah. So I was tongue in cheek on the alchemy point. But I guess number one, people have made some strange analogies to neuroscience and biology. Even the word neuron actually is quite interesting. So it only appears four times in your entire book. And two of those occasions, you are counseling us not to use the word neuron. So let's start with that.
[00:16:31] Speaker 2: Yeah. And I should say, I will probably accidentally use the word neuron during this conversation because it's so embedded in our community. But I think it's a terrible analogy. There's no evidence the brain works in any way that deep neural networks do. If you look at the sort of epiphenomena of the brain, you know, of human computation, we seem to have things like short-term memory. We seem to need to dream, to lay down memories. We have a modular brain with special parts for recognizing faces, for navigating through the world and so on. There's no evidence that deep learning has any of that. And likewise, there's no evidence that the brain has any of the epiphenomena of deep learning. So, you know, there's no evidence for double descent or adversarial examples or lottery tickets in the brain, as far as I'm aware. And this is sort of okay for our community because we know what we're talking about. But now deep learning is becoming a really important thing in the real world. And so we're trying to communicate this to the general public. And we're talking about neurons and neural networks. And that carries with it a lot of baggage. You know, an interesting experiment that everybody watching this should do is go and talk to somebody at a dinner party who knows nothing very much about AI, who works, you know, a lawyer, someone intelligent, who works in a completely different field and ask them what their understanding of current AI is. And almost certainly the answer you will get is that they have no idea. They might be able to give you a couple of buzzwords, but how does it work? They have no idea. And I really disapprove of the neural metaphor just because it comes with a lot of baggage. That sort of implies perhaps that the network's having thoughts or that it's something like us. And that's deeply misleading to people who are outside our community. And of course, everything we're doing is increasingly affecting those people outside our community.
And we want to give them sensible information about what it is that we're working on.
[00:18:34] Speaker 1: Yeah. So, I mean, again, on the alchemy point, we are now dealing with multiple levels of emergence. And what I mean by that is people understand gradient optimization. They understand parameterized models. They don't understand the emergent phenomena and they reach for analogies, let's say psychology analogies. You know, they talk about things like theory of mind.
[00:18:55] Speaker 2: I'm not completely convinced there are emergent phenomena, depending on exactly how you define emergent phenomena. For me, an emergent phenomenon would be something where you gradually ramped up the scale and there was suddenly a phase change where you could see new phenomena, I suppose, for want of a better word, suddenly appear with scale. And I'm not sure we've done those experiments super thoroughly. What you're really saying to me is that the statistics of the data on the Internet are surprisingly rich, such that we can, you know, complete sentences or translate from one language to another. All of that is just reflecting the statistics that are on the Internet. And it is surprising that when you gather that much data and put it together in a network and reproduce those statistics, we see these phenomena. But I don't see that as a property of the network. I see that as a property of the data that we're putting into the network.
[00:20:00] Speaker 1: Whether it came from the richness of the statistics or not, I think we are looking for mental reference frames to understand said phenomena. It's just we need a way to understand this stuff, right? And psychology has a theory of mind, literature, that we can use. And we are almost certainly overloading it and bastardizing it. But are you saying we shouldn't do that?
[00:20:23] Speaker 2: I have the most reductionist view of this possible. I think of when I see a large language model, I see an enormous equation with trillions of terms in it. And we've set the parameters of those terms so it performs some kind of behavior. And I don't think there's any great meaning to that behavior. It's an equation. There are inputs. You compute, you add things, you multiply things occasionally. If it's a transformer, you take an exponential. Yeah. And out comes some numbers at the other end that you then translate into words. And I don't think it can be understood on any deeper level than that.
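To make the "add, multiply, occasionally take an exponential" description concrete, here is a minimal sketch of one scaled dot-product self-attention step in NumPy. It is not the computation of any particular model: the shapes, the random weights, and the absence of masking or multiple heads are all simplifying assumptions.

```python
import numpy as np

def softmax(z):
    # The one place an exponential appears; subtracting the max is for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Multiplications and additions: project the inputs to queries, keys, values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Dot-product similarities, scaled by the square root of the key dimension.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Each output row is a weighted average of the value rows.
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens, 8 dimensions each (hypothetical sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
output = self_attention(X, Wq, Wk, Wv)
print(output.shape)
```

Every operation here is a matrix product, an addition, a division, or an exponential, which is the sense in which the whole model can be read as one very large equation.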
[00:21:01] Speaker 1: But couldn't you make the argument then that there's nothing special about our mental states or our behavior because we are also at some level just performing simple calculations?
[00:21:12] Speaker 2: I think you could make that argument that there's more going on because we have a larger variety of brain systems interacting with each other, some of which have been built on top of each other during evolution. So there's more structure in the actual model. You know, it's not, at the very least, the human brain is not one equation. It's a bunch of equations interacting with each other in a complicated way. So whether that maps to the kind of constructs you're talking about, I don't know. But I see the human brain as being quite a different thing to that.
[00:21:46] Speaker 1: It's mostly a matter of complexity. So when you have this rich kind of functional dynamics of things interacting in the physical world, you have the emergence of agency and all sorts of moral status and so on. And you're basically just making the statement that in neural networks, we're just nowhere near to that kind of emergence.
[00:22:05] Speaker 2: I mean, you're asking me questions that nobody knows the answer to. The only existing model we have that works is the human mind. And that seems to work upon quite different principles than just scale.
[00:22:19] Speaker 1: Yes.
[00:22:21] Speaker 2: I think something that's very interesting about transformers, though, is I was arguing before that I don't like this neural analogy. But actually, the large language models like ChatGPT are the closest thing we have to something like the human brain, because they don't, they do just, as I expressed, map an input to an output. But they also have this kind of short-term memory, which is the context window. So in a sense, and it's ironic that we don't refer to that in terms of neural analogy, because that to me is the first thing we've built that sort of looks something like the human mind. So in a sense, you have this context, you have this context, and it makes the next prediction of the token based on that context. And in principle, you could then operate on the previous context, the, you know, the system itself could operate on the previous context. It could summarize it. It could file things away. It could ask itself to generate other, you know, different hypotheses to explain something and compare them and decide on something. And use that as a sort of scratch memory in the same way that we have a working memory. But strangely, we don't refer to that in terms of the neural analogy, which I find quite ironic. I don't know if people are working on that kind of thing. I assume that people are working on everything, but I haven't personally read any work in which the Transformer system goes back and edits things in its past context. Yes. But I assume that that would be one direction that you might go to try and make this system do something that's more like thinking. I mean, in the end, a purely feed forward system can't really do anything sophisticated. You need probably to be doing some kind of manipulation, if only to generate internal consistency. So there's no way a large language model can have internal consistency. It's learned everything on the Internet. 
It thinks that the Earth is both flat and round with different statistical proportions. You know, hopefully, mostly most people on the Internet think it's round and that's the conclusion it's come to. But in the weight somewhere is Earth flatness. And so to get to another level of cognition, you're going to need something that builds an internally consistent model of what's out there. Whether that needs, as you might argue, interaction with the real world or whether that can be done purely in the domain of language remains to be seen. But I think that might be one direction that people would take things.
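The loop being described, predict the next token from the context, append it, and repeat, can be sketched with a toy model. The bigram count table below is only a stand-in for a language model, and the three-sentence corpus is invented for illustration; it does show how conflicting statements survive as different statistical proportions.

```python
from collections import Counter, defaultdict, deque

# Invented corpus: "round" appears twice as often as "flat".
corpus = "the earth is round . the earth is round . the earth is flat .".split()

# Count which token follows which (a toy stand-in for learned weights).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(context):
    """Toy model: condition only on the most recent token in the context."""
    followers = counts[context[-1]]
    total = sum(followers.values())
    return {token: n / total for token, n in followers.items()}

def generate(prompt, n_new, window=4):
    # The bounded deque plays the role of the context window.
    context = deque(prompt, maxlen=window)
    output = list(prompt)
    for _ in range(n_new):
        dist = next_token_distribution(context)
        token = max(dist, key=dist.get)   # greedy: pick the most likely token
        output.append(token)
        context.append(token)             # oldest token drops out when full
    return output

print(next_token_distribution(["the", "earth", "is"]))
print(" ".join(generate(["the", "earth"], n_new=2)))
```

Both "round" and "flat" remain in the table with probabilities 2/3 and 1/3; nothing in the loop reconciles them, and nothing goes back to edit the earlier context.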
[00:24:59] Speaker 1: You know, the body of knowledge of humans is a kind of virtual phenomenon that supervenes on all of us physical Earthlings. So, you know, like this infosphere that we've created, it's like a symbiotic organism. And that has consistent artifacts of knowledge, as you said. But many humans do hold the view that the Earth is flat. It's just another example of this interesting kind of like levels of emergence.
[00:25:28] Speaker 2: But they hold an internally consistent view that the world is flat. I mean, as far as they're concerned, it's internally consistent. Obviously, there are inconsistencies that are quite easily proven. But within their mind, they explain, they have a model of the world whereby they explain everything. Well, if the moon is flat, well, obviously, the sun must be flat as well. And so must. And, you know, when you look at the horizon of the sea, it looks flat. And consequently, the Earth can't be round. They, you know, they explain away other phenomena and build up a model that backs up their hypothesis. And there's no sense that a transformer system is doing anything like that. It just starts at the beginning and predicts the next word and has the statistics that are consistent with what's previously in its context window.
[00:26:19] Speaker 1: So, yes, you could argue, though, that humans, our brains are also very chaotic, but we have this confabulation and post hoc rationalization in much the same way. So we, you know, subconsciously, we hold conflicting views. But when we try and explain our views and to avoid cognitive dissonance, we kind of we try and reduce what we think to something simple.
[00:26:41] Speaker 2: Right, but we have a finite number of views that are sort of partially rounded theories of the world. Yes. The large language model has everything that humanity's ever created with no preference for one thing or another other than its statistical likeliness. Yes, yes. So that's still, you know, even if you have multiple conflicting views of the world, you know, there's that famous Walt Whitman poem where he says, "Do I contradict myself? Very well, I contradict myself. I am multitudes. I am large." You know, that is human beings captured, but we don't have every view on everything simultaneously. We're trying on some level to come up with consistent models of the world. And we need to do that because we need to take actions in the world. And it's impossible to do that if you have 50,000 conflicting views of how things work.
[00:27:32] Speaker 1: And this is really interesting because Hinton says one of the reasons why ChatGPT is a kind of super intelligence is because it knows all of the things. But I would argue, as you just did, I think that we are kind of bounded as observers, as there's a computational kind of restriction to how many things a single observer can understand at one time. And we'll get more into this later, but I think with cognition, it's not just knowing, it's also thinking. So just knowing everything isn't actually the whole piece, is it?
[00:28:02] Speaker 2: Yeah. Can you deduce new facts? I think in one of your other podcasts, you talked about if you trained ChatGPT with data only up to, you know, the early 20th century, would it be able to reproduce Einstein's theory of relativity? I think we all know the answer to that, it wouldn't.
[00:28:21] Speaker 1: Definitely not.
[00:28:21] Speaker 2: And what are the missing pieces? But, you know, that's getting at what I was saying before. It's true to build that theory, you need to have a model of the world and you need to realize that model of the world is wrong. That certain facts, I don't personally believe you need to observe those facts yourself, but certain facts are inconsistent with that theory. And then you need to somehow come up with a new model that itself will make new predictions about the world that people can go and test in the case of physics. But I think that happens on a sort of more minor scale, you know, with your theory about how businesses work or how your friend's personality works and how best to interact with them. You know, you have theories about everything that occasionally break and you have to radically rethink them.
[00:29:10] Speaker 1: From a computational point of view, they are finite state automata.
[00:29:14] Speaker 2: I mean, you're saying any finite entity can only compute certain things by nature of the fact that it's finite. Eloquently put. Of course, that is true, but I'm not sure that that's a radical insight. I mean, what would be interesting is if we had some way to characterize what kind of things you could compute with a certain number of parameters or what have you. But as far as I know, we don't really know the space of functions between input and output that we can fully describe given a fixed set of parameters and a certain neural network architecture, or what lies outside that, because you're building this very complicated surface in multidimensional space and everything's dependent on everything else. So it's not really obvious, but it seems it's very rich in that we give it almost any data set we want and with enough capacity it can fit it.
[00:30:15] Speaker 1: Well, I think we're just about to put the pin in the middle of the dartboard here, which is that and we'll talk about universal function approximation later, which is that given an infinite number of neurons, you can approximate a function to arbitrary precision. So that's a little bit like saying if I have an infinite size hard drive, I can, you know, store any file that I want to. So the big discussion between connectionists and symbolic folks is the symbolic folks argue that we do need an infinite amount of computation for many things. Neural networks have told us that in many cases, no, we don't. And then I guess it's just about, well, are there boundary cases where we do need an infinite amount of computation?
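The approximation claim can be illustrated with a hand-built example. The sketch below constructs a one-hidden-layer ReLU network that linearly interpolates a target function at evenly spaced knots, and shows the worst-case error shrinking as hidden units are added. The target (sine on one period) and the function names are assumptions for illustration; no training is involved, and this is not a proof of the general theorem.

```python
import numpy as np

def shallow_relu_interpolant(f, a, b, n_hidden):
    """One-hidden-layer ReLU network that interpolates f at n_hidden + 1 knots."""
    knots = np.linspace(a, b, n_hidden + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)      # slope on each interval
    # Each hidden unit switches on at a knot and changes the slope there.
    out_weights = np.diff(slopes, prepend=0.0)
    bias = values[0]

    def network(x):
        x = np.asarray(x, dtype=float)
        hidden = np.maximum(0.0, x[..., None] - knots[:-1])   # ReLU(x - knot)
        return bias + hidden @ out_weights

    return network

def max_error(n_hidden, f=np.sin, a=0.0, b=2.0 * np.pi):
    network = shallow_relu_interpolant(f, a, b, n_hidden)
    xs = np.linspace(a, b, 2001)
    return float(np.max(np.abs(f(xs) - network(xs))))

errors = {n: max_error(n) for n in (4, 16, 64)}
for n, err in errors.items():
    print(f"{n:3d} hidden units: max error {err:.4f}")
```

The error can be pushed as low as desired by adding units, but only at the cost of an ever larger network, which is the point of the infinite hard drive comparison.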
[00:30:59] Speaker 2: I mean, you know, an interesting question you're asking in different ways. Are there ideas that the human mind can never grasp?
[00:31:07] Speaker 1: Or indeed that we can grasp that a neural network can't. Simon, we've now teleported to our studio. We were just outside freezing, freezing ourselves to death.
[00:31:21] Speaker 2: It is much warmer and less muddy in your studio.
[00:31:24] Speaker 1: Indeed it is. So, Simon, we'll get straight to the chase. I've been reading your amazing new book on deep learning. And I think we should start by talking about the first few chapters, really, in particular, chapter three and four, where you talk about not only neural networks, both, you know, deep neural networks and just single layer neural networks. But there's this kind of elephant in the room in general, I feel, about neural networks, which is why do they work so well? You know, the unreasonable effectiveness of neural networks. Why do they work so well?
[00:32:00] Speaker 2: Well, the good news is they do work really well. If you've been asleep for the last 11 years or so, they're incredibly effective. But it is a bit of a mystery. And I think it's particularly interesting to look at it from the perspective of just before AlexNet came out. So ImageNet at that time would have been considered a real stretch goal. Richard Szeliski wrote in the computer vision book he published around that time that he expected it to be years and years before computers could see as well as a two or three year old. And so ImageNet is a challenging task. The input is a 224 by 224 image, which gives you a roughly 150,000 dimensional space. So there's roughly, you know, it works out as roughly 10 to the 150,000 possible images. You might say, OK, well, most of those images are noise. But presumably the actual manifold of images is still extremely large. And you've got to build a model that maps this to one of a thousand classes. And you only have a million examples in this 10 to the 150,000 dimensional space. And for each of those classes, you only have a thousand possible examples. So you can't even build a Gaussian for each class in that case. So if you didn't know that humans could do this task, you might even just give up. But since we knew humans could do this task, people persevered. And AlexNet set out on this extremely ambitious program that was completely different from what most people were doing at the time. I know that there's arguably some predecessors. I don't want to get into that. I will talk about AlexNet and you can judge for yourself whether it's the right thing to talk about. So AlexNet sets out to build a neural net with 60 million parameters, several orders of magnitude larger than most of the computer vision models that would have been built at the time.
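The numbers in this passage can be checked with a few lines of arithmetic. The sketch assumes a 224 by 224 RGB image with 8-bit channels and a full-covariance Gaussian per class; those specifics are assumptions for illustration, not stated in the conversation.

```python
import math

# Input dimensionality of a 224 x 224 RGB image.
height, width, channels = 224, 224, 3
dims = height * width * channels
print("input dimensions:", dims)

# Number of distinct images, as a power of ten, assuming 256 levels per value.
levels = 256
log10_images = dims * math.log10(levels)
print(f"possible images: about 10^{log10_images:,.0f}")

# Training examples available per class.
examples_per_class = 1_000_000 // 1_000
print("examples per class:", examples_per_class)

# Parameters of one full-covariance Gaussian: a mean vector plus a symmetric matrix.
gaussian_params = dims + dims * (dims + 1) // 2
print(f"parameters per Gaussian: {gaussian_params:,}")
```

With 8-bit channels the count comes out near 10 to the 360,000, even larger than the quoted figure, which corresponds to roughly ten levels per value. Either way, a thousand examples per class is nowhere near enough to estimate the billions of parameters in a single full-covariance Gaussian.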
And the way that I think about neural networks is they divide the input space into convex polytopes, each of which has a linear or more properly affine function within it. So if the input space is two-dimensional, then it divides the input space into convex polygons and each polygon is a sort of plane and those planes are organized so they make a continuous surface. But now we've got 150,000 input dimensions. So you're in very high dimensions and we've got this incredibly, our model makes this incredibly complicated surface. And it's difficult to count the number of these polytopes it creates, but just the fully connected layers at the end would make of the order of, you know, 10 to the 4,000. So you've got this huge space. You've got a model with 60 million parameters that creates a number of regions far larger than the number of atoms in the universe. So one of, you know, hardly any of those regions are ever going to see a training data point or a test data point at any point during training. And there's some super complicated relationship where you tweak one of your 60 million parameters and these much larger number of polytopes change and manipulate it in some very complicated, indirect way that's difficult to characterize. So now you come along and say, okay, I'm just going to do a 60 million dimensional optimization problem. At that time in computer vision, you know, people were considered ambitious if they were doing thousand dimensional optimization problems. They, you know, typically the biggest kind of problems would have been structure from motion problems. So you would probably try and find the solution in closed form, an approximate solution, and then just use the nonlinear optimization. You're already somewhere near the local, the global minimum, and you're just going to find, use optimization to get to the final place. 
But they're going to start by just randomly initializing this network and then using the dumbest algorithm, basically noisy gradient descent to get to the bottom. Because you can't use anything else because everything else uses second order information. And now the number of parameters is too enormous.
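A minimal sketch of the polytope picture described above, assuming a hypothetical toy network (2 inputs, 8 ReLU hidden units) rather than anything like AlexNet. It samples a grid of inputs and counts the distinct on/off patterns of the hidden units; each pattern corresponds to one convex polygon on which the network is affine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: 2 inputs -> 8 ReLU hidden units.
W1 = rng.normal(size=(8, 2))
b1 = rng.normal(size=8)

def activation_pattern(x):
    """Return which hidden units are active (pre-activation > 0) at x."""
    return tuple((W1 @ x + b1 > 0).astype(int))

# Sample a dense grid over a square of input space and collect the
# distinct on/off patterns; each distinct pattern is one convex polygon.
grid = np.linspace(-3.0, 3.0, 200)
patterns = {activation_pattern(np.array([u, v])) for u in grid for v in grid}
num_regions_found = len(patterns)

# 8 lines in general position cut the plane into at most
# 1 + 8 + C(8, 2) = 37 pieces.
max_regions = 1 + 8 + 8 * 7 // 2
print(num_regions_found, "regions found on the grid; upper bound", max_regions)
```

The grid only covers a finite square, so it can miss small or distant regions; the count is a lower estimate, while 37 is the exact ceiling for one hidden layer of 8 units in two dimensions.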
[00:36:30] Speaker 1: Sounds like a disaster.
[00:36:31] Speaker 2: It does sound like a disaster. I don't think many people would have predicted you could even learn the model. But one thing they have in their favor, and we can come back to this in a bit, is arguably there are far more parameters in their model than there are data points. So it's over-parameterized, and perhaps that makes the fitting easier. And there's a subtlety there that we can come back to. Yeah. So, okay, they fit the model. But as I said, there's 60 times more parameters than there are data points. So the surface they fit basically goes through every data point exactly. But for every data point it goes through, it can do 60 other things, 60 more degrees of freedom. So between those data points, which, as we've discussed, is almost the entire of the space, it has latitude to do whatever it wanted. And actually, they had applied some regularization and some dropout. But subsequently, we found that those aren't really critical to this. So at the time, you might have explained it away with that. But since then, we know we can learn models without dropout and without regularization. And they still generalize reasonably well. And somehow this model performs 15% better than the next best thing, or whatever it was. Quite surprising. I mean, jaw-dropping, really.
[00:37:51] Speaker 1: Yeah. Well, so let's do a point of order just on that then. So first of all, I'm a huge fan of this view of neural networks that you've just spoken about, which is that essentially it slices and dices the ambient space into these locally affine polyhedra. And Randall Balestriero came up with this spline theory of neural networks. And folks should watch the first LeCun show where we spoke about that in detail. And what you basically said was that for a given input example, the neural network can be represented with a single affine transformation, which is surprising to a lot of people for two reasons. First of all, people think of neural networks as being, you know, non-linear. And also, it gives this beautiful intuition of neural networks as being a collection of storage buckets, not too dissimilar to a locality-sensitive hashing table. And I think that's very instructive.
[00:38:43] Speaker 2: Right. And so I wrote this book blissfully unaware of that theory, I should say, but that's pretty much exactly how I describe it in this kind of simple, constructive way. How do we combine together these networks? We should say that this is only true for ReLUs and leaky ReLUs and parametric ReLUs. If you have smoother functions, then you can no longer characterize it in this way. But actually, I like discussing it in terms of ReLUs because then the number of output regions gives you some notion of the complexity of the output surface. Whereas once you start talking about smooth functions, it's much harder to characterize that.
[00:39:21] Speaker 1: But it's kind of locally smooth because I guess for the folks at home, every single one of these neurons will, you know, maybe come back to that. It's kind of like a hyperplane. And so you train the network and you move all of these hyperplanes around in the ambient space. And essentially, when you put an input in, it's kind of activating some of those hyperplanes and it's creating this kind of convex region.
[00:39:49] Speaker 2: Right. According to which hidden units are active or not active.
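The claim that a ReLU network reduces to a single affine transformation for a given input can be sketched as follows. The network sizes and random weights are hypothetical; the idea is only that once you record which hidden units are active, each ReLU becomes a fixed 0/1 mask and the layers collapse into one matrix and one offset.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy network: 5 inputs -> 16 ReLU -> 16 ReLU -> 3 outputs.
W1, b1 = rng.normal(size=(16, 5)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)
W3, b3 = rng.normal(size=(3, 16)), rng.normal(size=3)

def forward(x):
    h1 = np.maximum(0.0, W1 @ x + b1)
    h2 = np.maximum(0.0, W2 @ h1 + b2)
    return W3 @ h2 + b3

def local_affine_map(x):
    """Return (A, c) with network(z) = A @ z + c for every z in x's region."""
    pre1 = W1 @ x + b1
    d1 = (pre1 > 0).astype(float)          # on/off mask of layer 1
    h1 = d1 * pre1
    d2 = (W2 @ h1 + b2 > 0).astype(float)  # on/off mask of layer 2
    # With the masks held fixed, every layer is linear, so they compose.
    A1, c1 = d1[:, None] * W1, d1 * b1
    A2, c2 = d2[:, None] * (W2 @ A1), d2 * (W2 @ c1 + b2)
    return W3 @ A2, W3 @ c2 + b3

x = rng.normal(size=5)
A, c = local_affine_map(x)
affine_matches_at_x = bool(np.allclose(forward(x), A @ x + c))

# A tiny step normally stays inside the same polytope, so the same
# (A, c) still reproduces the network output there.
x_near = x + 1e-6 * rng.normal(size=5)
affine_matches_nearby = bool(np.allclose(forward(x_near), A @ x_near + c))
print(affine_matches_at_x, affine_matches_nearby)
```

The pair (A, c) changes only when an input crosses into a region with a different activation pattern, which is the sense in which the hidden units that are "active or not active" select the polytope.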
[00:39:53] Speaker 1: Yes. And even though these are piecewise linear functions, there are so many of them kind of like, you know, densely overlapping each other in space. They appear locally smooth.
[00:40:05] Speaker 2: Presumably. Well, we can plot like one dimensional plots through the function and see what it looks like. And of course, it does look smooth because of the sheer number of them.
[00:40:17] Speaker 1: Right. But then we get to the next interesting bit, which is, you know, so you spoke about over parametrization and, you know, like finger in the air. Let's say that there's an exponential number of these convex polyhedra, you know, something along the lines of, I think you said two to the power of the number of dimensions.
[00:40:35] Speaker 2: And well, generally, it would be more than that. I was trying to get lower bound that nobody could disagree with.
[00:40:41] Speaker 1: But this gets to the fascinating thing. So traditionally, before neural networks, the world was different. And we used to talk about the curse of dimensionality. And that's basically this statement that there is an exponential relationship between the volume of the space and the number of dimensions.
[00:40:58] Speaker 2: Yeah, I would say it's the tendency of the volume of space to completely dwarf the amount of data you have as the number of dimensions goes up.
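A rough back-of-the-envelope sketch of that statement, under the assumption (not from the conversation) that each input dimension is coarsely split into 10 bins. The number of cells is then 10 to the power of the dimension, and a million data points can occupy at most a million of them.

```python
import math

def log10_fraction_of_cells_occupied(num_points, dims, bins_per_dim=10):
    """Upper bound, in log10, on the fraction of grid cells holding any data."""
    log10_cells = dims * math.log10(bins_per_dim)
    return min(0.0, math.log10(num_points) - log10_cells)

# 150_528 = 224 * 224 * 3, the "roughly 150,000" input dimensions of ImageNet.
for dims in (1, 2, 6, 10, 100, 150_528):
    print(dims, log10_fraction_of_cells_occupied(1_000_000, dims))
```

With six dimensions a million points could in principle touch every cell; by ten dimensions they can reach at most one cell in ten thousand, and the bound keeps shrinking by a factor of ten per extra dimension.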
[00:41:06] Speaker 1: Exactly. So when the volume of the space does increase exponentially, the statistical significance of your training data kind of tends towards zero. So there is no statistical information anymore, which begs the question, if there's no statistical information, how do the networks work?
[00:41:25] Speaker 2: And now we're getting to the difficult part. Why does it generalize in any way? Yeah. It's clear that it can pass through every data point. But what it does between the data points is, I mean, in some ways, almost happenstance, it's a byproduct of our algorithms. It makes some smooth interpolation. Why that's a good smooth interpolation, we don't know. Why it even becomes smooth, we don't know. It seems like it's some complicated byproduct of the way we initialize the neural networks, the noisy algorithms we use to train the neural networks, the overparameterization itself. You know, perhaps the kind of thing that might be going on is we wind up with smooth networks that interpolate just because we overparameterize so much that when we set up our networks, we initialize them, and the wider they are, the smaller numbers we put in. And the numbers start small and they stay small because it never has any impulse to make them larger. And because the numbers in the network have small magnitudes, that basically maps to small slopes. I want to slightly come back to the overparameterized thing as well, because it's not totally obvious to me that everything is clearly overparameterized. So while AlexNet had 60 million parameters and 1 million training points, they also augmented their data by a factor of 2048. And if you look at all of the papers, pretty much, on this ImageNet classification problem, they're all sort of overparameterized by the best results in the last 10 years by sort of 10 to 100 times. But they also all augment the data, which muddies the water. So it's one way to think about it is that it's more data points and we're not overparameterized. But of course, they're not independent data points anymore. They're translations, rotations, color transforms of the input image. So I don't really quite know where we are with overparameterization.
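The arithmetic behind that caveat can be written out as below. The parameter count, dataset size, and factor of 2048 are the figures quoted in the conversation; the breakdown of 2048 into crops and flips is an assumption taken from the AlexNet paper, not from the conversation.

```python
# Figures quoted in the conversation.
num_parameters = 60_000_000
num_training_images = 1_000_000
augmentation_factor = 2048

# Assumption (from the AlexNet paper, not the conversation): 224x224 crops
# taken from 256x256 images, plus horizontal flips.
crop_offsets_per_axis = 256 - 224
assert crop_offsets_per_axis ** 2 * 2 == augmentation_factor

params_per_image = num_parameters / num_training_images
params_per_augmented_image = num_parameters / (
    num_training_images * augmentation_factor
)
print(params_per_image)            # 60.0
print(params_per_augmented_image)  # about 0.029
```

Counted against raw images the model has 60 parameters per example; counted against augmented images it has far fewer than one. Since the augmented images are strongly correlated, neither ratio is clearly the right one, which is the ambiguity being described.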
[00:43:42] Speaker 1: Yeah, well, I want to get to the inductive priors in a minute, because I do think of these local affine polyhedra as being buckets. And then we get to, well, if you perform some semantically equivalent transformation, like a translation or a rotation, as far as the MLP is concerned, it's a different thing. So these inductive priors basically photocopy the information so that you're putting the same thing in different buckets, so that you're kind of like cheating. Maybe we'll get to that in a sec, but just to finish the loop on this generalization thing. What we're doing is we're complexifying the neural network. And this is really weird, because we were taught in school that Occam's razor is, you know, like simple things generalize. And now we are exponentially complexifying the networks. Why do they still generalize?
[00:44:37] Speaker 2: Well, it's interesting how the theory can't really cope with that. So previous ideas of theory, like Rademacher complexity, would have predicted the opposite. It would have predicted that as we added more parameters, the preferred generalization would have got worse. I mean, the kind of loose idea people have when they talk about double descent and stuff is there's more regions just means that you can model a smoother function. It has to go through the data points and it's smooth elsewhere. And we don't have a much better conception of it than that. You know, for things like images, we can put in some prior knowledge about images that the statistics are the same everywhere, use convolutional networks. So we're searching through a more sensible subset of models. You know, convolutional network is a sort of strict subset of a fully connected network. So we're searching through a more sensible set of models that have some prior knowledge we've put in. But that still doesn't really, in a satisfying way to me, explain why networks generalize so well.
[00:45:43] Speaker 1: Can we bring in this notion of the manifold hypothesis? How does that, because even before you were saying something really interesting, which is that everything's an inductive prior. You know, even the world we live in is an inductive prior, but just keep, you know, keeping the philosophy out of the discussion, data is definitely an inductive prior. So you can do data augmentation and you can kind of oversample effectively, you know, in the vicinity, or you can tell the neural network that doesn't know about certain types of transformation. You can kind of effectively tell it that the transformation exists by augmenting the data.
[00:46:14] Speaker 2: Well, yeah, there's two ways. You can build that into the network. So there's equivariant or invariant according to your needs, or you can just blast it with data. Yeah. Either, you know, transforming the output in the same way you transform the input if you want it to be equivariant, or ignoring the transformations if you want it to be invariant, that kind of transformation.
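The difference between equivariant and invariant can be sketched on a one-dimensional signal. This is an illustrative toy, assuming wrap-around (circular) shifts so that the identities hold exactly; `circular_conv` and `shift` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(2)
signal = rng.normal(size=32)
kernel = rng.normal(size=5)

def circular_conv(x, k):
    """1D cross-correlation with wrap-around: same weights at every position."""
    return sum(k[i] * np.roll(x, -i) for i in range(len(k)))

def shift(x, s):
    return np.roll(x, s)

# Equivariant: shifting the input shifts the output by the same amount.
equivariant = bool(np.allclose(circular_conv(shift(signal, 3), kernel),
                               shift(circular_conv(signal, kernel), 3)))

# Invariant: a global sum after the convolution ignores the shift entirely.
invariant = bool(np.isclose(circular_conv(shift(signal, 3), kernel).sum(),
                            circular_conv(signal, kernel).sum()))

# A dense (fully connected) layer with random weights has no such structure.
dense = rng.normal(size=(32, 32))
dense_equivariant = bool(np.allclose(dense @ shift(signal, 3),
                                     shift(dense @ signal, 3)))
print(equivariant, invariant, dense_equivariant)
```

The dense layer could still learn shift-tolerant behaviour if it were shown shifted copies of the data, which is the "blast it with data" route; the convolution gets it from its structure.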
[00:46:36] Speaker 1: Yes. And there's an interesting relationship. We've got this bias-variance trade-off and, you know, we're doing statistics here. So in an ideal world, you would only need to give it one labelled example of the thing that you want to learn because the network would understand all of the, you know, transformations that can happen. That doesn't work. So there's a kind of, there's a middle ground between, on the one hand, we have this dumb MLP that knows nothing. And we're kind of introducing all of these, all of this, what would you call it, physics domain knowledge of how things can be transformed.
[00:47:06] Speaker 2: Domain knowledge seems a good term, yeah.
[00:47:08] Speaker 1: Yeah, yeah. Okay. Interesting. So we were talking about this manifold hypothesis. So that's this idea that there's some kind of a subspace in the data. And because theoretically, as you were saying earlier, according to the curse of dimensionality, it should be physically impossible because you would need more data points than there are atoms in the universe to train these models. So people cite the manifold hypothesis, which is to say there's some kind of subspace in the data where, you know, the network can kind of focus its attention on. What do you think about that?
[00:47:47] Speaker 2: I mean, in some case, there's a very simple case where you can almost describe that subspace, which is if you take somebody's face and you film them in fixed lighting from a fixed viewpoint and they don't move too much, then they have 42 muscles in their face. And there's only 42 things that can happen. And that's a very sort of physical description. Of course, the real world is more complicated than that. But there are still regularities. There's only certain kinds of materials out there. There's only certain kinds of lighting. Real images are not all images. Go, you know, go randomly select pixels and you see how many times you have to do that before you get something that doesn't look like noise. You will give up very quickly. We don't really know the size of that manifold, though, although you can estimate it because it's connected to the degree to which you can compress images. And so one thing that's really interesting to me about these new diffusion models is that, you know, they're not that big. You can put them on your hard disk, but yet they seem to be able to generate this incredible array of images so that we don't know what images they can't generate, of course, but a good portion, apparently, of the image manifold can be put on a hard disk, suggesting it's not actually, you know, the space of natural images is not actually that large.
[00:49:05] Speaker 1: Yeah, interesting. So you're saying the apparent creative and, you know, generative ability of these models suggests that there is some low dimensional manifold of images?
[00:49:15] Speaker 2: Yeah, well, there must be because we're not, you know, the generative model does not have 10 to the 150,000 parameters.
[00:49:23] Speaker 1: Yeah.
[00:49:24] Speaker 2: For the very good reason that, you know, we can't store even a tiny, tiny fraction of that.
[00:49:30] Speaker 1: Very interesting. Okay, so we should speak quickly about this, you know, universal function approximation. So if we start, first of all, with a shallow network, this universal function approximation theory basically says that, you know, if you approximate it to arbitrary precision, you can, you know, represent any function. And it's just a collection of basis functions, right? You know, we're just optimizing the placement of these basis functions.
[00:49:55] Speaker 2: Yeah, I mean, you can still think about it just dividing the input into convex polytopes too. That's, that's fine. Or you can consider it in terms of basis functions.
[00:50:06] Speaker 1: Right. So I've always thought that this universal function approximation theorem isn't particularly useful.
[00:50:16] Speaker 2: I mean, it is important that we know that you can represent, you know, with some caveats, you can represent any function you want to. What's really interesting is that you, to build anything that works really well, we seem to need 10 to 12 layers. And it's difficult to know, and there are quite a few different theories, why we would need a deep network, given the universal approximation theorem says a shallow network will do, can model anything.
[00:50:51] Speaker 1: Well, I mean, just to have a go at that. So that theorem is talking about a very wide, single layer neural network. And it seems to me that just placing basis functions on a single layer of a neural network, it's almost antithetical to generalization. It is memorization by definition. So you make this really interesting argument in your book, that when you introduce depth to a neural network, what you're doing is, is you're kind of progressively refolding those affine regions, right?
[00:51:26] Speaker 2: Yeah, that's one partial way to think about it. So one partial way to think about it is that if you have a two layer neural network, the first network can be thought of as folding and replicating the second network in a complicated way. And go look at a picture in my textbook, that's difficult to describe verbally, but that's only a sort of partial way of looking at it. There are other ways of looking at it as well. A different way of looking at it is that you're creating, I think, what you would call increasingly complicated basis functions. And then you're clipping those and recombining them to make more complicated functions still. So one way sort of emphasizes this symmetry and this folding that is going on. And another way emphasizes this sort of clipping and creating more joints in one dimension. They would just be joints in a function. In two dimensions, they would be sort of one dimensional folds in the function. And so that's a different partial way to look at it. But I don't think anyone, even for a two layer network, it's really hard to get a full picture in your mind of exactly what it's doing. But what's definitely true is there's now like a really complicated relationship between manipulating any one of those parameters and the whole surface that comes out at the other end.
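The folding view can be sketched in one dimension. This is a standard toy construction, not a figure from the book: a "tent" function built from two ReLU units folds the interval [0, 1] onto itself, and composing it with itself replicates the shape on each half, doubling the number of linear pieces with every extra layer.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    """Two ReLU units that fold [0, 1]: up to 1 at x = 0.5, back down to 0."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def count_linear_pieces(f, num_intervals=1024):
    """Count linear pieces of f on [0, 1] from slopes on a dyadic grid."""
    xs = np.linspace(0.0, 1.0, num_intervals + 1)
    slopes = np.diff(f(xs)) / np.diff(xs)
    return 1 + int(np.sum(~np.isclose(slopes[1:], slopes[:-1])))

def compose(f, depth):
    def composed(x):
        for _ in range(depth):
            x = f(x)
        return x
    return composed

pieces_by_depth = [count_linear_pieces(compose(tent, d)) for d in (1, 2, 3, 4)]
print(pieces_by_depth)  # [2, 4, 8, 16]
```

Four layers of two units each give 16 pieces here, whereas a single hidden layer with the same eight ReLU units can produce at most nine pieces in one dimension. The joints all land on grid points in this example, which is why the slope comparison counts them exactly.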
[00:52:47] Speaker 1: Yeah, and when we did our first show on spline theory, we were showing this neural network visualizer and you can add layers in and you can change the activation functions. And it kind of shows you the activation space. And very, very quickly, you get these very complex behaviors emerge where, you know, the hyperplanes kind of cancel each other out. And what you were just saying there that I want to get to is this topological interpretation, which is to say that as you complexify, as you go through the neural network in successive layers, you can recompose all of the prior neurons in previous layers that are topologically addressable. And this is a DAG that we're working in. So that's obviously a subset of those. And there's this really interesting thing that the neural network has a structure which is kind of defined by the early layers. So as you go through the neural network, it's kind of somehow constrained by what went before.
[00:53:42] Speaker 2: I mean, it's constrained in terms of information. If you lose information at the start, it can't be recovered. But I think you're trying to get at something else.
[00:53:52] Speaker 1: Well, some of it is to do with how, you know, like there's a continual learning dimension to this, which is to say early on in the neural network, those initial basis functions are making chops in the space. And then the way I think about it is all of the kind of complexification that happens later is kind of working inside those early regions. So it's almost like it's becoming more and more entrenched in what it does as you go through the neural network, which is to say that it becomes harder and harder for it to completely learn things differently.
[00:54:28] Speaker 2: I think, I mean, I'm trying to interpret the way you're expressing this, but I mean, perhaps what you're trying to say is that it's sort of collapsing bits of the space on top of itself and then treating those similarly and it can't then uncollapse them. So, and that might be how you could exploit regularities in data. That's one interpretation, but the mirroring, I think, actually kind of happens the other way around, that the early, maybe I misunderstood how you said it, but the early layers are defining how the mirroring will occur, how these reflections will occur in the later layers. So, really, in that view, it's the later layers, the sort of defining the structure and the early layers are then propagating that structure to different, in reflecting it like a crazy house of mirrors into the rest of the space.
[00:55:31] Speaker 1: I mean, I guess, like, one way I think about continual learning is the extent to which you could destroy a neural network. Let's say you take GPT and you just start fine-tuning it on noise. And what would happen? Would it very quickly forget everything it learned before or would it just refuse to budge?
[00:55:52] Speaker 2: Well, that's a complicated question that would depend on the learning rate. Yeah. I mean, these transformer models are quite interesting because they generally, as I understand it, and please correct me if you think I'm wrong, I don't think they train them to zero training error. I think they generally do one pass through their training data and stop.
[00:56:14] Speaker 1: Three or four, yeah.
[00:56:15] Speaker 2: So, or three or four. So, it's sort of, to some extent, probably remembering what it saw recently more than what it saw a million tokens ago, a trillion tokens ago, whatever it is. Yeah, yeah. How quickly, if you added noise, would it decrease? I mean, it really, yeah. So, what have you done by adding that noise in? You've changed the whole loss surface because now the loss surface either only includes the noise or I'm not sure you're suggesting if the noise is added in with the real data. But either way, the loss surface is now different. And you're heading now downhill from what used to be, I'm going to loosely say the global minimum, and I'll probably say that several times during this, but I mean the sort of level set of good solutions that fit the data well. So, you're now heading downhill from that point, which is raised up in your new loss surface. And how quickly it forgets just depends on the learning rate and the new shape of the loss function. So, I don't know if you could say anything definite about that.
[00:57:30] Speaker 1: I guess I was coming at it from, you know, Chomsky says that we see the world using our native priors. And in a sense, neural networks are the same. Look at a vision network. They have these Gabor filters that are baked into the early layers. And to a certain extent, after a while, the neural network only sees the world in terms of its fundamental basis function.
[00:57:48] Speaker 2: So, I don't like that idea of considering the, you know, you say these Gabor filters are baked into the early layers. And I would push back on that a bit. What people are saying when they say that is that this neuron has a high activation when you put in this kind of thing. And to some extent, that's sort of trivial. You've made some small set of basis functions because this network at this point only sees a 7x7 patch of the image or something. And just because it reacts very highly to that particular thing, it also, you know, if it's not the first layer, it's going to react in a complex way to lots of different things. So, what you're trying to do is you're trying to characterize this very complicated multidimensional polytope by one point that looks like a Gabor filter. But really, it's this incredibly complicated shape in 150,000 dimensional space. And so, I don't like the idea that you're going to describe. You're going to say, oh, well, we can characterize that shape by this one point.
[00:59:03] Speaker 1: Yeah, no, I think you're absolutely right there. So, that is a mistake that we're looking at interpretability methods, which grossly simplify a neural network. And we're kind of making statements based on that. So, I think you're absolutely right about that. But there's also this interesting dimension of training versus inference. So, it has been said that over-parameterizing is something which is helpful for stochastic gradient descent. But I would like your viewpoint on the extent to which we need to have all that representational capacity for inference. So, do we need to memorize all that stuff? Or could we actually strip it away?
[00:59:38] Speaker 2: Yeah. So, it definitely makes it easier. Terry Sejnowski has this beautiful expression where he says, you know, it goes from searching for a needle in a haystack to a haystack of needles. So, you wind up with this loss surface where there's a very high dimensional part of that loss surface that is the global minimum, by which I mean a good level set of solutions that fits your data perfectly.
[01:00:06] Speaker 1: Yeah. I have a feeling, and I know that you're going to push back on this, that there is an inherent model bias in neural networks, which is to say a lot of these inductive priors and training tricks and so on, they're ways of making the network focus on the modes in the data, the areas of regularity or where most of the variance is, and all of the low frequency attributes. And just to be as clear as possible about that, I mean, things that don't happen very, very often. They tend to be snipped or ignored or they're not learned. Is that something fundamental, do you think, to neural networks?
[01:00:41] Speaker 2: So, I think we do need to fit the data perfectly or close to perfectly. So, for our training data, we always get the right output because, ultimately, these things are quite stupid and there's nothing much they can do apart from interpolate smoothly and I'm using the word smoothly very vaguely because it's difficult to characterize exactly how they interpolate between them. And I'd rather interpolate smoothly between the correct true data points than a neural network that missed those in the first place. You know, how could that be better not to memorize the data?
[01:01:30] Speaker 1: So, I have a feeling this is a story of the tyranny of metrics, which is to say we optimize on accuracy and the neural network, I assume, would learn that, well, given that this is such a long tail, high entropy thing, it's probably better that I just don't bother learning it than get it wrong all the time by trying to learn it.
[01:01:51] Speaker 2: But I think it does learn, you know, it does learn from most data sets, the long tail of the data you give it because the loss becomes almost zero. So, for a regression model, the loss can literally become 0.000. For classification models, it's a bit more complicated because you're trying to push these softmax functions to infinity so you never get there.
[01:02:16] Speaker 1: Training loss, though.
[01:02:16] Speaker 2: Training loss, yes.
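The contrast between regression and classification training loss can be illustrated with a small sketch. The logit values are hypothetical; the point is only that a squared error reaches exactly zero when the prediction matches the target, while softmax cross-entropy keeps shrinking as the logits are scaled up and stays positive.

```python
import numpy as np

def cross_entropy(logits, true_class):
    """Softmax cross-entropy for one example, computed stably."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[true_class])

# Regression: once the prediction equals the target, squared error is 0.
squared_error = (3.0 - 3.0) ** 2

# Classification: scale up an already-correct logit vector and watch the
# loss shrink while remaining positive.
base_logits = np.array([1.0, 0.0, -1.0])
scales = (1, 5, 10, 20)
losses = [cross_entropy(s * base_logits, true_class=0) for s in scales]
print(squared_error, losses)
```

In exact arithmetic the cross-entropy only reaches zero in the limit of infinite logits. In floating point it will eventually round to zero for very large scales, which is a limitation of the number format rather than of the argument.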
[01:02:18] Speaker 1: But that is not going. So, I guess, like, those things are like atoms in the universe. And even though the network knows about them in the training set, they would never have any statistical power.
[01:02:34] Speaker 2: So, that wasn't a question.
[01:02:36] Speaker ?: Well, it was.
[01:02:38] Speaker 1: No, I'm just testing the principle. But I guess, you know, we'll talk about ethics and bias and fairness later on. But, you know, it's really difficult with a single training objective to square the circle, to have high accuracy and high fairness.
[01:02:56] Speaker 2: Yeah, absolutely. I think what you're worried about is areas of the data space that represent, perhaps, minorities of individuals and particularly intersections of those minorities where, you know, perhaps, you know, gender and race or what have you. And so, you wind up with a very small number of training data points. And I think it's a question of whether, you know, you're attributing the failure to generalize that to the neural network. But that might be unfair. Maybe the failure to generalize that is that there is literally not enough statistical regularity in the data you've given it for it to possibly generalize.
[01:03:41] Speaker 1: Interesting. Interesting. Okay. Well, just to close the loop on chapter three and four. So, there are some other things that we tweak in neural networks, like the initialization, things like the learning rate and which activation functions we use and stuff like that. So, can you sketch out some of that?
[01:04:00] Speaker 2: So, I think it's interesting to think about that through the lens of either learning or generalization. So, we've learned quite a bit about which things affect learning and which things affect generalization. Surprisingly, the data set confusingly doesn't affect the learning so much. So, you can randomly perturb the labels.
[01:04:22] Speaker 1: Yeah.
[01:04:23] Speaker 2: And the neural network will still learn the training data fine. Or you can shuffle the input pixels of images. The neural network still learns the training data fine. So, there are some things you'd be really surprised about it can learn despite that the neural network will learn despite those factors. Also, other things in terms of learning.
[01:04:46] Speaker 1: Could we just pause there? Because that was a really interesting bit and we'll show an image in your book. So, there are two curves and one curve is when you're just memorizing essentially random information. And then the other curve on the left, so the horizontal distance between those curves is the generalization gap. So, on the other curve is where you have the actual data with the actual labels. And then there's another curve in the middle where you have the actual data but with, you know, random labels. So, I guess, like, the first intuition is, isn't it interesting that neural networks don't struggle much more to learn completely random data?
[01:05:21] Speaker 2: So, this is a famous paper from Zhang et al., 2017. And it shows that you can perturb these labels or you can shuffle around the input data in various ways. And it still learns perfectly but it learns more slowly. So, but that might just be because there are regularities in real data. So, you move the surface up and it naturally fits through several data points. And now it has to contort itself more to fit completely random data. So, I don't think the length of time it takes to fit it is really significant. But it shows how incredibly flexible the model is that you can throw at it any data and it will fit it and interpolate between it. It's just that the interpolation is meaningless if the labels had no information in them.
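A small stand-in for that experiment, with loud caveats: this is not the setup from Zhang et al. It swaps in a random-feature ReLU model fitted by least squares instead of a deep network trained with SGD, and the data are synthetic. It only illustrates that a model with far more weights than data points fits real and shuffled targets equally exactly on the training set.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, dim, width = 50, 200, 10, 500

# Hypothetical data: targets are a fixed linear function of the inputs.
true_direction = rng.normal(size=dim)
X_train = rng.normal(size=(n_train, dim))
X_test = rng.normal(size=(n_test, dim))
y_train, y_test = X_train @ true_direction, X_test @ true_direction
y_shuffled = rng.permutation(y_train)  # same values, link to inputs destroyed

# Random ReLU features (fixed first layer); only output weights are fitted.
W, b = rng.normal(size=(dim, width)), rng.normal(size=width)

def features(X):
    return np.maximum(0.0, X @ W + b)

def fit_and_score(targets):
    # With 500 weights and 50 points, lstsq returns a minimum-norm solution
    # that passes through the training targets.
    weights, *_ = np.linalg.lstsq(features(X_train), targets, rcond=None)
    train_mse = float(np.mean((features(X_train) @ weights - targets) ** 2))
    test_mse = float(np.mean((features(X_test) @ weights - y_test) ** 2))
    return train_mse, test_mse

train_real, test_real = fit_and_score(y_train)
train_rand, test_rand = fit_and_score(y_shuffled)
print("real labels:    ", train_real, test_real)
print("shuffled labels:", train_rand, test_rand)
```

Both training errors come out at numerical zero. The held-out errors are printed for comparison; the fit to shuffled targets has nothing meaningful to interpolate, which is the sense in which the interpolation is meaningless when the labels carry no information.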
[01:06:04] Speaker 1: Yeah. It's interesting that you say you can't intuit how difficult a problem is by the convergence time.
[01:06:12] Speaker 2: So, the convergence time depends on a lot of different factors but one of the main ones is where you initialize the parameters. So, we initialize the parameters for very practical reasons to certain variances because if you don't you get exploding gradients where for people who don't know about this. Essentially, the numbers in your neural network, the activations and the outputs start taking wildly large values and you can no longer represent them in finite precision floating point numbers. Or if you don't set them correctly, they wind up being too small and we call that vanishing gradients. But it's referred to in terms of gradients but actually it happens in a forward pass through the network as well. So, we have very practical reasons for initializing our networks where they are. But actually that also depends, you know, we still have a range of values we can set it to and it still works without breaking the, reaching the limits of the computer's precision. But actually the magnitude you set the weights to drastically determines the training time. So, and it also determines generalization. So, there's a sort of Goldilocks zone, they call it, not too big, not too small, just right, of parameter magnitudes that cause, that seem to give the loss function, sensible properties, positive curvature, where you want positive curvature. And I think that's pretty easy to understand that if all the parameters are too small, then it can only create very flat, I mean, loosely speaking, people who really know what they're talking about are going to take offense to this, but loosely speaking, if the parameters are very small, it makes very flat functions. And maybe your function isn't very flat, and if the parameters are really huge, then it has functions with big, sharp changes, and you don't want that because maybe your data is quite smooth. 
So, it takes longer, if you don't initialize the parameters correctly, to kind of reach the reasonably smooth pattern of your data. Yeah. And of course, that also speaks to generalization, because you want it to interpolate smoothly between your data, so you don't want wildly varying, large parameter values that cause big fluctuations between your data points, but still fit the data points. So, the time it takes to train depends on the initialization as well, but if you give it enough time, it seems to get there. So, essentially, you can get cases where it fits the data perfectly quite early on, but then it takes ages and ages, and suddenly it generalizes, and this has been interpreted, or my best knowledge of the interpretation is that this is what happens when you set the magnitude of the parameters wrong. What's happening is it's fit the data correctly, but it is wildly varying between the data points, but something in our methods for training, presumably some feature of stochastic gradient descent, or some other explicit or implicit regularization that you've put in there, is causing the network, the solution, even though it already fits the data, to sort of traverse this loss surface. It's not at the minimum, it's in a family of minima, and it's traversing through that family until it eventually gets to something that's smoother and interpolates well, and then it generalizes. But exactly why it should move through this surface to something that's smoother is, again, quite complicated and subtle.
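As a minimal sketch of the initialization point in this answer (activations growing or shrinking already in the forward pass), the following pushes a random vector through a stack of random ReLU layers. The width, depth, and the three variance factors are illustrative assumptions, not values from the conversation; a factor of 2 corresponds to the commonly used He initialization for ReLU.

```python
import numpy as np

def final_activation_scale(weight_var_factor, width=200, depth=50, seed=0):
    """Mean squared activation after `depth` random ReLU layers.

    Weights are drawn with variance weight_var_factor / width
    (a factor of 2 corresponds to He initialization for ReLU).
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(width)  # input with mean squared value near 1
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(weight_var_factor / width)
        h = np.maximum(W @ h, 0.0)  # linear layer followed by ReLU
    return float(np.mean(h ** 2))

too_small = final_activation_scale(1.0)   # activations shrink layer by layer
just_right = final_activation_scale(2.0)  # activations stay at a similar scale
too_large = final_activation_scale(4.0)   # activations grow layer by layer
print(too_small, just_right, too_large)
```

In this toy setting the too-small and too-large factors change the activation scale by many orders of magnitude over 50 layers, while the middle setting keeps it roughly stable. It only illustrates the forward-pass effect; it says nothing about training time or generalization.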
[01:10:11] Speaker 1: Yes. I mean, folks, definitely check out the Neel Nanda episode. But yeah, as you say, there's this phase change from memorization to sudden generalization. I think Neel did say that it's a bit of an illusion, and people slightly misinterpret how it works. And as you said, we're trying to design the loss surface to be as smooth as possible. And I think his original paper was on modulo division, which is not very natural, because I guess part of why it's possible to coax the loss surface to be so smooth and interpretable is because of this natural data thing as well. So maybe an artificial data set like that wouldn't be so amenable, so you would get this sudden grokking later.
[01:10:50] Speaker 2: Yeah, perhaps you mean that the easiest way to fit it is sort of with some function that, yeah, there's no natural smoothness to it. And consequently, you wind up fitting it with a very complicated function with a lot of variance between the data points. And then it takes a lot of regularization and rambling over this loss surface until we get somewhere sensible. Whereas for a more typical data set, you naturally wind up with a sort of smoother surface to begin with. I mean, that makes sense to me.
[01:11:28] Speaker 1: Yeah, but it is one of the crazy things about deep learning, and grokking was a surprise, because most of the time, it's like cooking. The secret is in the preparation. So you kind of massage the loss surface, and then you just know, you can almost predict, just like OpenAI could predict how much training was required to get GPT-4 level perplexity. You know, there seems to be a regularity in terms of, okay, it's going to converge, and because it's over-parameterized and finessed in a certain way, it's going to converge in a certain amount of time. It's going to have a certain shape to it when you train it.
[01:12:06] Speaker 2: Yeah, I mean, the most recent paper that I've read on this, and you should understand that I'm very much like ChatGPT, every time I finish writing the chapter of a book, I stop reading, because if you don't stop reading, you'll go crazy. There's, you know, 4,000 papers coming out every month, and you can't possibly keep up to date. The most recent paper I read on this was called Omnigrok, and basically it says that that time is predictable, because the magnitude of the weights is gradually decreasing at some fairly constant rate. And eventually they come into this Goldilocks zone where things generalize well, and that's when suddenly the generalization performance improves.
[01:12:45] Speaker 1: Cool, cool. Now, the book makes a really, really huge effort, I would say, to visualize things in images and low dimensions. What was the thinking behind that?
[01:12:56] Speaker 2: Well, I think people learn in different ways, so I try and have three ways to understand everything. There's textual descriptions, there's equations, and there's pictures. And so, to me, the mental process of connecting those things, which you sort of actively have to do if you read the book, leads you to a deeper kind of understanding. A lot of ideas, even diffusion models and GANs, can be drawn in one or two dimensions, and you can really effectively convey these concepts in a non-technical way before you hit the equations, which is good for learning. But you also have to be a little bit careful, because multidimensional space doesn't quite work in the way you expect it to. There are a lot of strange phenomena, so if you take two random points that are Gaussian distributed, you know, after about 100 dimensions, or even less than that, they're almost certainly nearly orthogonal to one another. A famous one is that in a hyper-orange, a multidimensional orange, basically all of the volume is in the skin, and none of it is in the pulp. My favorite one is if you take a multidimensional sphere of diameter one, a hypersphere, and you embed it in a multidimensional cube, you can imagine in two dimensions you've got a circle in a square, in three dimensions you've got a sphere in a cube. As the dimensions increase, it turns out the proportion of space the sphere takes becomes zero. It's all in the corners. Yes. So, I draw these pictures in...
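The three high-dimensional effects mentioned in this answer can be checked numerically. The sketch below is illustrative only: the function names, the number of sampled pairs, and the 5% "skin" thickness are assumptions chosen for the example. The ball volume uses the standard closed form pi^(d/2) r^d / Gamma(d/2 + 1), evaluated in log space to avoid underflow.

```python
import math
import numpy as np

def mean_abs_cosine(dim, n_pairs=2000, seed=0):
    """Average |cosine| of the angle between pairs of random Gaussian vectors."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_pairs, dim))
    b = rng.standard_normal((n_pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.mean(np.abs(cos)))

def ball_fraction_of_cube(dim):
    """Volume of a diameter-1 ball divided by the volume of the unit cube (which is 1)."""
    radius = 0.5
    log_volume = (
        (dim / 2) * math.log(math.pi)
        + dim * math.log(radius)
        - math.lgamma(dim / 2 + 1)
    )
    return math.exp(log_volume)

def skin_fraction(dim, skin=0.05):
    """Fraction of a ball's volume lying in the outer `skin` fraction of its radius."""
    return 1.0 - (1.0 - skin) ** dim

for d in (2, 3, 10, 100):
    print(d, mean_abs_cosine(d), ball_fraction_of_cube(d), skin_fraction(d))
```

As the dimension grows, the average cosine drifts toward zero (near-orthogonality), the ball's share of the cube collapses, and almost all of the ball's volume sits in a thin outer shell.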
[01:14:41] Speaker 1: Start the question of dimensionality again.
[01:14:43] Speaker 2: Yeah, I draw these pictures in one or 2D. You know, I like to joke that's because MIT Press wouldn't give me a budget for a five-dimensional book. I asked for it, but they stuck with two dimensions.
[01:14:54] Speaker 1: Yeah.
[01:14:55] Speaker 2: But you should be cautious. But having said that, I don't think it's necessary to work in the super high-dimensional spaces we work in to see the phenomena of deep learning. And there's a really interesting data set that I use a lot in the book called MNIST-1D. So, our simplest data set is MNIST. It has some disadvantages in that, for example, we're too good at solving it now to really demonstrate much on it. But this makes an even simpler data set. And basically, what it consists of is 40-dimensional data. It's just one-dimensional data, and they start with a template that looks a little bit like one of the ten digits. So, for example, the zero has one bulge, and the eight has two bulges. And they find a way to... Actually, it's one guy, Sam Greydanus is his name, and the paper's called Scaling Down Deep Learning. Oh, right. Completely against the prevailing winds of deep learning, which are to scale everything massively. And then that is translated, so it's a type of data that's amenable to convolutional networks, for example, because there's a dimension that you can be equivariant to, or an aspect of it you can be equivariant or invariant to, and then it adds various amounts of noise. And it's completely procedurally generated, so you can generate any amount of data with any amount of noise. And the interesting thing is, even though this is only 40-dimensional, most of the kind of epiphenomena of deep learning can be seen. So you can show adversarial examples, you can show lottery tickets, you can show double descent, and you can do all of this. You know, this is now in a size that you don't even need the GPU. You can just run it in your local Python window on the CPU. And I think this is a really interesting data set. So, the actual huge experiments we're building, we're now building things that are sort of of the order of the complexity of a human mind. 
And I come from a biological sciences background, basically, and if you want to try and understand that, what you do is you go out and you collect a data set as thoroughly as you can, along as many dimensions as you can, and then you go away and try and come up with a theory that explains that data set. So in this case, when is it trainable, when is it generalizable? And here we have a data set where, in theory, you could do that. You could train many different networks, varying all the parameters, because it's very quick to train. And then you could try and properly come up with a theory that says, okay, when I've got this kind of data with this amount of noise, these statistics, these are the networks that will work. These are the networks that will train. These are the networks that will generalize and have proper, tight bounds on those things. So to me, this is a really interesting data set that could be a test bed for actually understanding deep learning better, but which is currently not used at all because you can't publish a paper unless you get state-of-the-art on some enormous data set with millions of examples in it.
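To make the "procedurally generated" idea concrete, here is a toy generator in the same spirit: a fixed 1D template per class, placed at a random shift inside a 40-sample signal, plus Gaussian noise. This is explicitly not the real MNIST-1D procedure. The sinusoidal templates, the function name `make_toy_1d_dataset`, and every parameter are hypothetical stand-ins chosen only to illustrate template, translation, and noise.

```python
import numpy as np

def make_toy_1d_dataset(n_examples=1000, n_classes=10, length=40,
                        template_len=12, noise_std=0.25, seed=0):
    """Toy stand-in for a procedurally generated 1D classification set."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, template_len)
    # One fixed template per class (hypothetical shapes, not digit-like ones):
    # the number of bulges and the sign vary with the class index.
    templates = np.stack([
        np.sin(np.pi * (k % 5 + 1) * t) * (1.0 if k < 5 else -1.0)
        for k in range(n_classes)
    ])
    X = np.zeros((n_examples, length))
    y = rng.integers(0, n_classes, size=n_examples)
    for i, label in enumerate(y):
        # Random translation: where the template sits inside the signal.
        shift = rng.integers(0, length - template_len + 1)
        X[i, shift:shift + template_len] = templates[label]
    X += noise_std * rng.standard_normal(X.shape)  # additive noise
    return X, y

X, y = make_toy_1d_dataset()
print(X.shape, y[:10])
```

Because both the amount of data and the noise level are arguments, a sweep over many small networks and noise settings of the kind described above stays cheap enough to run on a CPU.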
[01:18:04] Speaker 1: Yeah, because we were going to talk earlier about the alchemy of deep learning, because we don't really have any overarching theories of deep learning. I mean, there's things like NTK, and we spoke about spline theory and stuff like that, but there's not really much, is there?
[01:18:18] Speaker 2: There are bits and bobs. A lot of it relies on unrealistic assumptions. Some of it relies on the network being infinitely wide, which actually I'm less worried about, because infinitely wide probably means somewhere along the way you're relying on the central limit theorem or the law of large numbers, either of which probably converges to something like a Gaussian or the expected value a long time before you really are infinite. So those ones I'm kind of okay with. The ones that rely on, you know, the network width being the square of the number of data points or something, that starts to worry me because that's sounding like it's based on some sort of combinatorial geometric argument, and that really might need that true number to work. There is a history of the experimenters rushing ahead, doing things, often introducing components to our networks, which definitely empirically help, and explaining them in some way. And then later on, when we go back and look at those, it turns out they don't do exactly what we said they would. And the theory people lag a long way behind. You know, some interesting bits of theory, I think, are neural network Gaussian processes. You mentioned NTKs. They're sort of, I think of them anyway, as the Bayesian version of that. And so they can predict some things about trainability, but largely they predict things about trainability that we already knew from experimentation. So, you know, using some of that, they have built networks with thousands of layers, but those networks with thousands of layers don't get cutting edge results. So you have to wonder, had the experimentalists already discovered this, but then they can't get a paper out of it, so they just throw it away. Yes. So, not to knock that work, I think it's really interesting, really important, but just to say it lags behind. 
And we're in this weird situation where a whole bunch of the apparatus we've put in to make networks work doesn't do exactly what we think it does. So when you first learn about stochastic gradient descent, you're taught that the reason we put it in is so you can bounce over local minima, perhaps into another valley. But that turns out to be absolute nonsense. I don't think that's what's happening. You can learn quite large neural networks with enormous batch sizes, so not much randomization, or sometimes even full batch, and it still gets to the bottom. So why do we still use stochastic gradient descent? I mean, number one, it's fast. You're only using a subset of the data points, and if you've got millions of data points, that saves you a lot of time. Number two, it seems to have some regularization effect, and you can actually characterize that, and there's a section in the book about implicit regularization. Basically it's the difference between taking infinitesimally small steps, which you would call gradient flow, and finite steps. And once it starts doing that, combined with the stochastic element, you can show that there's a certain regularization expression that you can write in closed form that you're effectively adding to the loss function, and so the whole learning takes a slightly different trajectory and gets to the bottom in a different place. Another example of experimentalists sort of rushing ahead would be BatchNorm, which has a really peculiar history. So BatchNorm...
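The gradient-flow versus finite-step remark can be illustrated on a one-parameter toy problem. The sketch assumes the commonly cited form of the implicit regularizer for full-batch gradient descent, a modified loss L + (alpha/4) * (dL/dtheta)^2 with learning rate alpha. That form is my reading of the result the speaker alludes to and is not quoted from the conversation, and the quadratic loss and constants are arbitrary choices. It leaves out the stochastic part entirely.

```python
import math

a = 1.0       # curvature of the toy loss L(theta) = 0.5 * a * theta**2
alpha = 0.1   # learning rate, i.e. the finite step size
theta0 = 1.0  # starting parameter value

# One finite gradient descent step on L.
gd_step = theta0 - alpha * a * theta0

# Gradient flow (infinitesimal steps) on L for time alpha, in closed form.
flow_plain = theta0 * math.exp(-a * alpha)

# Gradient flow on the assumed modified loss L + (alpha / 4) * (dL/dtheta)**2,
# whose gradient is (a + alpha * a**2 / 2) * theta.
flow_modified = theta0 * math.exp(-(a + alpha * a ** 2 / 2) * alpha)

err_plain = abs(gd_step - flow_plain)
err_modified = abs(gd_step - flow_modified)
print(err_plain, err_modified)
```

On this toy loss the continuous trajectory on the modified loss tracks the discrete step noticeably more closely than the trajectory on the original loss, which is the sense in which finite steps behave like an extra term added to the loss.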
[01:22:00] Speaker 1: Is that Goodfellow? Well, no, is it Szegedy?
[01:22:04] Speaker ?: Szegedy.
[01:22:04] Speaker 1: Szegedy. Yeah, I had him on the show, yeah.
[01:22:09] Speaker 2: BatchNorm was introduced originally to tackle something called covariate shift, which I'm not sure I totally understand, but I think basically the idea, you know, you can correct me if I'm wrong here, is that after you've adjusted the parameters in the later layers, the changes to the parameters in the earlier layers no longer make any sense, which from a pure mathematics sense doesn't make a lot of, you know, it's not understandable, but I guess it's to do with the fact that you're not actually taking infinitely small changes, you're taking finite changes. But subsequent experiments have shown that you can introduce covariate shift and it doesn't help. And BatchNorm was later adopted because after they figured out how to stop exploding gradients and vanishing gradients, they developed residual networks and that put those problems back in. And BatchNorm resets the variance and consequently stops gradients exploding. So it found a use, but then it turns out there are other ways to solve that problem. You can solve that problem by just resetting the gradients without taking into account the batch statistics. It turns out that it has a regularization effect because it's basically adding random noise. The batch is slightly different each time. It also has more complicated effects where it allows one bit of data to sort of leak information to another, because one can have a really enormous value, which makes the batch variance large, which then shrinks the values of the other one. And that's why they don't use it in transformers where you're using masked attention, because the whole point of that is that the words earlier in the sentence don't have access to the data further along in the sentence. You'd be sharing this across the batch, so they use layer norm instead. So BatchNorm is an example of something that was introduced for one reason, adopted for another reason. 
It's now kept in there, or some form of it is kept in there, because it has this indirect regularization effect, and it also has some definite disadvantages that cause it to have to be adapted in certain circumstances.
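The leakage effect described above, where one example with enormous values changes what the other examples in the batch look like after normalization, can be shown in a few lines. This is a stripped-down sketch: both functions omit the learned scale and offset that real normalization layers have, and the batch size, feature count, and factor of 100 are arbitrary illustrative choices.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature with statistics computed across the batch (axis 0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each example with its own feature statistics (axis 1).
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 8))  # 4 examples, 8 features
perturbed = batch.copy()
perturbed[0] *= 100.0                # give one example enormous values

# How much do the OTHER examples (rows 1..3) change after normalization?
bn_change = np.abs(batch_norm(perturbed)[1:] - batch_norm(batch)[1:]).max()
ln_change = np.abs(layer_norm(perturbed)[1:] - layer_norm(batch)[1:]).max()
print(bn_change, ln_change)
```

With batch statistics the untouched examples move because the shared variance grew; with per-example statistics they do not move at all, which is the property that matters when later positions must not influence earlier ones.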
[01:24:26] Speaker 1: Yeah, but I was going to make a comment that if anything, as time has gone on, we seem to have less reliance on some of these tricks and almost more focus on just big models, more compute, more data and so on. Do you think there's any truth to that? Because you just said yourself that in many cases, people don't even understand exactly what the effect of this thing is. When they do ablation studies, they tend to learn actually this thing wasn't needed after all. Is there a trend towards simplifying the process?
[01:24:55] Speaker 2: I'm not sure there is a trend towards simplifying it. I think there should be a trend towards simplifying it. There's a great paper that came out in around 2020 where they analyze all of the things that have been done to transformers actually scientifically in these different papers and find that hardly any of them make any difference. Yes. And so, you know, it's tied to our obsession with state of the art. You try all of these different things and then you don't interrogate too closely whether this particular activation function really was critical because now, you know, it's two days before the NeurIPS deadline and you've just got the state of the art and so people, you know, and then no one ever gets around to going back and looking at these things. But I know when transformers were introduced, they were super complicated to train. You had to use this thing called learning rate warm-up. There are very complicated effects of where you put the layer norm, what happens with the variance in the residual layers, the difference in the size of the parameter gradients as you go through this softmax function versus the path that computes the values versus the residual. You know, you've got three parallel links and they were super hard to train and they relied more on training things. But actually, the sort of state-of-the-art training transformers, which I guess is sort of what you're implying when you're talking about very large models, I don't know if they've eliminated. I just don't know the answer to your question.
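For readers who have not met learning rate warm-up, a minimal sketch follows. It uses a plain linear ramp followed by a constant rate; the function name, `base_lr`, and `warmup_steps` are hypothetical, and published transformer recipes typically combine the ramp with some form of decay afterwards, which is omitted here.

```python
def warmup_learning_rate(step, base_lr=1e-3, warmup_steps=1000):
    """Linearly ramp the learning rate from 0 to base_lr, then hold it constant."""
    if warmup_steps <= 0:
        return base_lr  # no warm-up requested
    return base_lr * min(1.0, step / warmup_steps)

# Learning rate at a few points in training.
schedule = [warmup_learning_rate(s) for s in (0, 250, 500, 1000, 5000)]
print(schedule)
```

The idea is simply that the earliest updates, taken while the parameter gradients along the different paths are badly scaled relative to one another, use a much smaller step than the rest of training.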
[01:26:28] Speaker 1: Yeah, I mean, it's kind of, it's related to this graduate student descent thing that we've been talking about in general. And I mean, I was speaking with some folks at NeurIPS and, I mean, Hinton, for example, was always a big fan of contrastive learning where you kind of, you know, contrast an image to all the other ones in the data set. And then I think LeCun at FAIR kind of pioneered this non-contrastive approach, which is where essentially you kind of, I think similar to Siamese learning, mutate pairs of images. So you're kind of pushing images off the manifold rather than comparing them to everything else. And there are so many variations of that, like Barlow Twins and, you know, God knows what else. It's been a while since I looked at it. But these things are all just millions and millions of minor engineering variations on an idea.
[01:27:16] Speaker 2: Yeah. And it's interesting that this is now what we take to be results in a scientific conference. We don't necessarily value insight. There are papers that are searching for insight, but it's a smaller community that is lagging behind. And I mean, personally, part of the reason why I stopped doing experimental research was that I really love the ideas and I really love trying to understand what's going on. But it often takes, you know, graduate student descent. It takes the supervisor, you know, an afternoon to come up with an interesting idea and then, you know, 20 minutes to persuade the graduate student that it was really their idea. And then, PhD supervisors, you know what I'm talking about. And then off they go for like six months experimenting to get the criteria they need to get into the conference. And I don't think that's a good use of scientific time or public money. That's probably a good use of Google's money and OpenAI's money. But is that a good use of Stanford's money? Maybe not. Maybe they should be doing something that's a bit more scientific and that has some genuine insights, not trying to push this benchmark by another 0.1%. Because often it's just sort of chance as well. Just a couple of weeks ago there was a paper that argued that actually ConvNets do just as well as vision transformers, or certainly comparably to the best vision transformers, on the ImageNet classification task. It's just no one's ever pre-trained them with Google's enormous training database before and deployed them at such enormous scale. So we can't even necessarily trust all of these results. I mean, we can trust them in that, you know, in 99% of cases they did what they did and they got the results that they got, but we can't trust the scientific conclusion that vision transformers are way better than convolution until someone's properly done some science to do an exactly comparable experiment. 
We're not a very scientific community. We've got better. In the, you know, previous decades it was crazy. People would make claims, they'd say, okay, I've invented this new method, but actually they changed like seven things, and the new method would often not be the thing that made the performance gain. And we've got slightly better because now we do do ablation studies, but we're still fundamentally not very scientific. You know, these are people who are trained as engineers, not scientists.
[01:30:04] Speaker 1: 100%. And even when they do ablation studies, it's very expensive, and it's now become a game that only the top players can play. And OpenAI, for example, I mean, they've done some very cool research over the years, you know, about scaling laws and RLHF and, you know, CLIP and GPT and blah blah blah. But, you know, ultimately, and I don't want to sound cynical, but the success of their methods is just so much data. They've built a Google vNext organisation. They're hoovering up all of this data, and I think a lot of people underestimate how valuable the data of people using ChatGPT is. So they're hoovering data.
[01:30:44] Speaker 2: Oh, absolutely. There's a reason why they let people use it for free.
[01:30:47] Speaker 1: Oh, 100%. So it's like it's one step away from a Mechanical Turk. So in the background they're just generating more and more data, and they've cultivated this image that they're doing science, and for reasons that we've just discussed, I think they're doing engineering. And yeah, I think I'm really looking for something different now.
[01:31:11] Speaker 2: Yeah, I don't mean any disrespect to the engineers. It's no doubt incredibly difficult to engineer ChatGPT on that kind of scale. Yeah. But yeah, again, it's not very intellectually satisfying sometimes, those papers. There's not many insights that you can get from them, other than, I suppose, it's really interesting that you keep adding data and, you know, it does keep scaling. That's not something we necessarily would have predicted. And so the amount of data perhaps will be the limit on these kinds of models, not even computation or model size. It'll literally be the amount of data that you can hoover up off the internet. We'll see what happens. But yeah, often there's not many insights, and I would like to reiterate that you can get most of the interesting phenomena of deep learning in 40 dimensions. And so if you want to do science, or you want to try a new idea, don't even try it on MNIST, try it on MNIST-1D, and if it works there, then, you know, then try scaling it up. And if it's not an interesting idea when you try it in your tiny data set in 40 dimensions, where no one cares what the state of the art is, then is it worth writing a scientific paper about it?
[01:32:35] Speaker 1: I know. I think one thing that happened with GPT is that it transgressed the uncanny valley. And, you know, we were talking earlier about how people psychologise and use all of these other adjacent mental frameworks to understand what's happening. The reason why people are so excited about the singularity and AGI and all of this stuff is because we've got this artifact now which is remarkable, and it's just memorised all of the data on the internet. And I think that there's more to cognition than just knowing, you know, just being able to retrieve information. I think it's something that is very, very good used interactively by a human. So I'm using it like I would use Google, and I'm searching for information, and it's doing things which are remarkable in some ways.
[01:33:18] Speaker 2: You're still the intelligence there. It's what you're asking that's the interesting thing, not the way that it responds.
[01:33:24] Speaker 1: 100%. And that's why when you set it up in an autonomous arrangement, so something like AutoGPT, or even getting it to do something on a schedule, the magic disappears. It's like this uncanny valley, or, you know, like the suspension of disbelief or whatever, it just instantly disappears, and it does stupid things and you think, oh, I can't believe I was fooled by that, I was fooled by randomness. There's like the UX and the fact that it's an extended mind. It's what, you know, David Chalmers and Andy Clark called an extended mind, essentially. So it's like a cognizing element as part of my physical, you know, process of cognition. But so many people are kind of taken in by it in that weird way.
[01:34:08] Speaker 2: Yeah, I think you have a higher opinion of it than I do. Like, my experiments with it haven't been super successful. I've asked it repeatedly to generate a story given a few premises. It almost always generates the same story, with even the same name of the protagonist. And it knows really well about some things because there's a giant corpus of data on, you know, and unfortunately it knows really well about the things that computer scientists know about, because they are present in abundance on the web. So if you want to ask a bit about search algorithms, it really knows about search algorithms and it can write code for them, because there is repository after repository implementing different sorts of algorithms. But, you know, once you get to the fringes of its training data, then it hallucinates, or, you know, they're trying to stop it doing that, so it sort of craps out altogether. I agree that it's crossed some uncanny valley, and it's obviously the first thing that's sort of captured the public's imagination. For the first time, you know, I'm going to talk to my non-AI friends, it is possible to have non-AI friends, and they're asking me about, oh, do you know about ChatGPT? And then, interestingly, they often have very strong opinions about how it works, which are completely wrong, and often don't ask me, the domain expert, what's going on, which I find curious too. So people have their own theories.
[01:35:55] Speaker 1: Question. Well, I mean, I guess, you know, because like Sutskever and Hinton are saying it could lead to super intelligence, and I guess based on the conversation we've just had, we know that we're interpolating a data manifold, and we know that, you know, from a computational point of view, it's got fixed compute, fixed width and so on. So like, how could it possibly be a super intelligence?
[01:36:20] Speaker 2: I don't think that it could. I mean, it would have to start somehow manipulating its internal representations to make them more coherent, to come up with new ideas and then test those and embed them in its body of knowledge. It really has no way to learn anything new other than its context vector, which it forgets every session. Yeah. So even if it did have a way to manipulate the information that it had introduced into that context vector and formulate it into something new, you know, perhaps perform some logical deductions on it or come up with some new hypothesis based on it, it has no way to even remember that. So we're missing all kinds of parts of the puzzle. I don't see how you could look at that and think, oh, super intelligence is around the corner. Yes. I mean, question for you. I don't like talking about super intelligence because I think it's just an untalkable-about topic, but we could talk AGI. So how far is it to AGI?
[01:37:32] Speaker 1: Well, I mean, I guess part of the reason I find it strange is, you know, I think these folks think that there is such a thing as a pure intelligence. And if it has just learned a data manifold, then that must be situated, because that's the data we've produced in our physical environment here on the planet. So, you know, all of this has been reduced into a data representation, and we're interpolating that. As you said, it's non-interactive, so it can't, you know, seek new information, and it's non-reflexive, so it can't reflect on its own context or anything like that. So it's just an information retrieval system which has become part of our cognitive process.
[01:38:16] Speaker 2: Yeah, I don't even like the word intelligence. I think capability is a better word to use. It has certain capabilities. Some of those capabilities are better than humans. It can tell you right now the recent history of epigenetics off the top of its head, and I cannot do that. I think it would be much better to stop using the word intelligence and to start talking about capabilities, because that's a very concrete thing. You can say it can do this task or it can't do this task. And what does AGI mean? It means that for some broad set of capabilities, it can do a large enough proportion of them that we think this is general intelligence, and that's measurable.
[01:39:04] Speaker 1: I'm not a huge fan of that, and I guess I'll explain why. So yeah, I mean, intelligence, I love this Chinese proverb of Mount Lu, which Pei Wang talks about. He's an intelligence researcher, and he basically says it's a complex phenomenon that is beyond our cognitive horizon, and just like this mountain range, Mount Lu, depending on your perspective of the phenomenon, the range looks completely different. So you do get these different perspectives, like capability, behaviour, structure, principle and stuff. Well, sorry, I guess I told you my perspective. But the reason I don't like it, because capability is like, you know, a behavioural interpretation of intelligence, and the reason I don't like that is GPT doesn't do anything, you know, because all cognitive processes are kind of externalised and physical and so on, and, you know, we enact a cognitive process when we use it. So the artefact itself in some sense doesn't have any measurable capability.
[01:39:57] Speaker 2: I think I disagree with the premise that GPT doesn't do anything.
[01:40:05] Speaker 1: Well, but it doesn't. I mean, it's just a bunch of, I mean, I guess this is quite an esoteric externalised argument of cognition, because you could argue that a chess computer, like, the meaning of the computation is in its use. So it doesn't intrinsically do anything. You know, we use it as part of our cognitive physical processes.
[01:40:22] Speaker 2: Give an example of something that does do something. I think you're using this phrase in a very specific way that I don't quite follow.
[01:40:34] Speaker 1: Well, we do stuff. We scuba dive. You know, we actually do stuff. And again, maybe I'm just being a human chauvinist, but I'm completely on board with it becoming embedded in our cognitive ecosystem.
[01:40:49] Speaker 2: AutoGPT does stuff. Not very well, but it does it.
[01:40:55] Speaker 1: Well, does it, though? I don't know. Does it do, I mean, yeah, like in the way a cat flap does stuff. You know, you can make it do things, it can execute code and so on, but I wouldn't say it has agency. I mean, you know, agency is about being able to wrestle with the complexity in your environment as an observer with, like, limited compute, as Wolfram would say. So, you know, it's about that kind of complexity differential between the situation I'm in versus the information I have access to.
[01:41:30] Speaker 2: And so, I feel you're going to push back on this, but how do you see reinforcement learning then? Isn't that exactly what you're talking about? There's literally something called an agent that has agency, maybe via only a sort of dumb algorithm that does something random until it hits on some reward, but it explores its complex environment, it solves problems, sometimes because it's getting direct rewards, sometimes because of priors we've put into it, like essentially curiosity. You don't see that as doing something?
[01:42:10] Speaker 1: I don't know I don't see that as dramatically different from the cat flap scenario I guess it's I mean I love this Fristonian idea that agency is actually a description of certain dynamics you know which is that you get this kind of like locus of information density where a thing is effectively planning many steps ahead so you could say therefore it has agency so in the case of reinforcement learning the thing I don't like about that is again we can talk about free will but someone has designed a reward function the final chapter of your book is all about ethics and algorithmic fairness and right now there's this interesting juxtaposition between near-term safety and long-term safety in fact I'm concerned that the word safety has been conflated in its meaning and people are using one word for long-term and near-term risks and in your book I think you were mostly focused on the near-term risks the bias the misinformation the fairness and stuff like that but yeah tell us about the chapter
[01:43:13] Speaker 2: Right, so, yeah, let me tackle the chapter first. The chapter was co-written, by which I mean he did all the work, with a researcher called Travis LaCroix, who's at Dalhousie University, which is in the east of Canada. I'm an engineer; I didn't feel personally confident that I could write an intellectually coherent chapter on ethics, but I was strongly encouraged to. I mean, it's sort of interesting that in an engineering textbook now we need a chapter on ethics, and that's because of the sort of force magnification of this technology. It's very powerful. You can read this book and go and build things that are not good for society, so it's imperative that you at least think about ethics. I'm not trying to tell you what to do. So I wrote this chapter with Travis, who by the way has his own book coming out next year on value alignment, which people should check out, and he comes with a philosophy background. So we went through a whole bunch of iterations of him using long philosophical words and me trying to chop that down until it could be easily understood by engineers. We cover some aspects of this that are commonly discussed in our world, as you say, bias, explainability, things like that, and some more philosophical things about moral agency and value alignment, which are more discussed probably in his field. I think I only let him use the word epistemology once, as a compromise. But I think the interesting thing about that chapter is more the kind of conclusion of it, which is, you know, an appeal to scientists to realise that you have to be aware of and take responsibility for the actions you take. You think you're just doing maths and this is value free, but everything you do is, they would say, value laden, and that's often deeply, deeply baked into what we do, so much so that you can't even see it. So the very fact that we judge our papers based only on whether you've got state of the art, and we don't report whether they're fair, whether they're explainable, how much carbon needs to be emitted to train them or to run them, that is a value that is baked into our community that we don't necessarily think about. And so the book ends with an appeal to sort of interrogate yourself and think about the impacts of your work and which communities it's going to affect. Individual scientists don't necessarily have a huge amount of control, but you have some control. You can choose who you work for, you can choose which problems you work upon. In some sense it's a collective action problem, in that individual people don't have very much control, but collectively you can do things if you organize and decide that this is the kind of thing we should be pursuing or not pursuing. And so that's the way the book concludes. And then I'd like to sort of divorce the discussion of that from any of my personal views on what the risks of AI are, which don't represent Travis's views and also not the views of the University of Bath. Let's have that on camera. The book does focus on near problems, largely bias, explainability and so on, and touches slightly on near societal problems, mainly just because they're much more concrete and easy to talk about than the distant problems, I think.
[01:47:10] Speaker 1: there's a good or harder to talk about depending on how you look at it
[01:47:13] Speaker 2: Yeah, okay, so I mean that's sort of what I'm coming around to. I think there actually are worries at all levels, from the people who are worried about AGI and beyond AGI, I'm not even going to use the word, to midterm societal levels, to problems that are there right now. And I intrinsically disagree with anyone who says, of all of the problems, you know, we need to totally concentrate on this, and that sucks the oxygen out of the discussion about the other thing. We're essentially an extremely rich community. You know, if for every hundred million dollars of venture capital we had one person working on one of these problems, that would amount to a lot of people, and we could study this at all levels. At the very distant level of AGI, and the word that I'm not going to mention, it's really difficult for me to... I think they undermine their own arguments by being so inconsistent. You know, you can read books on this, and on one page they say we shouldn't anthropomorphize systems that are more intelligent than us, and then on the next page they say, well, obviously it will want to do this, and it will worry about doing this, and it will try and stop us doing this. So often the arguments are sort of incoherent, and then huge assumptions are made. Frequently people say, well, obviously it will have free will, or obviously it will have a sense of self. And so I don't honestly really want to engage with that end of things. I don't think I have anything more intelligent to say about it, but it doesn't mean that we should totally not worry about it. I think it's really great that people are worrying about the short-term issues, bias and explainability. I'm pessimistic about explainability, more optimistic about bias. I think the thing that isn't talked about very much is how we manage bringing this technology into society. So we are going to replace an enormous number of jobs, and it's not the way that Geoff Hinton said it, there'll be no more radiologists. What's going to happen is one radiologist will be able to do the job that was formerly done by 20 radiologists. And you see that happening already. Now Gmail automatically completes your emails, and consequently, if your job is mainly about emailing, you know, you can do that 1% faster and you need 1% less people to do that. And already what we've got, just the latest version of ChatGPT and DALL-E 3, they can do a lot of work, which is going to reduce the number of people we need to accomplish things in companies and, I think, cause a very large number of knowledge workers, managers, HR people to become unemployed in a short amount of time. McKinsey, I think in 2018, estimated there could be 800 million people made unemployed by 2030. I don't know if they're sticking to that, but I'd say if anything we're moving faster than we might have anticipated. And this is really worrying for society, because a major factor that causes civil unrest and instability is people that used to have high status but now have lost that status, and you're going to create that situation en masse, very quickly. People have entered into a compact with society where they put off earning money for a long time so they could train to become a lawyer or train to become a doctor, and now suddenly their status has all gone. You know, we just don't need that many lawyers or doctors anymore. I think eventually we will create new jobs to replace those people, but technology is coming at us very fast, like a wall, and I don't think society can adapt. So I actually, for reasons completely different to the AGI people, think we should try and slow down deploying some of this technology into society.
[01:51:47] Speaker 1: Yeah, so it's interesting that you have that take. I'm a little bit more sceptical on the job replacement and the enfeeblement part, mostly because I think people are still overestimating how good this technology is. So for example, we've had Copilot, which is a thing that helps software engineers generate code, and I don't think you'll really find any software engineers that will tell you it's going to replace them. In fact, most engineers don't think it even makes their job that much easier, and that's because there's so many layers of complexity to being a software engineer, and having, like, a code autocomplete really doesn't even move the needle.
[01:52:28] Speaker 2: right the syntax of programming is not the difficult bit the difficult bit is organising the code base in a sensible way
[01:52:34] Speaker 1: Oh, absolutely. And I think, you know, I don't want to trivialise the... there's going to be huge job displacements and market changes and so on, but, I don't know, I still think the good old-fashioned risks around algorithmic fairness and bias are still number one. I'm reading this book called Broken Code, which is all about Facebook and how they did all of their internal engineering, and there was a real revolution at Facebook after about 2011 when they changed the news feed from being chronological to being a recommender system. Everyone hated it at first, but they forced it on, and essentially any kind of human curation in many different parts of Facebook got weeded out. Whether it was the news feed, for example, they weeded that out, and they almost wanted to weed it out intentionally because they didn't like it when people said they had a conservative bias or something like that. And it meant that they could absolve themselves of any responsibility, because the algorithm is doing it and it's just giving people what they wanted, and it created a sewer. There's a real risk that we're just starting to trust the algorithms to do stuff for us.
[01:53:51] Speaker 2: I think it is a risk, but I will point out to you, you've just done exactly what I cautioned against, which is, I think there are problems at all levels. There is a huge problem with bias and with, you know, algorithms filtering the information we get, but that doesn't mean... this is how the dialogue always proceeds, is that you say, oh, bias is really important, and I say, oh no, no, unemployment is important, and we don't... we need to talk about both of these things. You know, your reaction is exactly what everybody's reaction is, which is to
[01:54:37] Speaker 1: minimize the employment
[01:54:39] Speaker 2: which, no, it's just to talk about the thing that they care about. The things you are talking about are important, but I want to broaden that debate. The reason that I bring up unemployment is because I don't think it's talked about enough, and I think it's almost inevitable, or you'd have to make really powerful arguments to me to say that people are not going to be able to do their jobs more efficiently in the coming years and that that's not going to wind up with people not being hired in the first place. Young people just not... you know, if you're a designer at a company, let's imagine you make greeting cards, and you used to draw those greeting cards by hand, but now you draw one in the style you like and then you just sort of sketch the next ten and say, you know, fill it in in the style we have. You've just gotten rid of a whole bunch of greeting card designers. I don't think there's a good argument against that. People don't need more greeting cards, or it'll take a while to find more human needs for more people to fill. And I do think there are things that governments, you know, we could be thinking about, actually how to tackle this. So one thing that might slow it down is making companies responsible for the effects of their products. So if I, you know, once we're off mic, tell you the recipe for nerve gas and you go and deploy it, I will be an accessory before the fact, and when you get arrested, I will get arrested, and we'll both go to jail. But if ChatGPT does that, I'm pretty sure OpenAI have a clause in the terms and conditions that you didn't read that says in no way are they responsible for that. If you started to make them responsible for that, that would be a way that you could slow things down and make things safer, and I have no doubt that people would still be able to make plenty of money in this environment. And you could have more stringent... you know, they're starting to get at this, Joe Biden's executive order... you've got to think a bit how tax is going to work. So, well
[01:57:05] Speaker 1: could I touch on something you said though Richard I think we need to distinguish the diffusion of skills versus agency so I think at the moment these things are glorified calculators
[01:57:17] Speaker 2: so do I but calculators made they got rid of people who used to calculate
[01:57:22] Speaker 1: they didn't and that's that's very true but right now all of the software service to do something because what the things don't do because what's really concerning if it ever happened is the automation of our agency so and from an algorithmic fairness point of view if GPT really was creating a skill program and it was being enacted on a server without any human supervision and doing stuff and no one was responsible that would be a huge problem but I don't think that is a problem
[01:57:55] Speaker 2: No, I don't think that's a problem yet anyway, although there may come a time where we gradually cede control because bit by bit the computer does it a bit better, right? There's also the kind of real risk of de-skilling. So if, for example, we have self-driving cars that drive themselves most of the time, you get less experience driving, and then it kicks you out when it's dangerous, and you've now only been driving for a few hours in the last year instead of hundreds of hours, and you have less experience. This is sort of connected, to me: you've sort of gradually given up your agency, and that's a real thing. There was an Air France crash in the South Atlantic that was attributed to exactly that. The autopilot kicked off, as I understand the situation, and it turned out the autopilot works, and crucially they didn't have experience in near-dangerous situations, because the autopilot was good enough to handle that. It kicked them off in an extremely dangerous, ambiguous situation where sensors weren't working. And so there's a possibility that we will cede our skills gradually and only be called to use them in emergency situations, which is not a good thing. I don't think that's quite where you were getting at, but
[01:59:25] Speaker 1: no no that's really important so I agree enfeeblement is a huge problem although you could argue that doing some kind of easily automated skill because we're getting into like purpose and moral value here and you might argue that society at the moment recognises you doing some trivial job as having value and in the future we might not see it that way but I agree that the slippery slope is the problem which is that they're still doing the same job but you're eroding their free will and they're becoming automatons and you're setting them up to fail
[02:00:04] Speaker 2: yeah yeah
[02:00:05] Speaker ?: yeah
[02:00:05] Speaker 2: I mean I think that might be happening in certain warehousing kind of jobs already that you're essentially using them because you can't build robots quite well enough yet yes keep talking
[02:00:21] Speaker 1: well what's the solution to that so because it's almost it sounds like a luddite position to say well let's just keep them busy in the warehouses rather than free them up to do something different
[02:00:35] Speaker 2: I mean you're asking me for solutions to problems that no one has solutions for I do think a sense of purpose and a sense you know human beings in my this really is my personal opinion not science are built to work and they're not happy if they don't accomplish things it's a very common phenomenon people retire and they really love it for the first year and then the second year they feel kind of purposeless and lonely and especially if their job was something that strongly you know sort of corresponded to their sense of self like a doctor or something then they feel purposeless and unhappy and we are drifting towards that there are certain people probably like yourself who are autodidacts who are going to go crazy at the point when they have no responsibilities
[02:01:33] Speaker 1: and
[02:01:37] Speaker 2: Me too, but unfortunately the whole world isn't built... you know, we're not at all a representative sample of the population. So, okay, but
[02:01:45] Speaker 1: I guess I'm saying that there's an element of paternalism to this argument which is I mean I agree with you right now and it might be because of the way work has been enacted for centuries that our society and recognition has kind of built purpose around that and sometimes you have to do strange weird things for quite a while for society to change and are we saying I almost the progression of what will come next
[02:02:15] Speaker 2: I mean I think you're sort of suggesting a grand experiment where we enter a phase of chaos and society reorganizes itself around different principles well
[02:02:31] Speaker 1: no but that's actually a really cool point because this gets to the core of what we're discussing also with long-term risk which is the word risk is actually the key operative word here which is do we allow scary things to happen or should we prevent them from happening
[02:02:48] Speaker 2: Well, you know, this is an interesting thought experiment, but let's have a thought experiment and I'll put you on the spot. You've been asking me difficult questions all afternoon. There's a switch and you can flick it and it makes... you know, so this isn't really science fiction, we pretty much know this is possible, or most people agree that it's possible. There's a switch, you can flick it, and AGI is created, but just to human specs. So, you know, you can create a Tim, and let's make it sort of concrete. So Tim can output at the speed of ChatGPT 4 now, and can draw pictures at the speed of DALL-E 3, but can communicate nearly instantaneously in the way that computers we already know do, and remember an enormous number of facts. And you get the choice, and the technology for Tim 2.0 is going to be randomly allotted to a big technology company and, let's say, one of the nuclear powers. You spin the wheel, it's going to go to one of them, you don't know which. And your question is, do you flick the switch? Would you flick that switch, knowing that tomorrow they're going to make 100,000 Tims on their servers and set them doing something, hopefully just to maximise their shareholder value? Would you do that?
[02:04:12] Speaker 1: I think this is a great example of we can't predict the future and you know the road to hell is paved with good intentions and there are so many situations where we trust our moral intuitions and even the long termists make this argument that they think they can see the future and in this particular case it could easily end up being a good thing
[02:04:33] Speaker 2: so you're saying you haven't quite answered the question
[02:04:37] Speaker 1: well no the honest answer is I don't I don't know but I guess this is a great example where our moral intuitions are almost always wrong
[02:04:47] Speaker 2: right so I don't know the answer
[02:04:50] Speaker 1: so the intuitive answer I think most people would say no don't do it
[02:04:54] Speaker 2: I think I would say no too
[02:04:56] Speaker 1: yeah
[02:04:57] Speaker 2: and your answer
[02:04:59] Speaker 1: I'm fairly ambivalent to be honest
[02:05:03] Speaker 2: yeah but you've got to do it the switch is there you either flick it or don't there's no there's no being ambivalent means not
[02:05:10] Speaker 1: well the only reason I wouldn't flick it is you know like it's the personal identity thing in philosophy so I don't want there being lots of other tims around if there was someone else
[02:05:19] Speaker 2: if it was you
[02:05:20] Speaker 1: I'd flick the switch, because there wouldn't be psychological continuity with the other Tims.
[02:05:26] Speaker 2: So the reason I think everybody who works in AI should think about this question is that if you work for a company that has AGI baked into its core, if you work for OpenAI or DeepMind, who, you know, literally write that AGI is their mission, you are moving that switch a tiny bit every day, and you should think about whether you... you know, so you've got this huge diffusion of responsibility: well, it's not me, there's thousands of people all working on this. But you're moving that switch. It's moving up towards the halfway point, to the point where it just flicks upwards or downwards, depending on which continent you're on. And so you should ask yourself that question if you're actively working towards that goal, and if, as you say, for most people the answer is no, maybe you should work for a different company.
[02:06:20] Speaker 1: anyway Professor Prince this has been an absolute honour
[02:06:24] Speaker 2: it's been a real joy to talk to you I've watched many of your podcasts I've learned so much from them and amazing I look forward to the ones that come out in the future
[02:06:35] Speaker 1: cool thank you so much for joining us today
[02:06:37] Speaker 2: thank you