We’re told that AI progress is slowing down, that pre-training has hit a wall, that scaling laws are running out of road. Yet we’re releasing this episode in the middle of a wild couple of weeks that ...
There is this thing that's happening in AI. And in AI, every week now, a lot is happening. Fundamentally, if you look at AI progress, it's been a very smooth exponential increase in capabilities. This is the overarching trend. It's not like pre-training fizzled out.
It's just that we found a new paradigm that at the same price gives us much more amazing development, and this paradigm is still very new. I think one of the biggest things that I would say people kind of know on the inside and others don't is that already right now, it's not about the progress. There are so many things ChatGPT or Gemini or any other one can do for you that people just don't realize. You can take a photo of something broken, ask how to repair it. It may tell you.
You can give it college-level homework and it will do it for you.
Hi. I'm Matt Turck. Welcome to the MAD Podcast. My guest today is Łukasz Kaiser, one of the key architects of modern AI who has quite literally shaped the history of the field. Łukasz was one of the coauthors of the "Attention Is All You Need" paper, meaning he's one of the inventors of the transformer architecture that powers almost all the AI that we use today.
He's now a leading research scientist at OpenAI, helping drive the second major paradigm shift towards reasoning models like the ones behind GPT-5.1. This episode is a deep exploration of the AI frontier: why the AI slowdown narrative is wrong, the logic puzzles that still stump the world's smartest models, how scaling is being redefined, and what all of that tells us about where AI is heading next. Please enjoy this fantastic conversation with Łukasz. Łukasz, welcome. Thank you very much.
There was a narrative, at least in some circles, maybe outside of San Francisco, throughout the year that AI progress was slowing down, that we had maxed out pre-training, that scaling laws were hitting a wall. Yet we are recording this at the end of a huge week or couple of weeks, with the release of GPT-5.1, GPT-5.1-Codex-Max, and GPT-5.1 Pro, as well as Gemini 3, Nano Banana Pro, Grok 4.1, Olmo 3. So this feels like a major violation of that narrative. What is it that people in frontier AI labs know about AI progress that at least parts of the rest of the world seem to not understand?
I think there is a lot to unpack there. So I want to go a little slower. There is this thing that's happening in AI. And in AI, every week now, a lot is happening. You know?
New models, coding, doing slides, self-driving cars, images, videos. It's a nice field that doesn't let you get bored for long. But through all of this, it's sometimes hard to see the fundamental things that are happening. And fundamentally, if you look at AI progress, it's been a very smooth exponential increase in capabilities. This is the overarching trend, and there has never been much to make me, at least, and I think my colleagues in the labs, believe that this trend is not continuing. It's a little bit like Moore's Law.
Right? Moore's Law happened through decades and decades, and arguably you would say it's still very much going on, if not speeding up with the GPUs. But, of course, it did not happen as, like, one technology bringing you there for forty years. There was one technology and then another and another and another. And this went on for decades.
Right? So from the outside, you see a smooth trend. But from the inside, of course, you know, progress is made through new developments in addition to the increase of compute power and better engineering. And so all of these things come together. And in terms of language models, I think there was a big pivotal point.
I mean, one point was, of course, the transformers when it started, but the other point was reasoning models. And that happened, I think, with o1-preview, a year and a bit ago or something like that. We started working on it maybe three years ago, but, you know, it's very recent. If you think of it as a paradigm, that's a very recent thing. So it's always like these S-curves.
Right? It starts, then it gives you amazing growth, and then it flatlines a little bit. Yeah, we'll get to pretraining. I feel pretraining in some sense is on the upper part of the S, but it's not like scaling laws for pretraining don't work.
They totally work. What the scaling laws say is that your loss will decrease log-linearly with your compute. We totally see that, and clearly Google sees that, and all other labs. The problem is, you know, how much money do you need to put into that versus the gains you get? And it's just a lot of money, and people are putting it in.
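As a rough illustration of what "log-linear" means here, a compute scaling law is usually written in a form like the one below; the symbols are generic placeholders, not any lab's actual fit.

```latex
% Loss falls as a power law in training compute C.
% L_inf is the irreducible loss; C_0 and alpha > 0 are fitted constants.
L(C) \approx L_{\infty} + \left( \frac{C_0}{C} \right)^{\alpha}
% On a log-log plot, \log\big(L(C) - L_{\infty}\big) is linear in \log C,
% so each 10x of compute buys a roughly constant drop in excess loss.
```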
But with the new paradigm of reasoning, you can get much more gains for the same amount of money, because it's on this, like, lower end of the curve. There are just discoveries to be made, and these discoveries unlock insane capabilities. So it's not like pretraining fizzled out. It's just that we found a new paradigm that at the same price gives us much more amazing development, and this paradigm is still very new. It happened so fast.
I think if you blink, you may miss it. Because basically, you had GPT-3.5 in chat, and it would give you answers, and it used no tools, no reasoning. It would answer you something. And now you have ChatGPT, and, you know, if you were not into it, you may have blinked, and it also gives you answers. And you may say, okay.
It's more or less the same, except ChatGPT now will, you know, go look on some websites, reason about it, and give you the right answer instead of something it memorized in its weights. I very much used to like this example of, you know, what time does the SF Zoo open tomorrow? Old ChatGPT would tell you, right? It would totally hallucinate from its memory an hour that it read, probably on the zoo's website from five years ago, and it didn't know what today or tomorrow was.
So it would just assume it's a weekday. ChatGPT now knows what today is because it's in the system prompt. It goes to the zoo website, reads it, extracts the information. If it's ambiguous, it probably checks three other websites just to confirm, and then gives you the answer. But if you blink, you may think it's the same. But no.
It's dramatically better. And, you know, as a consequence, since it can read all the websites in the world, it can give you answers and do stuff that it wouldn't be able to even touch before. So there is tremendous progress, right? And it happened so fast that it may even be missed.
I think one of the biggest things that I would say people kind of know on the inside and others don't is that already right now, it's not about the progress. Like, there are so many things ChatGPT or Gemini or any LLM can do for you that people just don't realize. Like, you can take a photo of something broken, ask how to repair it. It may tell you. You can give it college-level homework, and it will probably do it for you.
So that's absolutely amazing.
So there is an education gap to some extent?
Well, it just happened. Like, I mean, you said Codex, right? You know, programmers are a little bit conservative. I still use Emacs from time to time.
All the coding tools were like, okay, it will complete one line for me, but people were very much like, this is my editor. I write code here. Now people are like, no. This is Codex. I ask it to do stuff.
I will fix it later. Right? But I think it's in the recent few months that the transition happened from, you know, people using it sometimes, but rarely, to now basically this being how a lot of people work in coding. That's quite big. I'm not sure everyone's aware of it, but it's also like, if you don't do programming, why would you be aware of it?
I do believe, though, that this will come to more and more domains.
To the point of all of this being very new and somewhat sudden, something that you or I hear from time to time when talking to people is that part of the reason why people are so optimistic is that there is a lot of low hanging fruit, very obvious things to improve for those models in the next few months. First of all, do you agree? And second, can you give us some examples of, like, obvious things that you need to fix next and that the industry will fix?
Yes. There is a ton of extremely obvious things to fix. The larger part of this ton is just hard to talk about on a podcast because it is in the engineering part. You know, every lab has their own infra and their own bugs in the code. Machine learning is beautifully forgiving in some sense, in contrast to old software engineering, which would just yell at you when you made a mistake.
You know, our Python code will generally probably still run if you got something wrong, except much slower, and give you worse results. So you realize, oh no, it was wrong, and you improve it, and the results get better. These are huge distributed computing systems. They're very complex to run.
So there is just a huge amount to improve and fix and understand in the process, about just how to train your model and how to do RL, because RL is more finicky than pretraining. It's harder to do really right. So every day, this is our day-to-day work. On top of that, there is data. You know, we used to train on just, like, Common Crawl, basically.
It's a big repository of the Internet that people just scraped without much regard for what was in it. Some things came in, some didn't. It was a mess. So now, of course, every larger company has a team that tries to filter this and improve the quality. But it's a lot of work to really extract better data.
Now synthetic data is becoming a thing, but when you generate synthetic data, it really matters how you do it and with what model, and then there are the engineering aspects of everything. It's such a new domain that, you know, it was done somehow, it works, it's beautiful, but there is just so much to do better that people, I don't think, have any doubts that there is a lot there. And on top of that, there are the big things like multimodal. I mean, language models are now, as I'm sure you know and most people realize, actually vision-language models, and they can also do audio.
So they're multimodal models to some extent, but the multimodal part still lags behind the text part to a large extent. So that's one big area where, obviously, you'd need to do better, and it's not a huge secret how you can do better. You know, there are some methods that maybe will make it even amazingly better, but there are some very simple methods to just do better. But, you know, this maybe requires retraining your whole base model from scratch, and that takes a few months, and it's a huge investment, so we need to organize it. So there is a lot of just work that will undoubtedly make things better.
I think the big question that people have in their mind is how much better will it make them?
So I'd love to do a little bit of a deep dive slash educational part on the whole reasoning model aspect because as you just mentioned, since it's so new, some people truly understand how those work. Many people don't. At a very simplistic level, what is a reasoning model, and how is that different from your sort of base LLM?
So a reasoning model is like your base LLM. But before giving you the answer, it thinks in what people call the chain of thought, meaning it generates some tokens, some text that's meant not for you to read, but for the model, to give you a better answer. And while it does this, these days it is also allowed to use tools. So it can, for example, in its so-called thinking process, go and browse the web to give you a better answer. So that's the superficial part of the thinking models.
Now the deep part is that you start treating this thinking process as part of the model, basically. So it's not something the model generates as an output for you. It's something you want to train. Right? You want to tell the model, you should think well.
You should think so that the answer after this is good, in whatever way. And this leads you to a very different way of training the model, because models were usually trained with just gradient descent, the way deep neural networks are trained. Meaning, you say predict the next word, and you do a gradient. You differentiate the function the model computes. They're not fully differentiable, but you approximate it, and you train your weights to do that.
And it was quite amazing that doing just that, you could make a chat. But with the reasoning model, you can't do that, because there is this reasoning part that you can't differentiate through. So we train this with reinforcement learning. And reinforcement learning basically tells you, okay, there is just this reward, and you need to do a bunch of tries and reinforce, meaning push the model towards doing more of the things that lead to better answers.
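As a rough sketch of the contrast just described (generic PyTorch-style code, not any lab's training stack): the pretraining loss is differentiable end to end, while the reward on a sampled chain of thought is not, so you reinforce the tries that worked instead.

```python
import torch
import torch.nn.functional as F

# Pretraining: predict the next word; the loss is differentiable end to end.
def next_token_loss(logits, targets):
    # logits: [batch, seq, vocab]; targets: [batch, seq]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# Reasoning RL: the chain of thought is sampled, so you cannot backprop through
# it; instead you push up the log-probs of the tries that earned a high reward.
def reinforce_loss(sampled_logprobs, reward, baseline=0.0):
    # sampled_logprobs: [batch, seq] log-probs of the tokens the model actually
    # sampled (chain of thought plus answer); reward: [batch], e.g. 1.0 if the
    # final answer was correct and 0.0 otherwise.
    advantage = reward - baseline
    return -(advantage.unsqueeze(-1) * sampled_logprobs).mean()
```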
And this kind of training has more restrictions than the training we used before. With the training we used before, you took all of the Internet and put it in. Even if you didn't filter it very well, it would mostly work. With reinforcement learning, you need to be careful. You need to tune a lot of things, but you also need to prepare your data very carefully.
So currently, at least for the most basic ways we use it, it needs to be fairly verifiable. So there is an "is your answer correct or not?" You prepare data for that. You can do that in mathematics and coding very well. You can do this in science to some extent.
Right? You can have test questions that are correct or not. But, you know, if it comes to, like, writing poems, is this poem good or not? For now, the reasoning models really shine in domains like science, and they've brought some improvements to non-science domains, but it's not quite as huge yet as it could be, at least compared to mathematics and coding.
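A minimal sketch of what "verifiable" means in practice, assuming a math-style task where the final answer can be checked programmatically; the checker below is a hypothetical example, not an actual grader.

```python
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the last number in the model's output matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0

# By contrast, "is this poem good?" has no such check; you would need a learned
# grader (a preference or reward model), which is much easier to fool.
```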
Then there is the multimodal question. How do you do reasoning in multimodal? I think this is starting. Like, I saw some Gemini creating images in the reasoning part. That's quite exciting, but it's very, very... yeah.
The pre-training and reinforcement learning part is particularly interesting, from an educational point of view again, because it seems that people have come to the conclusion that there is the pre-training world and then the post-training world, and the post-training world is mostly reinforcement learning. But this idea of how reinforcement learning relates to the pre-training, I don't think, is as well understood by everyone.
At the beginning of ChatGPT, let's say, there was pretraining. People did not do RL, right? But then you couldn't really chat with it. So ChatGPT was RLHF applied to a pretrained model, but RLHF was a different kind of RL of sorts.
Right? It was, like, very small, and it was human preference that was telling you what is better. That's what the HF is: human feedback. Right? You showed people, like, pairs of stuff.
You learned a model that says, well, people seem to prefer this as an answer. You trained with that. It would very quickly, what we today call, hack this model. Like, if you trained the RLHF too long, it would start giving things that satisfy this model that is supposed to model human preferences. So it was a bit of a brittle technique, but it was a bit of RL that was extremely crucial to making the models chat.
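A minimal sketch of that reward-model step, assuming the standard pairwise-preference setup (a generic Bradley-Terry-style loss, not OpenAI's actual implementation):

```python
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # reward_chosen / reward_rejected: [batch] scalar scores the reward model
    # assigns to the human-preferred answer and to the other answer.
    # Train the reward model so the preferred answer scores higher.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# The chat policy is then trained with RL to maximize this learned reward;
# run that too long and the policy "hacks" the reward model, as described above.
```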
These days, I think most people have moved towards this big RL. It's still not as big as pretraining in scope, but it says: you have a model that says whether this is correct or not, or if it's a preference, it's a very strong model that analyzes things and says you should prefer that. And you have data that's restricted to a domain where you can do this well, and then you can also put some human preferences on top, but you make sure that you can run a little bit longer without making this whole grading fall apart. But, again, this is our route today. I do believe the RL of tomorrow will be broader.
It will work on general data. And maybe then it will expand to, like, domains that go beyond where it shines today. Now, will it shine there? That is a different question. Do you really need to think very much to do some of these things?
Maybe not, but maybe yes. Maybe we do more thinking and reasoning than what we kind of consciously call thinking.
What would it take for RL to generalize? Is that better evaluations? Like, you guys released GDPval a few weeks or months ago to sort of measure performance across broad economic sectors. Is that part of what the system needs?
I think this is a small part of it. I think that's one part. But if you think of economic tasks, you know, making slides is important there, following instructions, doing calculations. It's not math, but it's still very verifiable. Right? What I'm thinking about is, when you do pretraining, you take the Internet and you just ask what's the next word.
You know, you could think before you ask what's the next word, obviously. Now, you don't want to think before every word, probably, but I don't know if you've ever looked at the training data for, like, a real pretraining run. I think people mostly don't realize how bad this is. Like, hotels.com is a great website compared with the average chunk of 2,000 words from the Internet. It's a mess.
Right? And it's also a miracle that from this, the pretraining process gets you something reasonable. So you probably don't want, you know... imagine you have a hotel website telling you, you know, it's a beautiful vacation. You don't necessarily want to have a very long chain of thought before that. Right?
If it was written by a person, there was probably some kind of thinking that went into it. Maybe not as elaborate as the math and coding thinking, but maybe there was something going on. So maybe you want a little bit of thought before at least some of the text, and that our models can't do very well yet. I think they're starting to. There is a lot of generalization in this reasoning. If you learn to think for math, some strategies transfer very much, like: look it up on the web, see what they say, and use that information.
So some of these things are very generic, and they start to transfer. I feel like some are maybe not yet; especially thinking in the visual domains is very undertrained, I believe. But, you know, we're working on it. So we will try to push for more of that.
Going back to chains of thought, how does that actually work? How does the model decide to create that chain of thought? And is what we see, so the little, you know, intermediary steps that we see on the screen as users, the chain of thought that's exposed to us, is that what's actually being processed by the model, or is there a deeper, longer, broader chain of thought that happens behind the scenes?
So in the current ChatGPT, you will see a summary of the chain of thought on the side. So there is another model that takes the full chain of thought and shows you a summary, because the full ones are usually not very nice to read. They more or less say the same thing, just in messier words. So it's better to have a more readable summary. When you start with a chain of thought, in the first paper about chains of thought, you basically just ask the model, please think step by step, and it would think.
So if you just pretrain a model on the Internet and ask it to think step by step, it will give you some chain of thought. The interesting and most important point is that you don't stop there. You say, okay, you start with some way of thinking, and then you say: sometimes this leads to a correct answer, and sometimes it leads to a wrong answer. So now I'm telling you, I have some training examples.
You will think a hundred times, and say 30 of those lead you to the correct answer; then I'll train you on these 30 examples and say, this is the way you should be thinking. That's the reinforcement learning part of training. It changes dramatically how the models think. We see this for math and coding, but the big hope is it could also change how the models think for many other domains. Even for math and coding, you start seeing that the models start correcting their own mistakes.
Right? Earlier, if the model made a mistake, it would generally just tell you what it did and insist that the mistake was right, or something like that. With the thinking, it's like: oh, I often make mistakes, but I need to verify and correct myself to give the correct answer. So this just emerges from this reinforcement learning, which is beautiful. Right?
It's clearly a good thinking strategy to verify what you want to say. And if you think it may be an error, then think again. That's what the model learns on the most abstract level.
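A minimal sketch of the "think a hundred times, keep the thirty that got it right" recipe described above, in its simplest rejection-sampling form; `model.sample` and `finetune` are hypothetical helpers, and real training uses proper RL plus a lot of infrastructure on top of this idea.

```python
def collect_correct_traces(model, problems, reward_fn, samples_per_problem=100):
    """Sample many chains of thought per problem and keep only the ones
    whose final answer is verified correct."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = model.sample(problem)                 # chain of thought + answer
            if reward_fn(trace.answer, problem.reference) == 1.0:
                kept.append((problem, trace))             # "this is how you should think"
    return kept

# One round of the loop: generate, filter with a verifiable reward,
# then train the model on its own successful reasoning.
# finetune(model, collect_correct_traces(model, train_problems, math_reward))
```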
Great. Thank you for this. Alright. As a quick detour, and we'll go back to more frontier AI topics, I'd love to talk a little bit about your story. I mean, you have the incredible distinction of having been at the forefront of this industry, both the Transformers paper, which was the birth of one paradigm, and now you're very much leading the charge on the reasoning model part, which is another paradigm.
So this is just an incredible story. How did you become an AI researcher?
I was a mathematician and a computer scientist, but in theoretical computer science.
And that started in high school as a kid?
Yes. I was definitely very into math in high school and into computers also later in high school. Yes. I did my studies in Poland. I went for a PhD in Germany.
It was a theoretical computer science and mathematics PhD, so I very much am a mathematician. Yeah. I was always fascinated by, you know, how does this thinking work? What is intelligence? As a child, I always wanted to, like, emulate the brain.
Then I thought, well, okay, maybe higher-level explanations are more interesting. I did research in logic, but also a little programming. But then there was this opportunity to join Google just as deep learning was starting off. I already had my tenured position in France, and the French system has this beautiful thing that you can take a leave of ten years.
Yes. And you can still return anytime you want. So it's no risk.
So at some point, when you've solved AGI, you may return to France and be a professor?
Well, if you solve AGI, they may take you anyway. The nice part about the leave is that they will take you back even if you don't. But it's actually very important. I think there's a number of Nobel Prize winners who took this leave to just try something more risky. And, you know, sometimes it works.
Sometimes it doesn't. There is a lot of luck in science and research, but it's very good to have this opportunity to take it. So I came to Google.
And that was Google Brain at the time, you said. Right?
I came to Ray Kurzweil's group. He was my first manager. He interviewed me and was very inspiring. My first interview was to join, like, the YouTube UI team, and I was like, okay, I'm not going.
And then I had an interview with Ray. I knew him, of course, from his books, and he's a very inspiring person. So I was like, okay, let's go. The team was separate from Google Brain at that time.
Then I moved to Google Brain and worked with Ilya, another very inspiring person. There's an amazing number of great people in AI in the Bay in general.
I have to ask you at this point about the Transformer Paper story, how it all came about. The eight of you, right, seven or eight of you, how did you all get together?
Well, we never got together.
You never got together. Okay.
I recently saw a photo on Twitter from a photo session of all eight of us, and it was said to be fake. I knew it was fake because I don't think all eight of us were ever in the same physical room. These ideas developed from many sides, before and after. Like, Jakob Uszkoreit and Illia Polosukhin worked on attention, the, like, self-attention. Of course, attention was there from the encoder-decoder side with...
And maybe one minute for the broad public on what attention actually means, since it's such a fundamental concept.
So attention is the mechanism that tells the model: as you're doing the next thing, look into your past and find the things you saw in the past that are most similar to what you are seeing right now. It came from the machine translation times, where people wanted to align words in one language with words in another. They were like, okay, so this word, where in this previous sentence would it be? It's an analog of alignment for deep learning.
It's now called attention in AI. It just says, you know, think of what comes to your mind as you are here now, in this environment. What things from the past is this similar to? And this mechanism was already used in deep learning translation before, but the way it was used, there was, like, one encoder model, and the decoder would be looking at the encoder, but never at its own states. The main novelty of the transformer was self-attention.
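A minimal sketch of the mechanism in its scaled dot-product form: each position queries the past and mixes past values by similarity. Single head, causal masking only; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    # x: [seq, d_model]; w_q, w_k, w_v: [d_model, d_head] projection matrices.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.size(-1) ** 0.5            # similarity of "now" to the past
    mask = torch.tril(torch.ones_like(scores))      # a decoder only looks backward
    scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ v            # weighted mix of past values
```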
But the transformer is more than just this idea. I think that's important. I think that's the beauty of these eight weird people who somehow came together, even though not physically, to do it: we all approached it from different sides. So there were people working on the attention idea. Then there is the, you know, you need to put this into a network that needs to have a lot of knowledge.
So there is the feed-forward layer that expands and then contracts. Noam was working on this, and nowadays on mixtures of experts, which actually came before the transformer. So how do you store knowledge in neural networks is another important question, and it's part of this model too. And then, you know, in deep learning, people laugh that ideas are cheap. Making them work is the hard part.
So how do you write the systems and the code and the baselines to actually make this train? And this is funny to say now, because nowadays you can take any deep learning framework and say, you know, x equals transformer, x train, and it will basically work. But back then, it totally did not. So you needed things like learning rate warmup or tweaks to the optimizer to make it just work. And I did a lot of coding, and at that time was working on TensorFlow and parts of the framework.
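For example, the warmup schedule from the original paper looks roughly like this; the constants follow the published formula, as a sketch of the kind of tweak he means.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate from "Attention Is All You Need": linear warmup for the
    first warmup_steps, then inverse square-root decay. Without tricks like
    this, training tended not to converge at the time."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```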
And I remember distinctly that people were like, so you want to use the same model for a few different tasks? Why do you even do that? Like, if you have a different task, if you do translation, you train one model. If you do parsing, you train another. If you do image recognition, you train a third.
You never train the same model for three different tasks. Why do you even write, like, APIs to do multiple tasks on one model? And I was like, no, no. We're gonna do all tasks in one model.
And people were like, no. No.
So there was a lot of pushback against the idea?
Not against the idea. Google was also an amazing place at that time, in that they would very happily let you work on whatever you wanted. But I don't think there was widespread belief in doing multiple tasks with the same model, not to mention, you know, this idea that... I still find this idea amazing, that you take basically the same model as the transformer. Like, now there are a bunch of changes to it, but you could in principle take the same architecture as the decoder from the paper, train it on all of the Internet, and it will basically start chatting with you. Back then, that would definitely have sounded like a worthy dream.
We maybe had it as a dream, but not as a reality you'd expect five years later. It's very lucky that it actually works so well. Right?
Talk about the transition from Google to OpenAI, and perhaps how those two cultures are different.
So Ilya Sutskever was my manager at Brain, and he went on to co-found OpenAI. He asked me a number of times over the years whether I would like to join. I found it a little bit too edgy at the time; then transformers came, so we had a lot of work with that. And then COVID came. And COVID was a tough time for the whole world.
Right? But Google totally closed, and Google was reopening extremely slowly. So one part of it was that I find it very hard to do remote work. I much prefer to work with people directly.
That was one reason. But the other was also that Google Brain, when I joined it, was a few dozen people, maybe 40, something like that. When I left, it was 3,000 or 4,000 people spread across multiple offices. It's very different to work in a small group and to work in a huge company. So with all this, Ilya was like, you know, OpenAI, though, is in a much more stable state.
We're doing language models. You know something about this. That may look like a good match. And I was like, okay. Let me try.
I had never worked in any company other than Google before, other than the university. So it was quite a change to the small startup group, but I like working in smaller groups. It has its pleasures. Right? It has a little bit of a different intensity sometimes.
In general, I found it very nice. On the other hand, Google, you know, has merged, made Gemini, and I hear it's also a very nice place. I think in general the tech labs are more similar to each other than people think. There are some differences, but I think if I look at it from outside, you know, from the university in France, the difference between that university and any of the tech labs is much larger than between one lab and the other.
How are the research teams organized within OpenAI?
They're organized. They're not very organized. I mean, we do organize them, and some, you know, have managers and sometimes talk to them. No, but mostly people find, like, projects; there are things to do.
Right? Like improve your multimodal models, improve your reasoning, improve your pretraining, improve whatever part of the infrastructure. People work on that. You know, as we go through these parts, there is infrastructure, pretraining, reasoning. I think the parts are the same for most of the labs.
There will be teams doing these things. Then sometimes people change teams, sometimes new things emerge. There are always some smaller teams doing, like, more adventurous stuff, like diffusion models at times. Then, you know, some of the more adventurous stuff, like video models, gets big, and then maybe they need to grow.
Do people compete for GPU access?
I don't think it's so much people that compete. I think it's more projects that compete for GPU access. There's definitely some of that. On the other hand, like, on the big picture of GPU access, a lot of this is just determined by how the technology works. Right?
Currently, pretraining just uses the most GPUs of all the parts, so it needs the most GPUs. Right? RL is growing in the use. Now video models, of course, use a lot of GPUs too. So you need to split them like this.
Then, of course, you know, people will be, oh, but my thing would be so much better if I had more GPUs. And I've certainly said that a number of times too. So then you kind of push it. You know, I really need more. And then some people may say, well, but, you know, there's only so much.
There are never enough GPUs for everyone. So there is some part that is competition, but the big part is just decided by how the technology currently works.
Great. What is next for pre-training? We talked about data. We talked about engineering, the big GPU compute aspect of this. What happens to pretraining in the next year or two?
Pretraining, as I said, I think has reached this upper level of the S-curve in terms of science, but it can scale smoothly. Meaning, if you put in more compute, you will get better losses if you do things right, which is extremely hard, and that's valuable. You don't get the same payoff as pushing RL, but it generally just makes the model more capable. And that's certainly something you want to do. I think what people underestimate a little bit in the big narrative is, you know, OpenAI three, four years ago, and I joined even before that, was a small research lab with a product called the API, but it was not such a big thing; there was no GPU constraint on the product side, for example.
All GPUs were just used for training. So it was a very easy decision for people to say, you know, we're gonna train GPT-4. This will be the smartest and largest model ever. And what do we care about small models? I mean, we care about them to, like, debug the training of the big model, but that's it.
So GPT-4 was the smartest model, and it was great. Right? But then it turned out, oh, there is this ChatGPT, and now we have a billion users. And, you know, people want to chat with it a lot every day, and you need GPUs. So you train the next, like, huge model, and it turns out you cannot satisfy this.
Like, people will not want to pay you enough to chat with the bigger model. So you just economically need the smaller model. And this happened, of course, to all the labs, because the moment the economics arrived and it became a product, you had to start thinking about price much more carefully than before. So I think this caused the fact that instead of just training the, you know, largest thing you can for the money you have, we said, well, no, we're gonna train, like, the same thing, same quality, but smaller, cheaper.
The pressure to give the same quality for less money is very large. In some instances, as a researcher, it almost makes me a little sad. I have, you know, a big love for these huge models. You know, people say the human brain has 100 trillion synapses, and the orders of magnitude, of course, are not exactly calculated, but our models don't have 100 trillion parameters yet. So maybe we should reach it. I would certainly love to, but then you need to pay for it.
And so I think this may be why people kind of think that pretraining has paused: because a lot of effort went into training smaller models. Now, on the side, people kind of rediscovered how amazing distillation is. Distillation means you can train a big model and then put the knowledge from the big model, the big model is a teacher, into the little model. People knew about distillation. There's a paper from, you know, a long time ago.
But somehow, at least for OpenAI, and I think maybe it was more in Google's DNA, with Oriol there, people kind of rediscovered how important that is for the economics. But now it also means that, oh, training this huge model is actually good, because you distill all the little ones from it. So now maybe there is a bit more of a return to that. It's also a matter of, you know, once you realize you have the billion users and you need the GPUs, you need to invest in them.
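A minimal sketch of the distillation idea, in the spirit of the old paper he alludes to: the small student matches the big teacher's whole output distribution instead of only the hard next-token label. The temperature and weighting below are illustrative choices.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # student_logits, teacher_logits: [batch, vocab]; targets: [batch]
    # Soft targets: match the teacher's softened next-token distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still predict the actual next token from the data.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```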
Of course, everyone sees this. There's a huge investment, but the GPUs are not online yet. So when they come online, and I think this may play into what people call the resurgence of pretraining, we both understand that you can distill this amazing big model, and there are now enough GPUs to actually train it. So it's resurging. But all of this fundamentally happens on the same scaling curve.
Right? It's not like we didn't know that you could do this. It's more like the different requirements of different months have sometimes changed the priorities. But I think it's good to step back from it and think of the big picture, which is that pretraining has always worked. And the beautiful thing is, it even stacks with RL.
So if you run this thinking RL process on top of a better model, it works even better than if you run it on top of a smaller model. So...
One question that I find fascinating as I hear you speak: the evolution of modern AI systems has been this combination of LLMs plus RL plus a lot of other things going on. It used to be, at some point, maybe back in the deep learning days, that people would routinely say that they understood how AI worked at a micro level, like the matrix multiplication aspect, but didn't fully understand, once you put everything together, what really happened at the end of the day in the model. And I know there's been tons of work done on interpretability over the last couple of years in particular, but particularly for those very complex systems, is it increasingly clear what the models do, or is there some element of black box that persists?
I would say both. There is huge progress in understanding models. Fundamentally, I mean, think of the model that is ChatGPT. It talks to a billion people about all kinds of topics. It gets this knowledge from reading all of the Internet.
Obviously, you cannot identify... like, I cannot understand everything that's going on in there. I don't know the whole Internet. What we can identify... there was a beautiful paper just, I think, last week from OpenAI where you tell the model that lots of its weights should be zeros, that it should be very sparse, and then, when it's thinking about one particular thing, you can really trace what it's actually doing. So if you say, let's limit ourselves to this and really study what's inside the model, then you can get a lot of understanding. And there are circuits in the model.
Anthropic had great papers on that. So the understanding of what the models are doing on a higher level has progressed a lot, but it's still an understanding of what smaller models do, not the biggest ones. It's not so much that these patterns don't apply to bigger models. They do. It's just that the bigger models do so many things at the same time that there is some limit to what you can understand.
But I think this limit is a bit more fundamental than people think. It's like every very complex system: you can only understand so many things, and then you don't. Right?
Thank you for all of this. I'd love now to talk about 5.1 and do a little bit of a deep dive on all the latest stuff that you guys have released in the last couple of weeks, which has been very impressive, in particular as a user. I think the 5.1 moniker doesn't do justice to the evolution between 5 and 5.1. It feels like a much larger improvement than the number would indicate, again from my perspective as a user. Walk us maybe through the evolution from GPT-4 to 5 to 5.1.
What has actually changed?
That's a very tough question. I think less than you think. No, I mean, from GPT-4 to 5, I think the biggest thing that changed is reasoning, meaning RL and synthetic data. As I told you, the pretraining part in that time frame was mostly about making things cheaper, not making things better.
So, of course, the price has changed dramatically too. Right? A thousand times, I think, or some orders of magnitude like that. The main improvement from 4 to 5 is adding reasoning with reinforcement learning, and this allowed us to generate synthetic data, which also improves the model. So that's the big picture.
In addition to that, ChatGPT is now a product used by a lot of people. So the post-training team has learned a tremendous number of lessons. You know, things were clearly experimented with: they wanted the model to be very nice to you, and then it turned out to be too nice. And now, when a lot of people use it, you need to be really careful about safety. Right? There may be people in distress using the model.
The model needs to do something reasonable in these cases. It was not trained for it before. Now it is, and it makes the model much better. But, you know, at the same time, you don't want to refuse to answer any question that has any sign of anything. So as you work on these things, you make the model much better in use, not just for the people in distress, but for everyone who wants questions answered and the answers to be reasonable.
And, you know, the thing called hallucinations? It's still with us to some extent, but dramatically less than two years ago. Some part of that is because reinforcement learning can now use tools and gather data, and it also encourages the model to, you know, verify what it's doing. So that's an emergent thing from this reinforcement learning of reasoning. But also, you just add data, because you realize sometimes the model should say, I don't know.
So you add this to the post-training data. You say, like, well, we really need to give some thought to how the model should answer people in various situations. The step from 5 to 5.1 is mostly this kind of thing; it's mostly post-training improvement.
Yes. So to double-click on this, because it is super interesting. Indeed, as part of 5.1, there's the ability to choose different kinds of styles, from nerdy to professional, and that's, I guess, in reaction to the fact that some people were missing the sycophantic aspects of earlier models when GPT-5 came out. So adding more tones, that's all post-training stuff. So do you tell the model, these are examples of how you should be responding, which is more like a sort of supervised training kind of paradigm, or is that RL, like, right or wrong with rewards?
How does that work?
I don't work on post-training, and it certainly has a lot of quirks. But I think the main part is indeed RL, where you say, okay, is this response cynical? Is this response like that? And you say, okay.
If you were told to be cynical, this is how you should respond. If you were told to be funny, try this one. So I do think the RL is a big part of it.
Between models, or different versions of the models, are the releases aligned with pretraining efforts? Or do you sometimes have, like, one big pretraining effort and several models that come out based on that?
There used to be a time, not that long ago, half a year, the distant past, where the models did have an alignment with technical stuff. Right? So they would align either with RL runs or with pretraining runs. That's why you had a beautiful model called 4o, which was aligned with a pretraining run, which was obviously worse than o3, aligned with an RL run that was the follow-up to o1, naturally, because you couldn't use the name o2. But it was slightly better than o4-mini, because that one was mini.
And, you know, we had this beautiful model picker, and people kind of thought this was not the best naming, for whatever reason. So, no, I mean, it was fairly obvious that this was very confusing. Right? So now the naming is by capability.
Right? GPT-5 is a capable model. 5.1 is a more capable model. Mini is the smaller model that's slightly less capable but faster and cheaper, and the thinking models are the ones that do more reasoning. Right?
In that sense, the naming is detached from anything technical. In particular, you know, 5.1 is maybe just a pretraining, sorry, post-training thing, but maybe 5.2 is the newly pretrained model, or maybe not. But the naming is detached from the technology, which also gives some flexibility. You know, as OpenAI has grown, there are a number of projects. Right? There is RL and pretraining, and there may be, you know, something just to make slides better or whatnot. And with distillation, you have the ability to put a number of projects into one model. It's kind of nice that you don't need to wait on all of them to complete at the same time, and you can periodically put this together, actually make sure that there's a product that's nice to the users and good, and do this separately from, you know, waiting on the new full pretraining run that takes months and so on.
So I feel like, even though, you know, a little tear in my eye goes to the times when it was the pretrained model number that was the number, as it's a product serving a billion users, it's maybe inevitable that you should name it by what the user should expect from it rather than...
In 5.1, you have additional granularity in terms of telling the model how long it should think. By default, how does the model decide how long it should think?
So the model sees the task. It will decide on its own, a little bit, how long it should think. But you can give it additional, it's trained with additional, information that it should think even harder, and then it will think longer. So you now have the ability to steer that. I still think it is important to realize this.
This is the fundamental change that came with reasoning models: using more tokens to think increases your capability, and for a given amount of computation, it increases it way faster than pretraining. Right? So if you give GPT-5 the ability to think for long, it can solve tasks that are, you know... we had these gold medals at the Mathematical Olympiad and the Computer Science Olympiad. So those are amazing abilities. At the same time, the fundamental training method of reasoning is very limited to science data.
So it's not as broad as the pretraining. I think, like, pretrained models felt kind of almost uniformly good or bad at things. I mean, this was still not uniform, because it's not like teaching humans. Right? But the reasoning models are even more, people call it, jagged. Right?
They have amazing capabilities somewhere, and then, close by, not so much, and that can be very confusing. It's something I always love, because it's weird: you can say the model is amazing at the Mathematical Olympiad. At the same time, I have a daughter in first grade, she's five years old, and I have her math book. I took one exercise from this math book, and none of the frontier models is able to solve it.
And you would be able to solve it in ten seconds. So that's something to keep in mind. Models are both amazing, and there are tasks that they cannot do very well. I can show you this as an example. I think it's quite interesting to keep in mind.
Let me start with Gemini 3, just to blame the competitors.
Yes, please.
So it has, you see, two groups of dots on both sides. And the question is: is the number of dots even or not? And if you look at it, you see, oh, they're, like, two identical things, so that would be even. That's what the five-year-old is supposed to learn. But there is one dot that's shared.
So now it must be odd. For this simple one, which has, like, you know, I don't know, 20 dots or so, Gemini 3 actually does it. Right? It finds out that it's an odd number of dots and it says that, and that's great. And then you have another puzzle, which is very similar, except now there are two mountains of dots, and there's also one dot shared at the bottom.
And right in context, right after that, you ask, okay, how about this one? And then it does some thinking, it just totally misses that there is a shared dot, and it says the number is even. And it's like, in context, where you've seen this first example, how would you ever miss that? You know? And here is the exact same prompt for GPT-5.1 Thinking, and it also solves the first one.
It sees the dot. It says it's odd. And then it sees the mountains, and somehow it doesn't see the dot, and it says it's even. The nice thing is, if you let it think longer, or if you just let it think again, it will see it. So if you use GPT-5 Pro, it takes fifteen minutes. You know, the human five-year-old takes fifteen seconds.
GPT-5.1 Pro will run Python code to extract these dots from the image, and then it will count them in a loop. So that's not quite...
And why is that? What trips up the model?
I think this is mostly the multimodal part. The models are just starting. Like, you see, the first example they manage. So they've clearly made some progress, but they have not yet learned to do good reasoning in multimodal domains. And they have not yet learned to use one piece of reasoning in context to do the next piece of reasoning.
What is written in context, you know, learning from it in context happens, but learning from reasoning in context is still not very strong. All of these, though, are things that are very well known, and, like, the models are just not trained enough to do this. It's just something we know we need to add into training. So I think these are things that will generally improve. I do think there is a deeper question of whether... so, you know, multimodal will improve.
This will improve. Like, we keep finding these examples. So as the frontier moves, it will certainly move forward. Some things will smoothen. But the question is, will there still be other things, things that you wouldn't need to, you know, teach a human? Like, okay:
now you know how to use a spoon and a fork, but if the fork has four prongs instead of three, then you need to learn anew. That would be a failure of machine learning. You know, I am fascinated by generalization. I think that's the most important topic. I always thought this was the key topic in machine learning in general and in understanding intelligence.
Pretraining is a little different, right, because it increases the data together with your increase in model size. So it doesn't necessarily increase generalization. It just uses more knowledge. I do believe that reasoning actually increases generalization, but right now we train it on such narrow domains that it remains to be seen. But I think the big question in all of AI is: is reasoning enough to increase generalization, or do you need, like, more general methods?
I think the first step is to make reasoning more general, as we talked about before. That's my passion. That's also what I work on. There is still something there. Right?
We push the models. They learn things that are around what we teach them. They still have limitations, because they don't live in the physical world, because they're not very good at multimodal, because reasoning is very young and there are a lot of bugs in how we do it yet. But once we fix that, there will be this big question: is that enough, or is there something else big needed to make models generalize better?
So that we don't need to, you know, teach it every particular thing in the training data; it just learns and generalizes. I think that's the most fascinating question in AI, but I also think a good way to approach a question like that is to first solve everything that leads up to it. You know, you cannot know whether there is a wall or not until you come close to it, because otherwise, you know, AI is moving very fast. Someone said it's like driving fast in a fog. You never know how far or close you are. So we're moving.
We are learning a lot.
And does that mean... so, that central question of basically learning with very little data the way a child would, and the fact that a child is able to do things that even the most powerful model cannot do. So, as you said, to unpack this: making progress on reasoning and showing how far we can get on generalization with reasoning. And then the separate question is, as you said, whether we need an entirely different architecture, and that's where we get into, for example, Yann LeCun's work. Do you see promising fundamental architectural changes outside of transformers that have caught your attention and feel like they could be a serious path to explore in the future?
I think there is a lot of beautiful work that people are trying out. You know, the ARC challenges inspired one set of people whose models now are very small and solve them very well, but with methods that I'm not sure are actually general; we need to see. Yann LeCun has been pushing for other methods. I feel like his approach is more towards the multimodal part.
But maybe, no, maybe if you solve multimodal right, maybe if you do JEPA, it also helps your other understanding. There are a lot of people pushing fundamental science. It's maybe not so much in the news as, you know, the things that push the frontier. But whatever you do, you know, it will probably run on some GPU.
If you get a trillion dollars of new GPUs, the old GPUs will be much easier to get too. So I think this growth in LLM AI on the more traditional side is also helping people have an easier time running more experimental research projects on various things. So I think there is a lot of exploration, a lot of ideas. It's still a little hard to implement them at a larger scale. The engineering part is the biggest bottleneck. I mean, GPUs are a bottleneck too when you really scale up.
But implementing something that's larger than one machine, when it's an experimental research project, so you don't have a team to do that, I think that's still harder than it should be. But, you know, Codex may get there, or other coding models. This is the thing where AI researchers have great hope to help themselves and also other researchers: if you could just say, hey, Codex, this is the idea, and it's fairly clear what I'm saying, please just implement it so it runs fast on this eight-machine setup or a hundred-machine setup.
That would be amazing. It's not quite capable of doing that yet, but, you know, it's capable of doing more and more of it. I think that's what OpenAI says: you know, we say we'd like an AI intern by the end of next year. That's how I understand it. You know?
Can someone help us?
And is part of the path for Codex to be able to do some of this, does that revolve around how long it can run? Context behind the question being that, again, like two days ago as we recorded this, you guys released GPT-5.1-Codex-Max, described as a frontier agentic coding model trained on real-world software engineering tasks, designed for long-running workflows, and using compaction to operate across multiple context windows and millions of tokens. So I'd be interested in unpacking some of this. What does it mean to run for a very long time? Is that an engineering problem or a model problem?
And then maybe a word on compaction.
So it is both an engineering and a model problem. You know, you want to do some engineering task, like, you have some machine learning idea. You want Codex to implement it for you, test it on some simple thing, find the bugs. So it needs to run this thing. This is not something you would do in an hour.
Right? That's something you'd spend a week on. So the model needs to spend a considerable amount of time, because it needs to run things, wait for the results, then fix them. The model is not going to come up with the correct code out of thin air. Right?
It's just like us. It needs to go through the process. And oftentimes in the process, since it was not trained on anything very long, or maybe on very few such things, but certainly nothing that went on for a week, it can get lost. It can start doing loops or doing something weird. That's, of course, not something you want.
So so so we try to train in a way that makes it not happen, but but it does. So so, you know, how can you make the model actually run a process that requires this larger feedback loop without tripping cap? And the other thing is transformers have this thing called context. So they they remember all the things that they have seen in in the current run, and that can just exceed the memory available for your run. And and the attention matrices are are n by n where n is this length, so so they can get huge.
So instead of keeping everything, you say, well, I'm gonna just ask the model on the side to summarize the most important things from the past, put that in context, and forget some part of the rest. Right? So compaction is a very basic form of forgetting. Right? And that allows you to run for much longer if you do this repeatedly.
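To make the idea concrete, here is a minimal sketch of a compaction loop. It assumes `llm(messages)` stands in for a chat-completion call and `run_tool(action)` for whatever executes the model's actions; the token counting, the budget, and the message format are all simplified for illustration and are not how any production system does it.

```python
# Minimal sketch of context compaction in an agent loop (illustrative only).
# `llm` and `run_tool` are assumed callables, not any specific API.

MAX_TOKENS = 8_000   # hypothetical context budget
KEEP_RECENT = 10     # keep the most recent turns verbatim

def count_tokens(messages):
    # Crude approximation; a real system would use the model's tokenizer.
    return sum(len(m["content"].split()) for m in messages)

def compact(messages, llm):
    """If the history is too long, summarize the older part and
    replace it with a single summary message (a basic form of forgetting)."""
    if count_tokens(messages) <= MAX_TOKENS:
        return messages
    old, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = llm([{"role": "system",
                    "content": "Summarize the key facts, decisions, and open "
                               "tasks from this conversation."}] + old)
    return [{"role": "system",
             "content": f"Summary of earlier work: {summary}"}] + recent

def agent_loop(task, llm, run_tool, max_steps=1000):
    """Run a long task: act, observe results, fix, and compact as needed."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        messages = compact(messages, llm)   # forget, then keep going
        action = llm(messages)
        if action == "DONE":
            break
        result = run_tool(action)           # e.g. run code, read the logs
        messages.append({"role": "assistant", "content": action})
        messages.append({"role": "user", "content": result})
    return messages
```

The point of the sketch is just the shape of the loop: the model keeps acting on feedback for far more steps than would fit in one context window, because older steps keep getting squeezed into summaries.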
But, again, you need to train the model to do that. And when you do, it works to some extent. I don't think it works well enough to replace an AI researcher yet, but it has made a fair bit of progress. I think another part of progress that's a little understated on the research side, but is very important, is allowing the model to connect to all of these things.
So models now use tools like web search and Python routinely, but to run on a GPU, to have access to a cluster? It's hard to train models with that because then you need to dedicate those resources for the model to use, and that has security problems. And this question of how models connect with the external world is fundamentally very hard, because, you know, when you connect in an unlimited way, you can break things in the real world. And we don't like models to break things for us.
So that's a part where people work a lot. It overlaps with security. Right? You need to have very good security to allow models to go out and train on the things they need to train on.
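One common pattern for giving a model limited, auditable access to the outside world is a fixed allowlist of tools rather than an open shell. The sketch below is only an illustration of that idea; the tool names, the `/workspace/` restriction, and the timeout are hypothetical choices, not any lab's actual setup.

```python
# Illustrative allowlist-style tool execution for an agent (not a real API).
import subprocess

ALLOWED_TOOLS = {
    # Run a short Python snippet with a hard timeout.
    "run_python": lambda code: subprocess.run(
        ["python", "-c", code], capture_output=True, text=True, timeout=60
    ).stdout,
    # Only allow reads inside a dedicated workspace directory.
    "read_file": lambda path: (
        open(path).read() if path.startswith("/workspace/") else "denied"
    ),
}

def execute_tool_call(name, argument):
    """Run only vetted tools; refuse anything else and never crash the loop."""
    tool = ALLOWED_TOOLS.get(name)
    if tool is None:
        return f"Tool '{name}' is not allowed."
    try:
        return tool(argument)
    except Exception as err:
        return f"Tool error: {err}"
```

Real deployments layer much more on top, such as containers, network policies, and audit logs, which is exactly the security work alluded to here.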
One theme that people like me, VCs and, you know, founders and startups, think about a lot as we see all the progress at OpenAI is this: the models keep getting more general, with more agentic capabilities, the ability to run for a very long time, going into areas like science and math. Recently it was reported that some investment bankers were hired to help improve the model's capability to do grunt investment banking work. All of that taken into account, are we heading toward a world where basically models, or maybe just one model, do everything? And I don't know if that's AGI, let's not necessarily go into that debate, but what's left for people who build products that sit on top of models?
I just showed you an exercise for five-year-olds that the model doesn't do. I think we need to keep that in mind.
You're saying there's hope?
There is hope the next model will do it.
Right? That's the hope. Okay.
Well, for me, yes. I still think we have some way to go on the models. Progress has been rapid, so there is good hope there'll be less and less of this. But on the other hand, for now, you don't need to do a deep search to find things where you'd really want a human to do the task because the model is not super good at it. On the other hand, you know, the transformer paper started with translation.
I recently went to a translation industry conference. The translation industry has grown considerably since then. It has not shrunk. There are more translations to be done. Translators are paid more.
The question is, why would you even want a translator if the model is so good in most cases? The answer: sometimes, imagine you do a listing for a newspaper, but in a language you don't know. GPT-5 will almost certainly translate it correctly for you if it's into Spanish or French or any high-resource language. But would you still publish it without having a human who speaks that language look at it?
Would you publish it if it's, you know, the UI of ChatGPT that a billion people are going to see? It's a question of trust. Probably, right? But if you have a million users, a billion users, maybe you will pay the $50 for someone to just look it over before you publish the translation. So this is an industry that fundamentally is totally automated.
Right? There is still the question of trust, and I think that's a question we will grapple with for a long time. There are also just things you want a person to do. So I don't think we will have nothing to do, but that doesn't mean some of the things we do won't change dramatically. Mhmm.
And that, you know, can be very painful for the people who do those things. So this is a serious topic that I'm happy people are engaging with. But I don't think there will be this global lack of anything for people to do.
And maybe as a last question, to help us get a sense for what people at the frontier of AI are currently thinking about or working on: some of the topics one might hear about are things like continual learning, world models, robotics, embodied intelligence. In addition to what you mentioned upfront, multimodal, what do you personally find really interesting as a research area?
Well, you know, I keep coming back to this general-data reinforcement learning. That's my pet topic and, luckily, what I work on. But, for example, robotics is probably just an illustration that we are not doing that well in multimodal yet and that we're not doing that well in general reasoning yet. The moment we do really well in multimodal and we manage to generalize reasoning to the physical domains that a robot needs, I think it will see amazing progress. And I have a feeling, given that, you know, a lot of companies are launching hardware that's kind of teleoperated or glove-operated or something like that.
So my suspicion is that by the time we make this progress, which, you know, maybe will be next year, maybe in a few more years, the hardware world may be there, and having a robot in the home may be a big visible change. Maybe more visible than, you know, chat. Although, given how quickly we got used to the self-driving cars in San Francisco, maybe it will only be visible for, like, the first two days, and then it'll be like, yeah, sure, the robot's there.
It's always been cleaning for as long as I can remember, like, the last three months. It's stunning to me how quickly we get used to these things. Right? The self-driving cars in San Francisco are something people got used to so quickly. So maybe this will happen for robots too.
Nevertheless, I do think it will be quite dramatic in our perception of the world when it happens. Hardware is hard, though. Right? Robots may have accidents in the house. You need to be very careful.
So maybe it will take longer to deploy them and actually make it a scalable business. We'll see. It is amazing that, you know, we are at the point where we can start thinking, yes, maybe that will come soon.
Lukas, it's been absolutely wonderful. Thank you so much for spending time with us today.
Thank you very much, Matt. Thank you for the invitation. Great to talk to you.
Hi. It's Matt Turk again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing, if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode on. This really helps us build the podcast and get great guests.
Thanks and see you at the next episode.