Gemini 3 was a landmark frontier model launch in AI this year — but the story behind its performance isn’t just about adding more compute. In this episode, I sit down with Sebastian Borgeaud, a pre-t...
If I'm being honest with myself, I think we're ahead of where I thought we could go. We're not really building a model anymore. I think we're really building a system at this point. What might be happening instead is kind of a shift in paradigm where before we were scaling in the data-unlimited regime, and we're shifting more to a data-limited regime, which actually changes a lot of the research and how we think about problems. I don't really see an end in sight for that kind of line of work to continue giving us progress.
Hi, I'm Matt Turck. Welcome to the MAD Podcast. My guest today is Sebastian Borgeaud, pre-training lead on Gemini 3 at Google DeepMind. Sebastian is one of the top AI researchers in the world and a member of the Metis list, and this is a particularly special episode because it's his first podcast ever.
We talked about how Gemini 3 is built under the hood, the shift from an infinite-data world to a data-limited regime, how research teams at DeepMind are organized, and what's next for AI. Please enjoy this great conversation with Sebastian. Sebastian, welcome.
Thank you. Hi, Matt.
So I was hoping to start this conversation with this tweet from Oriol Vinyals, who's the VP of Research at Google DeepMind and the Gemini co-lead, who said when Gemini 3 came out that the secret behind the model was remarkably simple: better pre-training and better post-training. Which, when you think about the leap that Gemini 3 represented over the prior state of the art, sounds remarkably modest. So I was curious about your perspective. Is it, in some ways, as simple as that?
Yeah. I'm not sure it's a big secret, at least from my perspective. This seems quite normal. I think people sometimes have the expectation that from one Gemini version to another there's a big thing that changes and that really makes a big difference. In my experience, there's maybe one or two of those things that make a larger difference than other things, but it's really a combination of many, many changes and many, many things from a very large team that actually makes Gemini 3 so much better than the previous generations of Gemini.
I think this is probably a theme that will recur later, but it's really a large team effort that comes together in a release like Gemini 3.
What does that tell us in terms of where we are in AI progress? What sounds from afar like just turning some knobs gives us such a leap. What does that mean in terms of what we can expect going forward?
There are two things. The first one is that it's still remarkable how much progress we're able to achieve in this way; it's not really slowing down. There are so many of these knobs and so many improvements that we find almost on a day-to-day basis that make the model better. So that's the first point. The second point is we're not really building a model anymore.
I think we're really building a system at this point. People sometimes have this view that we're just training a neural network architecture and that's it. But it's really the entire system around the network as well that we're building collectively. And so that's the second part.
The big question on everybody's mind is what does that mean in terms of actual progress towards intelligence? And we don't necessarily need to go into the whole AGI thing, because who knows what that means. But is the right way to think about this kind of model progress as an actual path towards intelligence, versus, you know, trying to succeed on this benchmark or that other benchmark? What gives you confidence that the core model is getting smarter?
The benchmarks definitely keep improving. And if you look at the problems and how the benchmarks are set up, they are becoming increasingly difficult. Even for me, who has a background in computer science, some of the questions and answers would take me a significant amount of time to answer. This is just one view. It's the benchmark view, and there's some amount of: we evaluate those frequently, etcetera.
We're being very careful about holding out the test set. But still, there are often fears of overfitting to those and just chasing benchmarks, as people call it. That's one aspect. I don't think those fears are very well founded. The second aspect, and that's the one that really fills me with confidence, is that the amount of time people spend using the model to make themselves more productive internally is increasing over time.
With every new generation of models, it's pretty clear the model can do new things and help us in our research and our day-to-day engineering work much more so than the previous generation of models. So that aspect should give us confidence as well that the models are becoming more capable and actually doing very useful things.
I'm always curious, as an AI researcher who's so deep into the very heart of all of this: if you zoom out, are you still surprised by where we are? Like, from your perspective, are we well ahead of where you thought we would be a few years ago? Are we on track? Are we behind, possibly?
I think it's easy to say we're on track, but in hindsight, if I'm being honest with myself, I think we're ahead of where I thought we could go. Having started working on LLMs in 2019 or 2020, it's kind of hard to believe the scale of everything we're doing, but also just what the models are capable of doing today. If you looked at scaling laws back then, they were definitely pointing in that direction, and some people really believed in that deeply. I'm not sure I would have bet a lot on that actually materializing and on being where we are today. So one interesting question that follows from this is: where does that take us?
If we assume the same kind of progress we've seen in the last five years, I think what's going to happen in the next few years is going to be very, very cool as well.
What do you think on that front? Does that mean AI comes up with novel scientific discoveries, AI wins the Nobel Prize? Like, where do you think we are going in the short term, sort of two to three years?
I think, yeah, that's part of it. On the science side, DeepMind historically has done a lot of work, and for sure, there's a lot of work in that direction as well. And I think we will be able to make some large scientific discoveries in the next few years. That's one side. On the other side, in my day-to-day work as well, both research and engineering, I'm very excited about how we can use those models to make more progress, but also to better understand the systems we're building and develop our own understanding and research further.
Yeah, there's this big theme in the industry about automation of AI research and engineering, which, if you extrapolate it, leads into AI 2027 kinds of scenarios where there's a discontinuity moment. Just at a very pragmatic level, what does that mean, using AI for your own work today, and what do you think that's going to mean in a couple of years?
I think it's not so much about automation, but more about making us go faster and spending more of our time on the research part, at maybe a slightly higher level. A lot of the day-to-day work and research on language models is dealing with quite complex and large systems at the infrastructure level. So actually quite a bit of time is dedicated to running experiments, babysitting experiments, analyzing a lot of data, collecting results, and then the interesting part is forming hypotheses and designing new experiments. The last two parts, I think, are something we'll stay very much involved in. The first part, I think, especially in the next year with more agentic workflows being enabled more and more, should be able to really accelerate our work.
Is your sentiment that the various frontier AI labs are effectively all working in the same direction, doing the same thing? You know, one fantastic but in some ways perplexing thing that we all experience as industry participants and observers is this obvious phenomenon where every week or every other week or every month, there seems to be another, you know, fantastic model, and we're completely spoiled. So Gemini 3 just came out, and at the same time, like, two hours ago, literally before we were recording this, GPT 5.2 came out. What do you make of that from your perspective, and how do you think that plays out? Is anybody gonna break out, or is the industry effectively gonna continue with the handful of top labs plus some newer labs that are appearing?
Well, on the first question, there are definitely similarities between what the different labs work on. I think the base technologies are kind of similar. I'd be surprised if we weren't all training transformer-like models, for example, on the architecture side. But then there's definitely specialization happening on top of that, and different branches in the tree of research that are being explored and exploited by the different companies. I think historically, for example, DeepMind has been really strong on the vision and multimodal side, and that continues to be the case today; it shows both in how people use the model and in the benchmarks, of course.
And then, yeah, there are things like reasoning, etcetera. OpenAI came up with the first model, but we also had a strand of research on that. So there are similarities, but it's not exactly the same, I would say. For the second question, I don't know if I have a good answer. One thing that's clear is that to make progress on a model like Gemini today, you do need a very large team and a lot of resources.
Now, that doesn't necessarily mean that what we're doing today is optimal in any form, and some disruptive research could definitely come along and allow a smaller team to actually take over in some form. This is one of the reasons why I actually enjoy being at Google so much: Google has this history of doing more explorative research and has a really high breadth of that research, and that continues to be the case, mostly in parallel to Gemini, but we're definitely able to also utilise that and bring some of those advances into Gemini.
Are there other groups, whether at DeepMind or elsewhere in the industry, that are working in semi-secret or complete secret on post-transformer architectures, so that one day something will come out and we'll all be surprised? Are there groups like that in the industry?
I believe so. There are groups doing research on the model architecture side for sure, within Google and within DeepMind. Whether that research will pan out, it's hard to say, right? It is research, so very few research ideas work out.
And so in the meantime, the core advantage that one company may have over another is just the quality of people. In the case of Google, I guess, the vertical integration: that tweet from Oriol that I was mentioning got quote-tweeted by Demis Hassabis, and he was saying that the real secret was a combination of research and engineering and infra. So is that the secret sauce at Google, the fact that you guys do the whole stack?
It definitely helps. I think it's an important part. Research versus engineering is also interesting. I think over time that boundary has blurred quite a lot, because we're working on these very large systems now. Research really looks like engineering and vice versa.
I think that's a mindset that has really evolved over the last few years at DeepMind especially, where maybe there was a bit more of the traditional research mindset before. Now, with Gemini, it's really more about research engineering. The infrastructure part is also very important. We are building these super complex systems. Having infrastructure that's reliable, that works, that's scalable is key in terms of not slowing the research engineering down.
And Gemini 3 was trained on TPUs, right, not on NVIDIA chips? So it's truly...
That's correct.
Fully integrated. Okay. So I'd love to do a deep dive on Gemini 3.
But before we do that, let's talk about you a little bit. So you are the pre-training lead on Gemini 3. What does that mean? And then let's go into your background and your story.
I am one of the Gemini pre-training leads. What this entails is a mix of different things. Part of my job is actual research, trying to make the models better; these days it's less about running experiments myself and more about helping design experiments and then reviewing results with people on the team. That's the first part. The second part, which is quite fun, is more of the coordination and integration.
It's a fairly large team at this point. It's a bit hard to quantify exactly, but maybe 150 to 200 people work day to day on the pre-training side, between data, model, infrastructure, and evals. So coordinating the work of all of these people into something that we can build together is actually quite complicated and takes quite a bit of time, especially to do well. To me this is super important, because being able to get progress out of everyone is really what makes us make the most progress, rather than enabling one or two people, or a small group of 10 people, to run ahead of everyone else. That might work for a short period of time, but over longer periods of time, what's really been successful for us is being able to integrate the work of many, many people.
So in terms of your personal background, I'm always curious: where did you grow up? What kind of kid and teenager were you? I'm always trying to reverse engineer, you know, top AI researchers: where do they come from and how did they become who they are? So why did you become who you are?
I grew up a bit all over the place in Europe. I moved around quite a bit. I was actually born in the Netherlands, and I moved when I was seven to Switzerland. My dad is from Switzerland and my mom is from Germany. So I did most of my school and the beginning of my high school in Switzerland, mostly in French and partly in German.
And then at age 15, I think, I moved to Italy, where I finished my high school around when I was 19. At that point I was going to go to ETH Zurich to do my studies, but just by random events, one morning I looked up the top universities in some kind of ranking and I saw Cambridge was at the top, so I thought I'd just apply, why not? And yeah, a few months later I got the acceptance letter, so I decided to move to Cambridge, where I did my undergrad and master's in the Computer Laboratory.
Yeah. And growing up, you were just a super math-strong, computer-science kind of kid?
My dad has a technical background. I remember, when I was 10 or 11, starting to program a bit with him and learning. And I kind of always liked that. I always found math and science easy at school. I remember never having to really study for math exams but always doing quite well.
That definitely changed at university. But that was, yeah, that was my high school experience.
Great. And what was your path from school into where you are today?
Yeah. So that's, again, a bit of a lucky moment, I would say. One of the lecturers we had in my master's was also a researcher at DeepMind. And I just remember, at the end of the last lecture, I was packing my stuff and I thought, you know what? I'll just ask him for a referral.
What's the risk, right? He might just say no, but whatever. And so I took the courage and went up to him and asked if he would give me a referral. Sure enough, he was like, sure.
Send me your CV and I'll see what I can do. And that's kind of how I got my interview at DeepMind. This was in 2018. And I joined DeepMind at the time, just DeepMind, not Google DeepMind, as a research engineer after university.
And what did you do at first, and how did that evolve into being one of the pre-training leads on Gemini 3?
Yes. So at the beginning, having joined DeepMind, and DeepMind being known for RL, the first project I managed to work on, or decided to work on, was something on the RL side. Specifically, we were training an unsupervised network to learn key points in Atari environments and trying to get the agent to play Atari. I did this for about six months, maybe.
It wasn't enough for me, in the sense that I didn't like the synthetic aspect of it. I always wanted to work more on real-world data and have more of a real-world effect. I think in general I like to build things, and build things that work. I don't really like the academic, pure research part. And so that drove me to start working on representations, so creating or training these neural networks that have good representations for different tasks.
One funny anecdote here, something I tell a lot of people on my team: the first effort I joined on this was called representation learning from real world data. At the time we had to add this "from real world data" to the name of the project, because otherwise people would assume it was synthetic environments or synthetic data. And that has definitely shifted completely since then. So yeah, that was kind of my first project on that side, and specifically on LLMs and transformers. We were looking at architectures like the transformer and models like BERT and XLNet that were learning these representations, trying to improve those representations and do research on that side.
Great. And then you worked on Retro, right? Do you wanna talk about that?
Yeah. So after that, we started working on scaling up LLMs in general. We started this work first on Gopher, which is, I think, the first DeepMind LLM paper that was published. Already at that point, it was a team of maybe ten, twelve people. So already at that point, it was pretty clear you couldn't just do that research on your own.
And this is really where I started doing pre-training, and pre-training at scale, and developed my research taste, but also what I enjoy about this. So we trained the first dense transformer model, which was 280 billion parameters, trained on I think 300 billion tokens at that time. We would definitely not do things the way we were doing them back in the day, but it was great and a very fun learning experience. After that, there were two projects that emerged. The first one was Chinchilla, and the second one, Retro. In Chinchilla, we were reexamining how you should scale the model size and how you should scale the data, especially from a training-compute-optimal perspective.
The question is: you have a fixed amount of training compute, how do you train the best possible model? Should you increase your model size or should you increase your data size? There was some previous work in this domain, from OpenAI specifically, that we reexamined, and we actually found that you want to scale the data side much more quickly than what was thought before, rather than scaling the model side. Funnily enough, this is still really relevant in our day-to-day work today, especially because it has a lot of implications for the serving cost and how expensive it is to use the models once they're trained. So that was one side.
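To make that concrete, here is a minimal sketch of the compute-optimal rule of thumb associated with the Chinchilla result. It assumes the common approximation that training FLOPs scale as roughly 6 times parameters times tokens, and the published finding that parameters and tokens should grow in roughly equal proportion, around 20 tokens per parameter; the function name and constants are illustrative, not Gemini's internal numbers.

```python
# A minimal sketch of the Chinchilla-style compute-optimal rule of thumb.
# It assumes the common approximation C ~= 6 * N * D (training FLOPs for a dense
# transformer with N parameters and D training tokens) and the published finding
# that N and D should grow in roughly equal proportion, about 20 tokens per
# parameter. The constants are illustrative, not Gemini's internal numbers.

def compute_optimal_split(train_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a given FLOP budget."""
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * ratio))
    n_params = (train_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for flops in (1e23, 1e24, 1e25):
        n, d = compute_optimal_split(flops)
        print(f"C={flops:.0e} FLOPs -> ~{n / 1e9:.0f}B params, ~{d / 1e12:.2f}T tokens")
```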
The other line of work was Retro, which is more on the architectural innovation side of things. Here we were looking at how you can improve models by giving them the ability to retrieve from a large corpus of text. So rather than having the model learn and store all the knowledge in its parameters, you give the model the ability to look up specific things during training but also during inference.
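As a toy illustration of that retrieval idea (not the actual Retro architecture, which embeds chunks with a frozen BERT model and attends to neighbours through chunked cross-attention), here is a sketch where related chunks are looked up from a small corpus and placed alongside the query; the crude lexical similarity function is just a stand-in for a real embedding-based nearest-neighbour index.

```python
# Toy sketch of retrieval augmentation: look up related chunks from a corpus and
# let the model condition on them, instead of storing all knowledge in parameters.
# The lexical overlap score below is a stand-in for a real embedding index; the
# actual Retro system uses frozen-BERT embeddings and chunked cross-attention.

def overlap_score(query: str, chunk: str) -> float:
    """Crude similarity: fraction of query words that also appear in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

corpus = [
    "Gopher was a 280 billion parameter dense transformer trained by DeepMind.",
    "Retro augments a language model with retrieval from a large text database.",
    "Chinchilla studied compute-optimal scaling of parameters and training tokens.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus chunks most similar to the query."""
    return sorted(corpus, key=lambda chunk: overlap_score(query, chunk), reverse=True)[:k]

query = "How does retrieval help a language model?"
neighbours = retrieve(query)
prompt = "\n".join(neighbours) + "\nQuestion: " + query
print(prompt)  # the retrieved chunks would be fed to the model alongside the question
```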
You used the word research taste, which I think is super interesting. What does that mean? How would you define that and how important is that for a researcher?
Yeah. It's very important these days, and it's quite hard to quantify. But a few things matter. The first one, maybe, is that your research doesn't stand alone. This is what I was mentioning before: your research has to play well with everyone else's research and has to integrate, right?
So let's say I have some improvement to the model, but it makes the model 5% harder to use for everyone else. This is probably not a good trade-off, right, because you're gonna slow down everyone else and their research, which would then cumulatively slow down the overall research progress. That's the first thing. The second thing is being allergic to complexity. Complexity is quite subjective in terms of what people are familiar with, but still, we have a certain budget of complexity we can use and a certain amount of, almost, research risk we can accumulate before things go bad.
And so being aware of that and managing that is very important. Oftentimes we don't necessarily want to use the best-performing version of a research idea; we'd rather trade off some of the performance for a slightly lower-complexity version, because we think that will allow us to make more progress in the future. So these are the main two things, I think, around research taste.
That's fascinating. And then presumably part of it has to do with having an intuitive sense for what may work and not work, right, given there's only so much compute you can use. Is that fair?
Yeah. Definitely. That's also an important part. Some people have that much more than others, and a lot of experience really helps. But for sure, we are constrained on the research side by compute.
If we had a lot more compute, I think we'd make a lot more progress, a lot quicker. So you have to guess to some extent what the right part of the research tree is that you want to explore, and then, within that, what the right experiments are. But then also knowing that most research ideas fail, right? So you need to figure out: at what point have I done enough in this direction to know to move on to something else, or should I keep pushing?
And then the other interesting thing is, especially in deep learning, a negative result doesn't mean something doesn't work. It often means you haven't made it work yet. And so being aware of that as well is quite tricky.
Since we're on this topic of research and how to organize a research team to be successful, let's double-click on some of this. So you mentioned trade-offs. Presumably one kind of trade-off is short term versus long term. How does that work? How do you all think about that?
This is part of what I spend a lot of time thinking about as well. There's always critical path things to be done or like this part of the model needs improving or we know this part of the model is suboptimal. So we invest quite a lot in just fixing those immediate things. There's a few reasons for that. The first one is we know this will make the model better, so it's a fairly safe bet.
But also, we know that things that don't look quite good or quite perfect often tend to have issues later, either when you scale up or when the model just becomes more and more powerful. So actually being very diligent about tackling and fixing those is really important. That's the first part. The second part is more exploratory research: ideas that could land in the next version of Gemini, or the version after that, that have maybe a bit bigger effect on model performance but aren't quite validated. How we balance these, I don't think I have a very clear answer.
It's also a bit periodic. So when we're doing a scale-up, for example, there's certainly more exploratory research, because there's nothing right now that needs to be fixed in parallel. But just before we're ready to scale up a new architecture or a new model, it's very much: let's de-risk the last pieces. It's very execution focused.
How does that work, in the same vein, with the tension between research and product? As we were discussing earlier, you're all in this constant race with the other labs. So presumably there's some pressure of, like, oh no, we need to have a better score or win IMO or whatever it is, a very pragmatic, immediate product goal, versus stuff that we know is going to improve the model over time? How does that work? I guess it's just a variation on the same theme.
This is why I like Google as well. There's actually very little of that, I think, because all of the leadership has a research background. They're very much aware that, yes, to some extent you can force and accelerate specific benchmarks and certain goals, but in the end, the progress and making the research work is really what matters. Personally, at least on a day-to-day basis, I never really feel that pressure.
How is the team at DeepMind organized? So you mentioned pre-training, several hundred people if I heard correctly. Is there then a post-training team? Is there an alignment team? How does everyone work together?
At a super high level, we have a pre-training team and a post-training team. On the pre-training side, we have people working on the model, on the data, on the infrastructure. Evals as well, very important. I think people often underestimate the importance of evals research; it's actually quite hard to do this well. And then, yes, there's a post-training team, and of course there's a large team working on infrastructure and serving as well.
All right. Thank you for that. Let's switch tack a little bit and, as promised, go fairly deep into Gemini 3, if you will. So, Gemini 3 under the hood: the architecture, Deep Think, the pre-training, data, scaling, all those good things. Starting at a high level with the architecture.
So Gemini 3, which, you know, as a devoted user, feels very different from 2.5. Was there a big architectural decision that explains the difference? And then how would you describe that architecture?
At a high level, I don't think the architecture has changed that much compared to the previous one. It's more of what I was saying before, where a few different things come together to give a large improvement. At a high level, though, it's a mixture-of-experts architecture, transformer based. So from that perspective, if you squint enough, you will recognize a lot of the original transformer paper's pieces in it.
Yep. Can you describe for people, to make this educational, what an MoE architecture is?
At a high level, the transformer has two kinds of blocks. There's an attention block, which is responsible for mixing information across time, so across different tokens. And then there's the feed-forward block, which is more about giving the model the memory, but also the compute power, to make these inferences. Those operate on a single token at a time, so they operate in parallel. In the original transformer architecture, this is just a single-hidden-layer neural network, so it's a dense computation where the input gets linearly transformed into a hidden dimension, you apply some activation function, and that gets linearly transformed again into the output of the dense block.
So that's the original paper. Then there's a lot of work from before transformers as well on mixture of experts. And here the idea is that you decouple the amount of compute you use from how large the parameter count is. So you dynamically route, effectively, to whichever expert you want the computational power to be used on, rather than having that coupled.
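As a rough illustration of that decoupling, here is a minimal NumPy sketch of a mixture-of-experts feed-forward layer with top-k routing; the sizes, the number of experts, and the ReLU expert MLP are illustrative choices, not Gemini's actual configuration.

```python
# Minimal NumPy sketch of a mixture-of-experts feed-forward layer with top-k routing.
# Each token is sent to only top_k of the n_experts MLPs, so total parameters grow
# with the number of experts while per-token compute grows only with top_k. All
# sizes here are illustrative; this is not Gemini's actual configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, top_k = 16, 64, 8, 2

w_router = rng.standard_normal((d_model, n_experts)) * 0.02        # routing weights
w_in = rng.standard_normal((n_experts, d_model, d_hidden)) * 0.02   # expert layer 1
w_out = rng.standard_normal((n_experts, d_hidden, d_model)) * 0.02  # expert layer 2

def moe_ffn(x: np.ndarray) -> np.ndarray:
    """x: [tokens, d_model] -> [tokens, d_model], each token using top_k experts."""
    logits = x @ w_router                                  # [tokens, n_experts]
    chosen = np.argsort(logits, axis=-1)[:, -top_k:]       # experts picked per token
    selected = np.take_along_axis(logits, chosen, axis=-1)
    gates = np.exp(selected - selected.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)                  # softmax over the chosen experts

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                            # per-token dispatch, clarity over speed
        for slot in range(top_k):
            e = chosen[t, slot]
            hidden = np.maximum(x[t] @ w_in[e], 0.0)       # expert MLP with ReLU
            out[t] += gates[t, slot] * (hidden @ w_out[e])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_ffn(tokens).shape)  # (4, 16): same interface as a dense feed-forward block
```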
Gemini is natively multimodal. In practical terms, what does that actually mean, for the model to think about text, images, or videos?
Yeah. What this means is that there's no specific model trained to handle images, and a different model trained to handle audio, and a different model trained to handle text. It's the same model, the same neural network, that processes all these different modalities together.
Presumably there is a cost aspect to this. Does being natively multimodal mean you're more expensive from a token perspective?
Yeah. This is a really good question. There are kind of two costs to this. I would say that the benefits largely outweigh the costs here, and this is why we train these models. But the first cost is maybe less obvious to people: it's the complexity cost in research that I was talking about.
Because you're doing a lot more things, and especially because the different modalities interact in some ways, this can interact with different parts of the research and has a complexity cost; we have to spend time thinking about these things. The second cost is that, yes, images are often larger in terms of input size than pure text. So the actual computational cost, if you do it naively, is higher. But of course, then there's interesting research to be done on how you make these things efficient.
Alright. Let's talk about pre-training, since it's the area that you cover in particular. Starting with the high-level question: we mentioned, of course, scaling laws towards the beginning of this conversation, and we talked about Chinchilla a few minutes ago as well. In 2025, there was this much-discussed theme of the death of scaling laws, particularly for pre-training.
Is Gemini 3 the answer that shows that all of this is not true, and that the scaling laws are indeed continuing?
Yeah. Those discussions always seemed slightly strange to me, because my experience didn't match them. I think what we've seen is that scale is a very important aspect in pre-training specifically, in how we make models better. What's been the case, though, is that people overvalued that aspect. It is a very important aspect, but it's not the only aspect.
So scale will help make your model better. And what's nice about scale is that it does so fairly predictably; that's what the scaling laws tell us: as you scale the model, how much better the model will actually be. But this is only one part. The other parts are architecture and data innovation. These also play a really, really important part in the performance of pre-training, probably even more so than pure scale these days.
But scaling is still an important factor as well.
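For readers who want the concrete shape of what such scaling laws predict, here is the parametric loss form commonly fit in this line of work, including in the Chinchilla paper; the constants below are approximately that paper's published fit and are purely illustrative, not anything about Gemini.

```python
# The parametric form commonly fit in scaling-law studies, including Chinchilla:
# predicted loss = E + A / N**alpha + B / D**beta, with N parameters and D tokens.
# The constants below are approximately the Chinchilla paper's published fit and
# are purely illustrative; they are not Gemini's numbers.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# Compare doubling data at fixed model size against doubling model size at fixed data.
print(f"baseline      : {predicted_loss(70e9, 1.4e12):.3f}")
print(f"2x tokens     : {predicted_loss(70e9, 2.8e12):.3f}")
print(f"2x parameters : {predicted_loss(140e9, 1.4e12):.3f}")
```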
Right. And we're talking about pre-training specifically, right? Because this year, we seem to have scaled RL in post-training and scaled test-time compute, all those things. But for pre-training, you're seeing the scaling laws not only not slowing down, but you see some acceleration.
Do I understand correctly that that's due to data and different architectures?
I think the way to put this is that these all compound. So scale is one axis, but the model and the data will also make the actual performance better. And, yes, sometimes the innovation part outweighs the benefits of scaling more, and sometimes just raw scaling is the right answer to make the model better. So that's on the pre-training side. And, yes, on the RL and RL-scaling side, I think we're seeing a lot of the same things we saw in pre-training.
What's interesting here is that because we have the experience of pre-training, a lot of the lessons apply, and we can reapply some of that knowledge to RL scaling as well.
Speaking of data, what is the pre-training data mix on Gemini 3? I think you guys had a model card out for a bit that talked about some of this. So what went into it?
Yeah. It's a mix of different things. The data is multimodal from the ground up. And, yeah, there are many different sources that go into it.
Another classic question in this whole discussion is: are we about to run out of data? There's always the question of whether we have enough compute, and the other question is whether we have enough data. Clearly, there's been a rise in the usage of synthetic data this year. In your day-to-day work, or perhaps in general, where do you think synthetic data helps and where does it not help?
Yes. So synthetic data is interesting. You have to be very careful in how you use it, because it's quite easy to use it in the wrong way. What's often the case with synthetic data is that you use a strong model to generate the synthetic data and then you run smaller-scale ablations to validate its effect. But one of the really interesting questions is whether you can actually generate synthetic data to train a model in the future which will actually be better than the model that generated the synthetic data in the first place.
Can you actually make that one better as well? And so we spend a lot of time thinking about this and doing research in this direction. The other part of your question: are we running out of data? I don't think so. There's more.
We are definitely working on that as well. But more than that, I think what might be happening instead is kind of a shift in paradigm, where before we were scaling in the data-unlimited regime, where data would scale as much as we would like, and we're shifting more to a data-limited regime, which actually changes a lot of the research and how we think about problems. One good analogy: before LLMs, a lot of people were working on ImageNet and other benchmarks, and that was a very, very data-limited regime as well, so a lot of techniques from that time start to become interesting again.
And perhaps that's one of those, and I don't know to what extent you can talk about it, if not talk about it in general, but there is this concept throughout the industry of training models based on reasoning traces. So basically forcing the model to show its work, how it got to a certain outcome, and then taking that to train the next model. Is that something that you do, or that you think is interesting, or a future direction? What is your perspective?
Yeah. Unfortunately, I can't comment on the specifics.
That shows I'm asking the right questions. But maybe in general, is that something that people in the industry do?
I believe so. And this also falls into the previous question around synthetic data you were asking, and our approach to that is similar.
And perhaps without taking this into a futuristic conversation, but another big question and theme seems to be indeed: how can models learn from less data? Which I think is what you were alluding to when talking about a data-limited regime. Again, at DeepMind or in general, are you seeing interesting approaches where, to use the famous analogy, a model can learn like a child does?
Just to maybe clarify what I said earlier: by a data-limited regime, I didn't necessarily mean with less data, but rather with a finite amount of data. So the paradigm shift is more from "we have infinite data" to "we have a finite amount of data." The second point is that, in some sense, model architecture research is exactly what you mentioned. When you make an improvement on the model architecture side, what it typically means is that you get a better result if you use the same amount of data to train the model, but equivalently you could get the same result as the previous model by training it on less data. So that's the first aspect.
But it is true that the volume of data needed today is still orders of magnitude higher than what a human has available. Of course, there's the whole evolutionary process as well, and I find these high-level discussions quite hard to follow, because you have to make so many assumptions to convert that into today's pre-training data. But at least at first order, it does seem like we're using a lot more data than humans do.
What other directions in overall pre training progress are you excited about throughout the industry?
Yeah, I think one thing is that in Gemini 1.5, we had a really good leap in the long-context capabilities of the model, and I think that's really enabling the ability of models and agents today to do this work where you have maybe a code base and you do a lot of work on it, so your context length really grows. I think there's going to be a lot more innovation on that side in the next year or so, to make long context more efficient but also just to extend the context lengths of the models themselves. So on the capabilities front, that's something where pre-training specifically has a lot to offer and is very interesting. Relatedly, I think for us, at least on the attention side, we've made some really interesting discoveries recently that I think will shape a lot of the research we do in the next few months, and I'm personally very excited about that. Again, I want to emphasize the point I made towards the beginning: the way things work is it's really a combination of many different things.
There's a lot of small and medium-sized things that we can already see coming up, where I think: we fixed this issue, we fixed this bug, this is an interesting research direction that showed promising things, and all of these things coupled together, I think, will drive a lot of the progress again.
It's interesting, you know, thinking about Retro, which we talked about a bit earlier. You're a co-author of Retro, which was about efficiency and smaller models doing more. And now you're in the world of Gemini 3, which is massive amounts of data and training and very long context windows. Do you think that this paradigm of having larger models and large context windows effectively obviates the need for RAG and search, and that everything gets folded into the model? I mean, obviously there's a corporate data part, but in general.
There are some interesting questions here. First of all, I think Retro was really about retrieving information rather than storing it, not necessarily about making models smaller. So it's about how we can use the model to do more reasoning, in a pre-training sense of reasoning, rather than just storing knowledge. And that is still very much relevant today. The interesting part is that the iteration cycle of pre-training used to be a lot slower than that of post-training, until fairly recently.
And so making these large changes on the pre-training side is quite costly in terms of risk and how long it takes. Then you have approaches like RAG or search, which you can do during post-training and iterate on much more quickly, and which give very strong performance as well. I think deep down I do believe that the long-term answer is to learn this in a differentiable, end-to-end way, which means probably during pre-training, whatever that looks like in the future: learn to retrieve as part of training and learn how to do search as part of the large part of training. And I think RL scaling maybe starts that process, but I think there's a lot more to do, also on the architecture side. But this is something we'll see in the next few years and not immediately, I would say.
The one thing I want to highlight is that people often talk about model architecture, and that's definitely one part of what makes pre-training better. But there are other parts as well, infra and data and evals specifically, that don't always get the same mention. Evals specifically are extremely hard, and even harder in pre-training, I would say, because there are two gaps you need to close. On the one side, the models we train regularly for evals are much smaller and less powerful than when we scale up. So that means the evals have to be predictive of the performance of the large model and point in the right direction; they have to be a good proxy on that side.
And then there's a second gap as well, which is that when we evaluate pre-training models, there's a post-training gap. The way models get used is that they don't just get used straight out of pre-training; there's more training happening after. And so the evals we use in pre-training, or on pre-trained models, have to be good proxies for what happens after as well. So making progress on evals is really important; it has been quite hard, and it has also driven a lot of the progress we have in terms of being able to measure what the actual improvement is on the model or on the data side.
And evals at DeepMind, that's all internally built, like you have your own set of evals?
Yes, to a large extent, and more and more so, because what we found is that with external benchmarks, you can use them for a little while, but very quickly they become contaminated: they start to be replicated in different forms on different parts of the web, and then if we end up training on those, it's really hard, basically, to detect leaked evals. The only way you really have to protect against cheating yourself, and thinking you're doing better than you are, is by actually creating held-out evals and really keeping them held out.
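To make the contamination problem concrete, here is a minimal sketch of one common style of check: flagging training documents that share long n-grams with held-out eval questions. The n-gram length and threshold are illustrative assumptions, and this is not DeepMind's actual decontamination pipeline.

```python
# A simple n-gram overlap check, one common way to flag training documents that may
# contain leaked eval questions. The choice of n and the threshold are illustrative;
# this is not DeepMind's actual decontamination pipeline.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(document: str, eval_questions: list[str],
                       n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a document if a large fraction of any eval question's n-grams appear in it."""
    doc_grams = ngrams(document, n)
    for question in eval_questions:
        q_grams = ngrams(question, n)
        if q_grams and len(q_grams & doc_grams) / len(q_grams) >= threshold:
            return True
    return False

held_out = ["What is the compute optimal ratio of training tokens to parameters "
            "for a dense transformer under a fixed FLOP budget"]
scraped_page = ("Quiz answers: what is the compute optimal ratio of training tokens "
                "to parameters for a dense transformer under a fixed FLOP budget? ...")
print(looks_contaminated(scraped_page, held_out))  # True: drop or down-weight this document
```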
In the same vein, is alignment a part of what you all think a lot about at the pre training level? Or is that more of a post training kind of conversation or both?
It's mostly post-training, I would say, but there are definitely some parts of it which are relevant to pre-training. I can't go into too many details here, but some parts are relevant to pre-training, and we do think about that as well.
And at a very simplistic level, I always wonder again in the context of Gemini or otherwise, if the core dataset is the internet, there's a lot of terrible things on the internet, is alignment 101 that there's stuff that you just do not include in the model?
This is an interesting question, and I don't think I have a definitive answer, but you don't want the model to do these terrible things. Yet at a fundamental level, you do need the model to know about those things. So you have to train at least a bit on those, so that it knows what those things are and knows to stay away from them. Otherwise, when a user mentions something terrible, the model wouldn't even know what it's talking about and might not be able to say this is something terrible.
Right?
Let's talk about Deep Think, the thinking model that was released a few days after Gemini 3. First of all, is that a different model or is that part of the same model? How should one think about it?
I'm not allowed to... I can't comment too much on that specific question.
So what happens when the model thinks and, you know, you wait for ten seconds or twenty seconds or whatever time? What happens behind the scenes?
Yes. I mean, I think this has been covered quite a bit in some of your previous podcasts as well, but it's about generating thoughts. Rather than just doing compute in the depth of the model, you also spend compute and allow the model to think more on the sequence-length side of things. The model actually starts to form hypotheses, test hypotheses, invoke some tools to validate the hypotheses, do search calls, et cetera. And then at the end, it may be able to review the thought process to provide a definite answer to the user.
The industry has normalized around that paradigm of chain of thought?
That's fair, yeah.
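As a rough sketch of that loop, here is a toy version of thinking before answering: spend extra sequence-length compute on a thought trace, optionally call a tool to validate a hypothesis, then commit to a final answer. The fake_model and fake_search functions are scripted stand-ins so the control flow runs; they are not a real Gemini or tool API.

```python
# Toy sketch of a think-then-answer loop: the model spends extra tokens forming and
# checking hypotheses (optionally via a tool call) before producing a final answer.
# fake_model and fake_search are scripted stand-ins, not real APIs.
SCRIPT = iter([
    " I should check when Gemini 3 was announced. TOOL: search('Gemini 3 announcement')",
    " The search result gives the date, so I am READY to answer.",
    " Gemini 3 was announced in November 2025.",
])

def fake_model(prompt: str) -> str:
    """Scripted stand-in for a model call; a real system would sample from an LLM."""
    return next(SCRIPT)

def fake_search(query: str) -> str:
    """Scripted stand-in for a search tool."""
    return "Google announced Gemini 3 in November 2025."

def think_then_answer(question: str, max_steps: int = 4) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = fake_model(transcript + "Thought:")
        transcript += f"Thought:{thought}\n"
        if "TOOL:" in thought:                    # model chose to validate via a tool call
            transcript += f"Observation: {fake_search(thought)}\n"
        if "READY" in thought:                    # model signals it has gathered enough evidence
            break
    return fake_model(transcript + "Final answer:").strip()

print(think_then_answer("When was Gemini 3 announced?"))
```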
Can you talk a little bit about the agentic part of this and Google Antigravity? What do you find interesting about it? What should people know about it?
Yeah, this is, I guess, what I was mentioning before around my own work especially. I think that's interesting. A lot of the work we do on a day-to-day basis is more execution based, babysitting experiments, et cetera. And I think this is where I, at least, see the most impact from those. Bringing it back to the topic of pre-training, I think the perception and vision side is very important for this, because now you're asking models to interact with computer screens.
So being able to do screen understanding really, really well is critical. And that's an important part on the pre-training side, at least.
And in Antigravity, there's a whole vibe-coding aspect, truly vibes in that you don't even really see what happens when you ask. So, same question about vibes: is that a pre-training thing? Is that just a post-training thing? How do you build vibes into a model?
Yeah. This is interesting. I think you can probably ask five different researchers and you'll get five different answers. There's also this notion of large-model feel, as people call it; I think GPT-4.5 historically had some of this, presumably, where larger models maybe feel different.
I wouldn't put it in those terms specifically, but I think vibes comes down to this, and pre-training probably plays a larger role today in some of that, in how the model feels generally, than post-training. For vibe coding specifically, I think that's maybe more of an RL-scaling and post-training thing, where you can actually get quite a lot of data and train the model to do that really well.
So zooming out a little bit, maybe for the last part of this conversation, I'm curious about where things are going in general. There was a key theme discussed at NeurIPS this year around continual learning. And I'm curious about your perspective, especially from a pre-training perspective, right? Because we are in this paradigm where every few months or years, we, and by we I mean you, train a very large new base model. First of all, what is continual learning?
And two, how does that impact pre-training if continual learning becomes a thing?
Yeah, I guess continual learning is about updating the model with new knowledge as new knowledge is discovered. Let's say a new scientific breakthrough is made tomorrow. The base model we trained yesterday wouldn't actually know about it from its pre-training. First, I think a lot of progress has been made on this front in the last few years.
I think this is mostly around post-training, around search: models use search tools and make search calls, and then they have access to that new information. In some sense, this is also what Retro, which we talked about, was doing, by retrieving data and trying to externalize the knowledge corpus away from the reasoning part. That's the first part. The second part, on the pre-training side specifically, is what I was mentioning about long context as well.
One way of doing this is that if you keep expanding the context of the user, the model keeps getting more and more information in that context. So you have this continual learning aspect as part of that. But then of course there's more of a paradigm shift, and maybe this is what people discuss: can you change the training algorithm such that you can continuously train models on a stream of data coming from the world, basically.
Beyond continual learning, what do you think is hot, interesting, or intriguing in current research today?
Yeah, again, there's a lot of small things right now that accumulate, so that's kind of the first thought that comes to my mind. That historically has really driven progress, so I wouldn't bet against that continuing to drive progress. The things I mentioned before around long-context architecture and long-context research are one aspect, and the attention mechanism as well on the pre-training side. And then this paradigm shift from infinite data to the limited-data, or finite-data, regime.
That is something as well, I think, where a lot of things will change, and there's a lot of interesting research. That's on the pre-training side alone. The other side, which is quite interesting today, is that the number of people using these models is growing quite rapidly. So more and more, what we have to think about on the pre-training side as well is how expensive the model is to use, to serve, and to really deploy at a large scale, and what things on the pre-training side specifically we can do to make the model have better quality and maybe be cheaper to serve and consume fewer resources during inference.
For any student or like PhD student listening to this, if they want to become you in a few years, what problems do you think they should think about or focus on that's not, you know, like a year or two out, but, like, more interesting sort of a few years out?
One thing that's becoming increasingly important is being able to do research but being aware of the system side of things. So we are building these fairly complicated systems now. Being able to understand how the stack works all the way down from TPUs to research is kind of a superpower because then you are able to kind of find these gaps in between different layers that other people won't necessarily be able to see but also to reason through the implication of your research idea all the way down to the TPU stack. And people that can do that well I think have a lot of impact in general. So in terms of specialization, it's really thinking about this research engineering and systems aspect of the model research and not just pure model architecture research.
That's one. Personally, I still have a lot of interest in this retrieval research that we started with Retro, and I think it wasn't quite ripe until now. But things are changing, and I just think it's not unreasonable to think that in the next few years something like that might actually become viable for a leading model like Gemini.
And why was it not ripe, and what made that change?
I think that's around the complexity side of things I was mentioning, and also the fact that for the capabilities it brings, you can iterate much more quickly in post-training. What I was saying about search and post-training data can give very similar capabilities to the model in a much simpler way. And as post-training grows and RL scaling grows as well, maybe that shifts again towards more on the pre-training side.
Do you think there are areas of AI right now that are over invested in, where there's a disconnect between what makes sense and where the industry is actually going and investing dollars in?
I think it's gotten a lot better. Maybe two years ago, what I was seeing is that people were still very much trying to create specialized models to solve tasks that were maybe within half a year or a year of reach of generalist models. I think people have caught up to that much more and now believe that for generalist tasks, or tasks which don't require extremely specialized models, you should try to use a generalist model, and maybe not the current version, but the next version might be able to do it. So what that means is that research in terms of how you use models and the harness, et cetera, is becoming increasingly important, and also how you make models and harnesses more robust to making errors and able to recover from such errors.
Yeah. In that vein, do you have any advice or recommendations for startups? Seen from the perspective of a founder, or the VCs who love them, there is this feeling that the base models are becoming ever more powerful and are trained on multiple datasets. It used to be, you know, that the model was able to converse, but now it's able to do financial work and cap tables and that kind of thing, which seems to shrink the area of possibility for startups.
Do you have thoughts on that?
Yeah. I think maybe you have to look at what models were able to do a year or a year and a half ago, then look at what models are able to do today, and try to extrapolate that. The areas where the models are improving, I think, will continue to improve. And then there are maybe some areas where there's not been that much progress, and those might be more interesting areas to work on. I don't really have a specific example in mind right now, but that would be the general advice.
What are you excited about for the next year or two in terms of your personal journey?
What I like very much about my day-to-day is working with many people and being able to learn from a lot of researchers. That's what drives me to a large extent. Every day I come to work and I talk to really, really brilliant people, and they teach me things that I didn't know before. And so I really like that part of my job. As I've said multiple times at this point, there are just so many different things that will compound, and different things where there's headroom to improve.
I'm really, really curious, because right now I don't really see an end in sight for that kind of line of work to continue giving us progress. Actually being able to see this through and see how far this can take us is really interesting, because at least for the next year or so, I don't see this slowing down in any way.
Great. Well, that feels like a wonderful place to leave it. Sebastian, thank you so much for being on the podcast. Really appreciate it. That was fantastic.
Thank you.
Thank you, Matt.
Hi. It's Matt Turck again. Thanks for listening to this episode of the MAD Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode on. This really helps us build the podcast and get great guests.
Thanks and see you at the next episode.