From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution—from the PPO ...
Well, here is Josh from OpenAI. Welcome. How else do you introduce yourself? What else?
Yeah. I work on a bunch of the thinking models at OpenAI, and recently I've been sort of focused on doing search-related stuff. But, yeah, just a post-training researcher at OpenAI.
Yep. And you were on with us for GPT-4.1, when we were talking with Michelle, who's on maternity leave. I didn't know that. And now we're at 5.1. It's been a whole generation.
Yeah. It's been wild. And, like, you know, 4.1 was a nonthinking model. And then since then, you know, we sort of switched into...
Was that your last? When was your last?
No. We still are releasing nonthinking models, but that one was the one we did that was, like, API-specific nonthinking. So, you know, the focus has shifted a little.
Yeah. How'd you get into post training?
So previously, before OpenAI, I was doing, like, pre-training data curation stuff. And what I was seeing from the news and from looking at papers was, like, not that pretraining is dead, but, oh, there's gonna be so much interesting stuff in post-training. And at that point, I was like, I really wanna make some contributions there. And, I mean, it's not even necessarily that pretraining was dead, but it was definitely changing. And, like, you know, do I wanna make compute efficiency wins of, like, 3%, or do I wanna change the behavior by 40%?
And, honestly, it just seemed more exciting to go to post-training. And many late nights later, that's definitely true.
It's a different kind of data and engineering discipline too. It's very strange. Like, the kind of work that you need, especially in RL, like, scaling it.
Yeah. Definitely. I think, like, for example, the number of moving parts in an RL run is just a lot higher.
Like, in some ways an order of magnitude? Or...
I don't know if I could say order of magnitude. But if you think about, like, pretraining, you know, you're moving tokens to many machines, and then you're getting, like, basically a scalar from them, and then you're backpropping.
Yeah.
The issue with RL is, like, you're doing tasks, and each task could have a different grading setup. And each one of those different grading setups, that's, like, more infrastructure. And so, you know, when I'm staying up late trying to figure out what's going on with a run, it could be way more things than there are in a pretraining run, generally.
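To make the "each task has its own grading setup" point concrete, here is a minimal sketch, with entirely hypothetical names and file paths, of how an RL harness might attach a different grader to each task; the unit-test grader is the kind of extra per-task infrastructure being described:

```python
# Hypothetical sketch: each RL task carries its own grading setup.
from dataclasses import dataclass
from typing import Callable
import os
import subprocess
import tempfile

@dataclass
class Task:
    prompt: str
    grade: Callable[[str], float]  # reward in [0, 1] for one rollout

def exact_match_grader(expected: str) -> Callable[[str], float]:
    # cheap grader: string comparison against a known answer
    return lambda response: 1.0 if response.strip() == expected else 0.0

def unit_test_grader(test_cmd: list[str]) -> Callable[[str], float]:
    # expensive grader: write the model's code to disk and run a test
    # suite; "run_tests.py" below is a made-up placeholder script
    def grade(response: str) -> float:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(response)
            path = f.name
        try:
            result = subprocess.run(test_cmd + [path], capture_output=True)
            return 1.0 if result.returncode == 0 else 0.0
        finally:
            os.unlink(path)
    return grade

tasks = [
    Task("What is 12 * 7?", exact_match_grader("84")),
    Task("Write is_prime(n).", unit_test_grader(["python", "run_tests.py"])),
]
print(tasks[0].grade("84"))  # 1.0; real rollouts would come from the policy
```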
Yeah. And does it matter if you own the code of the task, or is it outsourced to a third party? You know, my sense of it, and the external sense of it, obviously I don't see it up close, is that you work a lot with external partners, and I'm sure you also have some internal stuff.
But which is better? Honestly, I don't think I'll comment, like, too much on, like, how many external partners
Well, there are some, and there are some internal. There are, you know...
Yeah. There's... we do, like...
It's a bit of a technical trade-off of, like, well, shit. Like, I don't own this code. You know?
So, well, when it comes to "I don't own this code," actually, when I'm babysitting a run or something, it doesn't really matter if it's internal, external, whatever. Like, do I understand the system that's running underneath? And I think you end up having to jump into a lot more code where you're like, I actually don't know what this does. Because, like, I work on my pieces of a run, and then there's also other people working on it. And, like, do I understand what their code is doing, so that at 12:30 in the morning, when something looks wrong, I'm looking at this code and...
Can I, like, get context fast enough to understand?
Throw Codex at it?
Oh, I use Codex so much. It's really changed how I work. I feel like there's a degree to which sometimes I feel trapped by Codex, because if I spend, like, thirty, forty minutes writing something that looks like a design doc, Codex can do more work in, like, fifteen minutes than I could do in a few hours.
But then, like, what do I do during those fifteen minutes after? And it's actually just really changed how the flow of my day goes, because I have to somehow now manage these, like, forty-minute sessions with, like, fifteen minutes where I could do something, but it's actually not nearly as effective as this new flow to the day. So I think I'm still getting used to that, honestly.
Yeah. Yeah. I think it should also be interesting for, like, just codebase understanding when you're encountering unfamiliar code. Absolutely.
So before we started, you briefly talked a little bit about the shopping model, which is, like, the latest hottest thing. And, obviously, we're recording this right after Black Friday, Cyber Monday. First of all, any interesting findings from basically releasing shopping in ChatGPT right into that period?
Okay. Well, the first thing is, I don't know why I would say, in a meeting in, you know, August or so, like, oh, hey, Black Friday is coming up, maybe we could do a release by then. In hindsight, like, wait.
Why would I say something like that?
Yes. Yeah. Now you own it.
Yeah. Exactly. I guess the most interesting thing to me is the new interruptibility and, like, the sort of qualitative experience of using it. And the same thing happens with Codex. Right?
Like, you write a prompt, and you can, like, press escape and say, oh, I messed something up. And we actually did the same thing in the shopping model. So it shows you its chain of thought with, like, what products it's looking at. And you can write it new messages saying, like, oh, you know, I actually wanted...
I hadn't heard this. Yeah.
Yeah. Like, I wanted USB-C on this, or whatever it is. And I think that's a really interesting new interaction paradigm that we have in a couple of our different services, and I'm excited to see how people use it and if they enjoy it.
Yeah. Why did it have to be its own model and not just, like, a new tool?
Stay tuned. I think, like, there's no reason we couldn't do it in the same model eventually, but if we wanna try out new things, sometimes it makes sense to make a new model. And I think it just made sense, this time, to say, like, can we do a deep research style model, but for shopping, where it's gonna look really hard all across the Internet for different things? You know, I think if you look at the original deep research and GPT-5 Thinking on, like, high reasoning today, you'll see that eventually the models all sort of converge in their capabilities.
Yeah. Would you say... this is a discussion that's also a little spicy that kicked off in the community. There's still maybe 30% of the community using deep research. A lot of them have moved over to just using 5 Thinking as deep research. Is that the spiritual successor?
Are they direct replacements? Are there things that we lose from the original deep research model if we do that?
I mean, I think if you look at our published evals, they look basically on par, if not better. So, like... I mean, that's personally what I do. I use thinking on high versus using the deep research model. But, you know, as we've learned over the past few months, sometimes people prefer the quirks of one model over another. And if people like the deep research model, you know, more power to them.
People like 4o? Anything special in the 4o post-training? Like, are people really responding to personality? Is that, like, a differentiator that people really care about? And, like, is it part of your job to care about personality?
Yeah. I mean, definitely people, like, care quite a bit about personality. I think, like, over the past few months, we've been working a lot on giving users more choice over what personality they want.
Right. Which is the toggles.
Yeah. Yeah. So now we have those toggles.
What's your favorite toggle?
Honestly, custom instructions. Like, I personally want my model to, like, be a tool. And so I don't necessarily want the warmth or anything. I just want some answers, because, you know, I'm mostly using it at work.
Yeah. So I call this the Anton versus Clippy divide. Anton is from HBO's Silicon Valley. Okay. It's a machine.
It only does work. It doesn't try to be helpful or friendly or anything. I mean, it tries to be helpful, but it doesn't try to be cheery, whereas Clippy tries to be cheery. And I'm like, well, stop smiling at me. I'm, like, having a problem.
It's like a...
So it sounds like you also come down on the side of using Anton.
Yeah. Yeah. I think a lot of developers want Anton. Yeah. It just quietly does its work, and when it's done, it shuts up.
And... yeah. Well, I think we're doing a lot of work to provide for both, like, the Antons and the Clippys, and I hope they all like it.
Yeah. So just generally, I was thinking about, like, what can we update people on in post-training? You know, what do we know today at NeurIPS 2025 that we didn't know at NeurIPS 2024? I would say, at the time, there was still this whole PPO versus DPO discussion. That was a whole era.
Yeah. And since then, we've moved on to RLVR and, I think, a lot of agent-specific RL training. I guess, like, am I missing any large chunks of the post-training debates that are going on?
Yeah. I mean, so not necessarily internal debates, but my personal read from looking at different papers that are coming out: when you look at, like, an RLVR paper or an RLHF paper, they read more like optimization papers. And to me, the sort of interesting thing that's going on is we have this spectrum of how high quality a signal is. Like, really, at the end of the day, RLHF and RLVR are both policy gradient methods; what's different is just the input data. And it's always interesting to me that we call RLHF non-verifiable, because we've trained this model to be good at, like, predicting human feedback.
So in some sense, that's like verification. But obviously
It's human preference rather than truth.
Yeah. Yeah. But, like, if your value of truth is "does the user like this more?"... See, there's something strange there. I think we haven't looked at that axis of, okay, how clean is this signal?
How much do I trust it? And, like, I totally agree that you don't necessarily trust the RLHF signal as much as, like, "is this the solution to this polynomial?" But I think there's a whole spectrum of how high quality the signal is, and what's gonna happen when I do a lot of optimization against it. And that's very different from worrying about, like, the variance of different gradients, which...
I think that is what you end up seeing in a lot of the papers that are currently coming out. Rather than being very data-centric, they're pretty optimization-centric, even though I think the innovation really is where the data's coming from.
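To illustrate the point that RLHF and RLVR share the optimizer and differ mainly in the reward source, here's a toy policy-gradient sketch with my own illustrative names, not OpenAI code: the update rule is identical, and only where the reward comes from changes.

```python
# Toy sketch: same policy-gradient update, two different reward sources.
import torch

def policy_gradient_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE with a mean baseline; used unchanged for RLHF and RLVR
    advantages = rewards - rewards.mean()
    return -(logprobs * advantages).mean()

def rlvr_reward(response: str, answer: str) -> float:
    # "verifiable": a programmatic check, e.g. the answer to a math problem
    return 1.0 if response.strip() == answer else 0.0

def rlhf_reward(response: str, reward_model) -> float:
    # "non-verifiable": a model trained to predict human preference,
    # i.e. a noisier point on the same signal-quality spectrum
    # (reward_model here is a stand-in for any learned preference scorer)
    return float(reward_model.score(response))
```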
Yeah. I wanna go broad before I go deep. Any other discussions that maybe you're having at NeurIPS, or roundabout this time, on post-training debates? Like, when you meet your peers at Anthropic and DeepMind, what are you talking about?
Well, with Anthropic and DeepMind, we're all saying we're working on "stuff and things," you know? I think it's more that I'm talking broadly with my friends there, or we're just talking about, man, the infra is so hard to keep up with. We're not necessarily talking too much about methods directly.
Because on one level, it kinda doesn't matter.
Yeah. And I think also there's something that's very different about academic work, where what really matters is how narrativizable it is. And I think that's one of the reasons you see a lot of optimization papers come out: a lot of the data work has a less clear narrative around it.
I think the data and the scaling is actually more important than a specific... Yeah.
But it doesn't necessarily have the same narrative that you get out of some of the papers that you see here. And so it becomes more of a, given a specific vertical, how do I understand that? And I wish there were actually more papers on it here, but I think it can sometimes be harder to wrap up into a clean story.
Yeah. That's also something we're actually having a lot of conversations about with other folks as well. Like, what's next? Right? Where do you go from here, now that we have, like, some kind of road map?
I think what's also interesting for me is, I guess, whether the innovations that are exposed by the Chinese models are maybe copies of, or discussions of, what's going on in the labs. Obviously, GRPO... you mentioned a lot of these RL optimizations. They present themselves as optimizations. GRPO came out in the DeepSeek math paper, which, when it came out, I read it, and I was like, okay, this is kinda cool.
It's, like, a little bit cheaper. But it does seem to have had a broader impact on the industry as a whole than was initially appreciated. I just don't feel like we've processed that enough.
Yeah. Definitely. I mean, yeah, as you said, it came out in the DeepSeek math paper, and it's an interesting optimization method. But the more interesting thing is that they have a new reward signal that we can really, really trust. Like, when you find the answer to a math problem, it's a lot less debatable than, oh, well, is this thing that the human preferred actually what we want to do? Yeah.
Like, you wanna be right at math. Yeah. Yeah. And so I think in some ways it's underappreciated in, I would say, what's getting published. Yeah.
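For reference, the group-relative trick from the DeepSeekMath paper is small enough to sketch: sample a group of responses per prompt, then normalize each verifiable reward against the group instead of learning a separate value network. This is a rough sketch of the advantage computation only, not the full clipped objective:

```python
# Rough sketch of GRPO's group-relative advantages (from DeepSeekMath).
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # group_rewards: shape (G,), one reward per sampled response to a prompt
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)

# e.g. five sampled answers to one math problem, checked programmatically
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0])
print(grpo_advantages(rewards))  # positive for correct, negative for wrong
```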
Yeah. Let's talk about, I guess, long horizon. Yeah. What do people consider, in terms of, like, very long horizon? Like, we're talking thirty hours, you know, more than a day of autonomy.
Is it just more of the same, or is there anything, like, sort of qualitatively different?
Okay. So first off, I would say I tend to think more in terms of, like, actual number of tokens than time, because I think... Yeah.
The human in the loop can take a while.
Yeah. Well, and also it gives you a different measure to optimize against. Right? Like, as I was saying earlier, when I use Codex, it does something that would take me much longer. You know, it does in, like, ten minutes what would take me four hours.
What we can actually push on there is token efficiency. So, like... Yeah. And that is a huge, huge research area. Yeah. And so you can see, like, from 5 to 5.1, our overall evals, you know, we bumped some.
But if you look at a 2D plot of how many tokens it takes for us to get that, it went way down. And so I think that's, like, an...
Did you guys cheer when you had that? Like, that was such a great chart.
Dude, I live by those charts. Like, that...
Those... I live by those. Your chart? Okay.
Not that chart necessarily, but, like, that shape of chart. Yeah. Like, I think that's something we think about a lot, just because it contributes so much to your experience. Like, how long does it take to do this task? Yeah.
And I think the other thing is, as you're pushing that token efficiency, it changes, you know, how many tool calls can I make, and how many different things can the agent do, in a reasonable number of tokens that we can actually serve? Yeah. And so I personally think in terms of tokens. Yeah.
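The "2D plot" described here is easy to picture: score on the y-axis, tokens spent on the x-axis, one curve per model. The numbers below are invented purely for illustration, not OpenAI's actual evals:

```python
# Illustrative token-efficiency plot with made-up numbers.
import matplotlib.pyplot as plt

tokens_old = [2_000, 8_000, 32_000, 128_000]
acc_old = [0.55, 0.68, 0.76, 0.80]
tokens_new = [1_000, 4_000, 16_000, 64_000]  # similar scores, fewer tokens
acc_new = [0.58, 0.71, 0.79, 0.82]

plt.plot(tokens_old, acc_old, marker="o", label="older model")
plt.plot(tokens_new, acc_new, marker="o", label="newer, more token-efficient")
plt.xscale("log")
plt.xlabel("tokens spent per task")
plt.ylabel("eval score")
plt.legend()
plt.show()
```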
I think the interesting thing, or the hard-to-understand thing from the outside, is having an explicit router in GPT-5, but then also basically having an implicit router in the thinking-effort thing. That conflates a little bit. Right? Like, at some point, you do kinda need to merge them, or else you're just gonna get these weird bumps where sometimes the router at the top decides something and it's wrong, and, actually, if you had just handed it to GPT-5, it would have figured it out.
Yeah. And I think, you know, we'll figure out the correct abstractions over time. I think, like, there's a...
Is the intention still to merge? Because that's what was said in the paper.
Yeah. I think, like, eventually, you know, we'll have AGI, and you're not gonna have to worry too much about how hard to think directly. We'll just have one tool that you always go to, and it knows how long to think for, and things like that. I think the abstractions and the way that we drive these things today will change. And, like, you know, think about even how much we've changed: from having a nonthinking model and a thinking model that you choose between, to now we can sort of route, and choose how hard you wanna think.
We're adding lots of knobs and, you know, eventually it'll it'll probably simplify it.
Yeah. Another super interesting knob that everyone is doing is context compaction or memory compaction. What's going on there? Nothing to share at the moment. Nothing to share.
Okay. Clearly an important feature, clearly inspired by Codex usage as well, obviously. But I think, from the engineer's point of view, it feels like I used to do that as part of my harness, and now the model's doing it for me. And I don't know how to think about that. Like, I guess I'm used to having more control, and now I have less.
Yeah. Is there... is there a specific one? Like, a specific question?
I'm just getting, like, feedback of, well, is this a trend, where it's basically a permanent fact of life from here on out?
Oh, I see. You know, I don't know. I worked on long context. That was why I was on last time, for 4.1, where, you know, I think we 10x'd the effective context window. And so there will always be some dance of, like, well, if we wanna push as much as we can, not only should we increase the length of the context window, but we should also have strategies for keeping that context window available for as long as possible.
I'm guessing that both things will sort of happen, just because we wanna put as much power into the models as possible. Yeah. Yeah. I think we're still in a period where we should all be expecting changes in the interfaces that all of the models give to us, so that we can improve the models.
Because if we lock the interface... I think what would be sad, from my perspective, is if we lock the interface and then discover something new about models, we might sort of trap that improvement under an interface that needs to change.
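For readers who used to do this in their own harness, here's a minimal sketch of the kind of compaction being discussed, where `summarize` is a stand-in for any model call and the token count is a crude approximation:

```python
# Minimal compaction sketch: fold oldest turns into a summary when over budget.
from typing import Callable

def approx_tokens(text: str) -> int:
    return len(text) // 4  # rough rule of thumb: ~4 characters per token

def compact(turns: list[str], budget: int, summarize: Callable[[str], str]) -> list[str]:
    # keep recent turns verbatim; repeatedly summarize the two oldest
    while sum(approx_tokens(t) for t in turns) > budget and len(turns) > 2:
        turns = [summarize(turns[0] + "\n" + turns[1])] + turns[2:]
    return turns

# usage with a trivial stand-in summarizer
history = ["user: long setup..." * 100, "assistant: ...", "user: latest question"]
print(len(compact(history, budget=200, summarize=lambda s: s[:100])))  # 2
```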
Got it. Talking about long context as well, there's some discussion about, I guess, context rot, or the utilization of the context. Even if you gave us, like, a million-token context, we probably wouldn't use all of it. What's the recommendation there? Where are things going?
Are we gonna have, I guess, perfect context by next year? Is that an impossible dream? I don't know.
No. It's not an impossible dream. I think I'll give a shout-out to some of the evals that we did for 4.1, called Graphwalks, where...
I love Graphwalks. We covered this in in the podcast.
Yeah. Yeah. Yeah. We did. And, you know, I think if you look over time, all of those evals are still climbing.
And I think one of the interesting things about that is you have to do complicated transformations across the entire context window. Like, that's sort of the issue with those heat-map plots of the different needle-in-a-haystack evals. The problem is, if you only have to sample from one point in the context window, it's, like, sorta easy. Whereas with those Graphwalks problems, you're having to do multiple transformations across the entire context window. And so, I think, keep watching those.
I think they've been climbing, and they'll continue to climb. I would say that's definitely, like, a temporary issue that we're climbing on over time.
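For a sense of why Graphwalks-style evals stress the whole window, here's a small sketch of generating such a probe; the exact task format is my paraphrase, not the real eval:

```python
# Sketch of a Graphwalks-style probe: the context is one big edge list,
# and the answer requires hops across all of it, not a single lookup.
import random

def make_probe(n_nodes: int = 200, n_edges: int = 600, seed: int = 0):
    rng = random.Random(seed)
    edges = [(rng.randrange(n_nodes), rng.randrange(n_nodes)) for _ in range(n_edges)]
    src = rng.randrange(n_nodes)
    context = "\n".join(f"{a} -> {b}" for a, b in edges)
    question = f"Which nodes are reachable from node {src} in two hops?"
    one_hop = {b for a, b in edges if a == src}
    answer = sorted({b for a, b in edges if a in one_hop})  # ground truth
    return context, question, answer

context, question, answer = make_probe()
print(question, answer[:10])
```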
Yeah. And then, like, is 10,000,000 tokens realistic? Is a 100,000,000? Like, is there a natural end, or is there no end and we just go as far as the eye can see?
Oh gosh. I don't know. Like, what do you think? Yeah.
I feel like... okay. There are use cases that require billions, and there are use cases that require many, many billions, maybe trillions.
Yeah. Out of curiosity, like, what would be billions of tokens?
We just had a context engineering discussion about, like, a RAG code base over support issues for a company, and it was a 100,000 documents totaling about 8,000,000,000 tokens. You can't stick that in a context window, for now.
That's fair. I guess... so I would still say, like, I don't know. But I think I've been really surprised. It reminds me of when I was doing more information retrieval stuff, and, like, BM25 and these very simple n-gram indexes were just super hard to beat. I think the agents with grep feel really similar to me, where it's just unreasonably effective.
Yeah. So at the end, I will not use your 10,000,000-token context window even if you gave it to me.
Maybe. But, like, what if we're using that context window in service of some larger goal that just has a lot of sub-search calls? Which is why I'm saying, like, I just don't know. And I think that's what makes it so exciting.
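Since BM25 came up above as the hard-to-beat baseline, here's the standard Okapi BM25 score in a few lines, just as a textbook reference with the usual k1 and b defaults:

```python
# Standard Okapi BM25 scoring over tokenized documents.
import math
from collections import Counter

def bm25(query: list[str], doc: list[str], corpus: list[list[str]],
         k1: float = 1.5, b: float = 0.75) -> float:
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)      # document frequency
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]                                   # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["usb", "c", "charger"], ["hdmi", "cable"], ["usb", "hub"]]
print(bm25(["usb"], docs[0], docs))
```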
Yeah. Yeah. I would say also, like, the other modalities, like video, would eat up a lot. And then, obviously, the hard sciences have proteins and all that, a lot of information just encoded in physics. So, I mean, yeah, I have mixed feelings about it, just because I'm like, well, this will never scale, not with, like, full attention, and we probably just need to invest in systems anyway, which means we're good with what we have. I mean, like, get your Graphwalks up.
But, like, I don't know if we need, like, 10 or 100x, when maybe we actually need to figure out ways to 1,000x or 1,000,000x. Yeah. Right? Like, these are just different slopes.
I mean, I'm definitely... I'm glad that you're happy with the current context windows. I think my dream would be to push it and see what happens anyway.
But the engineers... Yeah. The engineers' incentive is always to say, well, the systems matter more than the models. Yeah. And the researchers' incentive is to say, well, screw your systems, we'll just improve the models. Oh, no.
So differently. Yeah.
And I think that's one of the most, like, sorta beautiful things about post-training at OpenAI is everyone...
Co-design. Yeah.
It's also co-design. Like, you know, I spend a lot of time just doing our systems stuff, and I also do lots of stuff where I'm making Graphwalks, and, like, doing a lot more things on the learning side.
And I think it's a great culture to have, a place where people just move seamlessly between the two.
Yeah. What are you guys hiring for? Because, remember, you're hiring. What are you guys hiring for that is hard to hire? What is the skill set that is, like, we really need this, can't find it? Please, everyone, go skill up on this.
This is definitely my personal opinion here: I think we're still having trouble, not at OpenAI but as a whole, producing lots of people that want to do lots of both systems work and ML work. And I think if you're trying to push the frontier, you don't know which place is currently bottlenecking the frontier, and it changes all the time. I mean, even within one project, the current bottleneck might change multiple times. But I think the education system we have right now isn't really optimized for that. Like, I personally studied math, and then I was very, very lucky to have some great mentors after school that taught me to be a good software engineer.
But it seems like if we're gonna be in this place for a while, and I think we will be, we should probably be producing more students that are great at doing both, you know, distributed systems and a lot of core engineering, as well as the statistics and other things that are required to be a good machine learning researcher.
If we were to throw Codex at it... obviously, we can't throw Codex at everything. That's why... let's say, which will progress faster? Which is more solvable by LLMs?
That is a that's a spicy question.
You can't say they're both equally hard. I don't know, maybe they are. I mean, they are differently hard. I think one is more hill-climbable than the other. Which is it? Because then we can go do it.
Okay. I think one thing that's slightly simpler about some of the ML research layer... you know, ML research is also distributed systems, to be clear. But some of the things that traditionally get called ML research are things that you can treat a bit more as a black box. Whereas, like, you know, the environment to train on, building these different systems, is actually just a complicated engineering problem. And so, theoretically, I would say that they're probably roughly equal, but I feel like there's some amount of effort to making the environments for it.
Yeah. Let's say they require... Yes. Yeah. Let's say they require GPUs in themselves as well.
Yeah. Yeah. I guess they both would, but, yeah, that would be my guess. I don't have high confidence in it, though.
Well, so a lot of people are building these, like, AI scientists. Right? They're automating... Yeah. Research. You guys have your own benchmark with PaperBench.
And that's the one area that, for example, at Cognition, we've just decided not to do, because it's so hard. Okay. Any other people on the post-training team that you wanna shout out, who have done interesting work this year? They should get more attention, but they're not getting credit.
Well, okay. For sure, everyone on the shopping team that I was just working with. So, like, Andrew Hoyal, Manuka Strata, John Hallman, all great people. Yeah. And Isa Fulford, obviously, the manager for it.
And she was the original Deep Research person. Yeah. Yeah. Yep.
Yeah. Yeah.
There were three of them. Yeah. Yeah. Yeah. And so definitely that part of the team.
But, I mean, everyone is so great. Like, it's hard to think of a list. It's a really fun time on post-training right now. It's exciting every day. Yeah.
It feels like we're all enjoying our Diet Cokes together in the office late at night. Yeah.
Oh, I did want to squeeze this in before we end. Nobody actually serious is saying that pretraining is dead. It's just a meme. There's a lot of work going on in pretraining. And in fact, a lot of my researcher friends are saying too much money is going to post-training.
That's also spicy. I don't know. One of the charts I hold in memory from this year is the Grok 4 chart. I don't know if you've seen it, but it's basically saying, well, we scaled pretraining to about this level of compute, and now we're spending the same level of compute on post-training as well. That's very controversial, I guess, to me, because we're all used to post-training taking orders of magnitude less data, compute, whatever, and obviously we're scaling that up now.
Do we get to a point where they're equal? I don't know, but that's a topic for conversation. Like, how much do we invest in this versus more, like, different pretraining? Yeah. Are you saying both?
Yeah. So first off, neither one of those is dead. I think it's really interesting to sort of be living through something like this. You know, all the other historic technological revolutions, I think, I read about in history books. And, like...
This was live as this happened. Yeah. This was... we don't know the end yet. Yeah.
And so there's this almost, like, fog of war, where I'm like, oh, what did people think when, like, they got the steam engine and they had, you know, the factories? I don't know if you know this, but the factories used to be, like, very linear, because you had to drive one motor across an entire room. And it made it so that when electricity got developed, they just tried to do the same thing. And they were like, ah, this isn't all that useful. And it took, I think, like, a couple of decades before they realized: wait, if we have electricity, we can move the little stations into whatever arrangement is most ergonomic.
And then, you know, manufacturing was transformed by electricity. And I think that really gives me no confidence in being like, oh, this thing is dead.
Yeah. Our timelines are so short. Yeah. But usually the way good ideas get experimented with and funded and propagated, that's still a human timeline. It's not an AI timeline.
Yeah. Yeah. And so I think, like, things will maybe be dormant, but it'll be spiky. Like, there'll be... you know, some... yeah. Yeah.
And then we'll all feel different. It's like, what's the meme? "It's so over. We're so back." Like, it's gonna be that many times, and I think having some emotional stabilizer for it is probably gonna be good for everyone's sanity.
Yeah. More sanity. Well, thank you so much for joining. Thanks for all the great post-training this year. Yeah.
Thank you. And, yeah, continue giving feedback. I love to hear what you think. Yeah. Awesome.
[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI