From pre-training data curation to shipping GPT-4o, o1, o3, and now GPT-5 thinking and the shopping model, Josh McGrath has lived through the full arc of OpenAI's post-training evolution—from the PPO ...
Josh McGrath from OpenAI's post-training team discusses the evolution from GPT-4.1 to GPT-5.1, emphasizing that the real innovation in post-training isn't the choice of optimization method like PPO vs DPO, but data quality and how much you can trust the reward signal. He argues that RLHF and RLVR are both policy gradient methods; the difference is the quality of the input data, not the math. Key insights include the underappreciated importance of GRPO's verifiable reward signals, the critical focus on token efficiency over wall-clock time (GPT-5 to 5.1 dramatically reduced the tokens needed while improving evals), and the ongoing challenge of developing researchers who excel at both distributed systems and ML work.
Josh explains his transition from pre-training data curation to post-training, motivated by the opportunity to change model behavior by 40% rather than chase 3% compute-efficiency wins. He details how much more infrastructure complexity RL runs carry compared to pre-training, with many more moving parts, including diverse task-grading setups and both external and internal partnerships.
Discussion of the shopping model released during Black Friday/Cyber Monday, featuring new interruptibility: users can break into the model's chain of thought and provide corrections mid-stream. Josh explains it's a separate model for now so the team can experiment with deep-research-style capabilities for shopping, though model capabilities tend to converge over time.
Josh argues that the key insight in post-training is that RLHF and RLVR are both policy gradient methods; the real difference lies in input data quality and how much you can trust the reward signal, not in the optimization technique. He highlights GRPO from DeepSeek Math as underappreciated because it introduced verifiable reward signals (math answers you can check) in place of human preferences you cannot verify.
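To make that point concrete, here is a minimal, illustrative sketch (not OpenAI's implementation) of a REINFORCE-style policy-gradient step in which the only thing that changes between RLHF and RLVR is where the scalar reward comes from; `reward_model`, `check_answer`, and `policy` are hypothetical placeholders.

```python
# Illustrative sketch: RLHF and RLVR share the same policy-gradient update;
# they differ only in the source of the scalar reward.
import torch

def rlhf_reward(reward_model, prompt, completion):
    # Learned reward model trained on human preference pairs (noisy, hard to trust).
    return reward_model(prompt, completion)

def rlvr_reward(check_answer, completion, gold_answer):
    # Verifiable signal: e.g. 1.0 if the extracted math answer matches, else 0.0.
    return 1.0 if check_answer(completion, gold_answer) else 0.0

def policy_gradient_step(policy, optimizer, prompt, completions, rewards):
    # Same update in both settings: completions with above-average reward
    # get their log-probabilities pushed up relative to the group baseline.
    rewards = torch.tensor(rewards)
    advantages = rewards - rewards.mean()      # GRPO-style group-mean baseline
    logps = torch.stack([policy.logprob(prompt, c) for c in completions])
    loss = -(advantages * logps).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Swapping `rlhf_reward` for `rlvr_reward` changes only how much you can trust the signal, not the math of the update, which is the episode's core claim.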
Josh emphasizes thinking in tokens rather than wall-clock time, since token efficiency is the key optimization target. GPT-5 to 5.1 showed significant eval improvements while dramatically reducing token usage, which matters both for user experience and for fitting more tool calls within reasonable serving constraints.
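One hedged way to operationalize "think in tokens" is to track tokens spent per solved task rather than latency; this is an illustrative metric with made-up numbers, not a published OpenAI measurement.

```python
# Illustrative token-efficiency metric: tokens spent per solved eval task.
def tokens_per_solve(results):
    """results: list of dicts with 'solved' (bool) and 'tokens' (int)."""
    total_tokens = sum(r["tokens"] for r in results)
    solves = sum(1 for r in results if r["solved"])
    return float("inf") if solves == 0 else total_tokens / solves

run_old = [{"solved": True, "tokens": 9000}, {"solved": False, "tokens": 12000}]
run_new = [{"solved": True, "tokens": 3000}, {"solved": True, "tokens": 4000}]

print(tokens_per_solve(run_old))  # 21000.0 tokens per solved task
print(tokens_per_solve(run_new))  # 3500.0 -> higher accuracy AND fewer tokens
```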
Discussion of long-context capabilities, with Josh defending GraphWalks as the key evaluation because it requires complicated transformations across the entire context window rather than retrieval from a single point. He notes these evals are still climbing and that context rot is a temporary issue being actively solved.
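For intuition, here is a simplified sketch of what a GraphWalks-style probe could look like (illustrative only; the actual eval's format and scale differ): the answer depends on edges scattered across the whole context, so single-point retrieval is not enough.

```python
# Sketch of a GraphWalks-style long-context probe: edges are spread through the
# entire context and the model must do a breadth-first walk to answer.
import random

def ground_truth(edges, start, hops):
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {b for a, b in edges if a in frontier} - seen
        seen |= frontier
    return seen - {start}

def build_graph_prompt(n_nodes=200, n_edges=600, seed=0):
    rng = random.Random(seed)
    nodes = [f"{rng.getrandbits(64):016x}" for _ in range(n_nodes)]
    edges = [(rng.choice(nodes), rng.choice(nodes)) for _ in range(n_edges)]
    rng.shuffle(edges)  # spread edges across the whole context window
    context = "\n".join(f"{a} -> {b}" for a, b in edges)
    start = nodes[0]
    question = (f"List every node reachable from {start} "
                f"in at most 2 hops, as a comma-separated list.")
    return context + "\n\n" + question, ground_truth(edges, start, hops=2)
```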
Josh identifies the critical hiring gap: researchers who excel at both distributed systems engineering and ML research. The bottleneck in frontier research shifts constantly between systems and ML, but the education system doesn't produce enough people skilled in both areas.
Josh addresses the 'pre-training is dead' meme and the controversial trend toward equal compute investment in pre-training and post-training (citing the Grok-4 chart). He draws parallels to the steam-to-electricity transition in factories, arguing we're in a 'fog of war' where declaring anything dead is premature: timelines are short, but human adoption is slow.
[State of Post-Training] From GPT-4.1 to 5.1: RLVR, Agent & Token Efficiency — Josh McGrath, OpenAI