[State of RL/Reasoning] IMO/IOI Gold, OpenAI o3/GPT-5, and Cursor Composer — Ashvin Nair, Cursor
Ashvin Nair, formerly of OpenAI's reasoning team and now at Cursor, traces the evolution of RL in language models from its robotics roots to IMO/IOI gold medals. He explains how OpenAI's o1/o3 development grew out of long-term conviction in RL scaling, why reasoning models don't generalize beyond their training distributions, and how Cursor co-designs products with models through policy updates shipped every two hours. The conversation covers the transition from academic RL research to practical applications, the challenges of continual learning, and why bringing economically useful tasks into the training distribution matters more than raw model intelligence.
Ashvin discusses his transition from robotics research at Berkeley and at OpenAI (in 2017) to language models, explaining why robotics researchers make good LLM practitioners. He argues robotics is still in the 'GPT-1 to GPT-2 era' despite recent demos from Physical Intelligence and Sunday Robotics, and believes LLM agents will be a trillion-dollar market before robotics reaches $10B.
Ashvin joined OpenAI in September 2022 on the Codegen/Codex team, right before ChatGPT launched. The team was working on tool use and on making models better at competitive programming. He recalls that achieving IOI Gold felt completely unreachable at the time ('we could all just go on vacation, AI is solved'), yet life hasn't changed much despite reaching that milestone.
Deep dive into why the 2017-2022 era of RL research (DQN, off-policy learning, value functions) didn't pan out despite academic excitement. Ashvin explains how the community implicitly overfit to benchmarks by introducing new tunable knobs, and why academia rewards complex mathematical ideas over simple ones that actually work at scale.
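To ground that era for readers who didn't live through it, here is a minimal DQN-style temporal-difference update in PyTorch. This is an illustrative sketch of the family of methods being critiqued, not code from the episode; note how many free hyperparameters (the 'tunable knobs') even a toy version exposes.

```python
# Minimal off-policy DQN-style update: replay buffer + target network.
# Every constant marked "knob" is a tunable hyperparameter of the kind that
# made it easy to overfit results to a given benchmark.
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)  # knob
gamma = 0.99                                               # knob
replay = deque(maxlen=10_000)                              # knob

def td_update(batch_size=32):                              # knob
    """One off-policy TD update on transitions replayed from the buffer."""
    batch = random.sample(list(replay), batch_size)
    s = torch.stack([b[0] for b in batch])
    a = torch.tensor([b[1] for b in batch])
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s2 = torch.stack([b[3] for b in batch])
    done = torch.tensor([b[4] for b in batch], dtype=torch.float32)
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # bootstrap target from the frozen network
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Dummy transitions so the update runs end to end.
for _ in range(1_000):
    replay.append((torch.randn(obs_dim), random.randrange(n_actions),
                   random.random(), torch.randn(obs_dim), random.random() < 0.1))

for step in range(100):
    td_update()
    if step % 50 == 0:  # target-sync period: yet another knob
        target_net.load_state_dict(q_net.state_dict())
```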
Ashvin reveals how OpenAI's reasoning models emerged from long-term conviction in RL dating back to Dota (2017), driven by Ilya Sutskever and Jakob Pachocki. The breakthrough came in 2023, when pretrained models became good enough: early prototypes on small models produced surprisingly accurate reasoning traces, which convinced leadership to scale up massively.
Critical insight on RL's limitations: it's a 'weird, funny tool' that doesn't generalize well beyond its training distribution. Models can completely dominate in-distribution tasks (like IOI) yet struggle with general programming jobs. The solution isn't smarter models but bringing economically useful tasks into the training distribution through product design.
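As a toy illustration of that point (tasks, features, and reward rules here are all hypothetical), the REINFORCE loop below treats the task mixture itself as the design lever: the policy becomes strong on whatever distribution it samples from, so raising the weight on product tasks, rather than adding capacity, is what makes that strength economically useful.

```python
# Toy REINFORCE loop: RL optimizes whatever task distribution you sample,
# so product design (the task mixture) decides where the model gets good.
import torch
import torch.nn as nn

n_features, n_actions = 8, 4
policy = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-3)

def sample_task(product_weight=0.8):
    """The mixture is the lever: raising product_weight pulls economically
    useful work into the training distribution."""
    kind = "product" if torch.rand(()) < product_weight else "benchmark"
    x = torch.randn(n_features)
    # Hypothetical ground truth: each task kind keys reward to a different feature.
    correct = int(x[0] > 0) if kind == "benchmark" else 2 + int(x[1] > 0)
    return x, correct

for step in range(500):
    x, correct = sample_task()
    dist = torch.distributions.Categorical(logits=policy(x))
    action = dist.sample()
    reward = 1.0 if action.item() == correct else 0.0
    loss = -dist.log_prob(action) * reward  # reinforce rewarded actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```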
Ashvin's perspective on the Sam Altman firing and rehiring crisis. He signed the letter to reinstate Sam but was genuinely conflicted about governance structures. He questions whether nonprofit boards or traditional corporate boards (representing pension funds) are better for AGI governance, noting that we haven't solved governance even for unhealthy food or social media.
Why Ashvin left OpenAI's effectively unlimited resources for Cursor's 20-25 person ML team. Cursor can co-design products with models in ways impossible at larger orgs, exemplified by the online Tab model, whose policy is updated every two hours. The key advantage is bringing the entire software engineering process (not just coding) into the product for RL training.
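The two-hour cadence implies a serving loop roughly like the sketch below. Everything here, the acceptance-weighted objective, the log format, the checkpoint handoff, is an assumption for illustration, not Cursor's actual infrastructure.

```python
# Hypothetical online-update loop for a completion policy: each cycle folds
# fresh deployment logs into a training step and publishes a new checkpoint.
import time  # only needed for the sleep at the bottom

import torch
import torch.nn as nn

UPDATE_INTERVAL_S = 2 * 60 * 60  # the "every two hours" cadence from the episode

model = nn.Linear(16, 4)  # stand-in for the tab-completion policy
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def fetch_recent_interactions(n=256):
    """Stand-in for deployment logs: context, suggestion served, acceptance."""
    x = torch.randn(n, 16)
    action = torch.randint(0, 4, (n,))
    accepted = (torch.rand(n) < 0.5).float()  # 1.0 = user kept the suggestion
    return x, action, accepted

def train_step():
    x, action, accepted = fetch_recent_interactions()
    logp = torch.log_softmax(model(x), dim=-1).gather(1, action.unsqueeze(1)).squeeze(1)
    loss = -(accepted * logp).mean()  # upweight behavior users accepted
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

for cycle in range(3):  # in production this loop never ends
    train_step()
    torch.save(model.state_dict(), f"policy_cycle_{cycle}.pt")  # "publish"
    # time.sleep(UPDATE_INTERVAL_S)  # enable to wait out the real cadence
```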
Discussion of continual learning as potentially paradigm-shifting in the next year. The key insight: models are trained on trillions of tokens, yet should be able to learn from the millions of tokens they see in deployment without forgetting. Explores the 'hard drive vs. CPU' view of neural networks and whether weights are for memorization or for computation.
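One standard recipe for the 'millions of deployment tokens without forgetting' problem is rehearsal: mix a slice of the original training distribution into every fine-tuning batch. A toy sketch follows; the mixing ratio, the two-layer model, and the synthetic data are placeholders, not anything proposed in the episode.

```python
# Toy continual-learning step with rehearsal: blend deployment tokens with a
# replay sample of pretraining-style data so new learning doesn't overwrite
# old capabilities (catastrophic forgetting).
import torch
import torch.nn as nn

vocab, dim = 1_000, 64
model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

REPLAY_RATIO = 0.25  # fraction of each batch drawn from the old distribution

def batch_from(source, n):
    """Stand-in for (token, next-token) pairs; deployment data occupies a
    different slice of the vocabulary than the original training data."""
    lo, hi = (0, vocab // 2) if source == "pretraining" else (vocab // 2, vocab)
    return torch.randint(lo, hi, (n,)), torch.randint(lo, hi, (n,))

def continual_step(batch_size=64):
    n_replay = int(batch_size * REPLAY_RATIO)
    x_new, y_new = batch_from("deployment", batch_size - n_replay)
    x_old, y_old = batch_from("pretraining", n_replay)  # rehearsal sample
    x, y = torch.cat([x_new, x_old]), torch.cat([y_new, y_old])
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

for _ in range(200):
    continual_step()
```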