We’re told that AI progress is slowing down, that pre-training has hit a wall, that scaling laws are running out of road. Yet we’re releasing this episode in the middle of a wild couple of weeks that ...
Łukasz Kaiser, co-author of the Transformer paper and a research scientist at OpenAI, explains why the narrative of an AI slowdown is fundamentally wrong. The conversation shows how reasoning models represent a paradigm shift comparable to the Transformer itself, delivering exponential capability gains through reinforcement learning rather than through scaling pre-training alone. Kaiser offers rare insight into the engineering realities of frontier AI development, from GPU allocation to the surprising limitations that remain (such as failing simple first-grade math puzzles), and argues that the combination of improved reasoning, multimodal capabilities, and tool use will drive the next wave of AI advancement.
Kaiser dismantles the AI slowdown narrative, explaining that while pre-training has reached the upper part of its S-curve, reasoning models represent an entirely new paradigm delivering better results at the same cost. He draws parallels to Moore's Law, where smooth exponential progress masks multiple underlying technology transitions, and explains how reinforcement learning for reasoning is still in its early, high-growth phase.
Kaiser describes the large number of 'obvious' improvements still available in frontier AI development, spanning engineering infrastructure, data quality, synthetic data generation, and multimodal capabilities. He emphasizes that much of this work is unglamorous engineering rather than breakthrough science, but that it will deliver substantial capability gains.
Kaiser provides a technical explanation of reasoning models, distinguishing them from base LLMs through their chain-of-thought generation and reinforcement learning training. He explains how RL allows models to learn verification and self-correction strategies, but currently requires verifiable domains like math and coding, limiting broader application.
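To make the distinction concrete, here is a minimal, self-contained toy in Python. It assumes a caricature "policy" over reasoning strategies rather than a real language model: the loop samples a chain of thought, checks the final answer against a verifier, and upweights whatever led to a verified result. This is an illustration of why verifiable domains like math and coding are the natural starting point for this kind of RL, not a description of OpenAI's training stack.

```python
import random

# Hedged, self-contained toy of RL on a verifiable task (exact-match arithmetic).
# This is NOT OpenAI's pipeline; it only shows the shape of the loop: sample a
# chain of thought, verify the final answer, reinforce whatever verified.

PROBLEMS = [{"prompt": "2 + 3", "expected": "5"}, {"prompt": "7 * 6", "expected": "42"}]

def sample_chain_of_thought(policy, prompt):
    """Stand-in for model sampling: pick a 'reasoning strategy', produce an answer."""
    strategy = random.choices(list(policy), weights=list(policy.values()))[0]
    if strategy == "compute step by step":
        answer = str(eval(prompt))           # the careful strategy gets it right
    else:
        answer = str(random.randint(0, 99))  # the guessing strategy usually fails
    return strategy, answer

def verify(problem, answer):
    """The reward is checkable; this is what makes math/coding tractable for RL."""
    return 1.0 if answer == problem["expected"] else 0.0

def rl_step(policy, problems, samples=8):
    for problem in problems:
        for _ in range(samples):
            strategy, answer = sample_chain_of_thought(policy, problem["prompt"])
            # Crude stand-in for a policy-gradient update: upweight strategies
            # whose answers passed verification.
            policy[strategy] += 0.1 * verify(problem, answer)
    return policy

policy = {"compute step by step": 1.0, "guess": 1.0}
for _ in range(20):
    policy = rl_step(policy, PROBLEMS)
print(policy)  # weight drifts toward the strategy that yields verifiable answers
```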
Kaiser shares the surprisingly distributed origin of the Transformer paper, revealing that the eight co-authors were never all in the same room at once. He explains how different researchers approached the problem from multiple angles: attention mechanisms, knowledge storage, and the critical engineering work needed to make training actually function.
Kaiser describes his transition from Google Brain (which grew from ~40 to 4,000 people during his tenure) to OpenAI during COVID, motivated by the desire to work in smaller teams and the challenges of remote work at a massive organization. He notes that frontier AI labs are more similar to each other than different, with the real gap being between academia and industry.
Kaiser explains how the economics of serving billions of users fundamentally changed AI development priorities. OpenAI shifted from training only the largest possible models to optimizing for cost-effectiveness through smaller models and distillation, while maintaining the ability to scale pre-training when economically justified.
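As a rough illustration of the distillation idea Kaiser mentions, the sketch below shows the standard textbook recipe (Hinton-style soft targets): a small student model is trained to match the output distribution of a large teacher, so serving cost drops while most of the capability is retained. It assumes PyTorch and random logits as stand-ins, and makes no claim about how OpenAI actually distills its production models.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of knowledge distillation (generic recipe, not OpenAI's internals).

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Softening both distributions with a temperature exposes the teacher's
    # relative preferences over tokens, not just its top-1 choice.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student; scaled by T^2 as in Hinton et al.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature**2

# Toy usage with random logits over a 50k-token vocabulary.
teacher_logits = torch.randn(4, 50_000)                       # frozen large model's outputs
student_logits = torch.randn(4, 50_000, requires_grad=True)   # small model being trained
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```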
Kaiser reveals that the evolution from GPT-4 to GPT-5.1 involved less fundamental change than users might think. The biggest shift was adding reasoning via RL and synthetic data, while 5.1 specifically represents mostly post-training improvements around safety, tone control, and reducing hallucinations through better tool use and verification.
Kaiser demonstrates a critical limitation of current reasoning models through a striking example: frontier models that achieve gold medals at Mathematical Olympiads cannot solve simple first-grade visual math puzzles. This 'jagged' capability profile reveals how reasoning models excel in narrow, well-trained domains but struggle with basic multimodal reasoning and in-context learning.
Kaiser frames the central question in AI research: whether reasoning capabilities will be sufficient to achieve human-like generalization, or if fundamentally different architectural approaches are needed. He emphasizes that we won't know until we've exhausted current approaches, comparing it to 'driving fast in a fog.'
Kaiser explains the technical challenges and solutions behind GPT-5.1 Codex Max, designed for week-long software engineering tasks. The system uses context compaction (summarization and selective forgetting) to operate across millions of tokens, while training prevents the model from getting lost in long feedback loops.
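The sketch below illustrates one plausible shape of context compaction, assuming a toy token counter and a placeholder summarizer; the real Codex Max mechanism is not public at this level of detail. The point is only the structure: once the transcript exceeds a budget, older steps are collapsed into a summary while recent steps stay verbatim.

```python
# Hedged sketch of context compaction for a long-running agent (illustrative only).

TOKEN_BUDGET = 8_000   # assumed budget for this toy example
KEEP_RECENT = 20       # most recent messages kept verbatim

def count_tokens(text: str) -> int:
    # Crude proxy: a real system would use the model's tokenizer.
    return len(text.split())

def summarize(messages: list[str]) -> str:
    # Placeholder: in practice this would be a model call that keeps goals,
    # decisions, and open TODOs, and drops tool output that is no longer needed.
    return f"[summary of {len(messages)} earlier steps]"

def compact(history: list[str]) -> list[str]:
    total = sum(count_tokens(m) for m in history)
    if total <= TOKEN_BUDGET:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    return [summarize(old)] + recent

# Simulate a long task whose raw transcript would far exceed the budget.
history: list[str] = []
for step in range(10_000):
    history.append(f"step {step}: ran tests, edited files, got feedback " * 3)
    history = compact(history)
print(len(history), sum(count_tokens(m) for m in history))
```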
Kaiser addresses concerns about AI replacing all work by pointing to persistent limitations (the first-grade math failures), the paradox of the translation industry (which has grown despite automation), and the fundamental question of trust. He argues that while some jobs will change dramatically, there will always be things people want humans to do, especially in high-stakes scenarios.
What’s Next for AI? OpenAI’s Łukasz Kaiser (Transformer Co-Author)