Edwin Chen is the founder and CEO of Surge AI, the data infrastructure company behind nearly every major frontier model. Surge works with OpenAI, Anthropic, Meta, and Google, providing the high-qualit...
Edwin Chen, CEO of Surge AI, reveals critical insights from working with all major frontier labs (OpenAI, Anthropic, Meta, Google). He exposes how optimizing for popular benchmarks like LMArena is 'basically optimizing for clickbait,' shares how one lab's models regressed for 6-12 months without detection due to poor measurement, and explains why frontier labs are taking surprisingly divergent paths to AGI. Chen's evolved view: there won't be 'one model to rule them all' but rather a constellation of models with different theses and personalities.
Chen explains how optimizing for LMArena leads to models with more emojis, formatting, and verbosity rather than accuracy. Users spend 1-2 seconds evaluating responses, preferring longer, flashier outputs over correct ones. He provides concrete examples of mathematically wrong responses being preferred, and how this optimization creates models that hallucinate more because users don't fact-check.
Chen reveals a case study where a major lab's coding models actually got worse over 6-12 months because expert coders weren't executing code to verify correctness. The training data was full of 'flowery language and grandiose claims' but subtle bugs. Without proper measurements, the team had no quantitative evidence of regression while the rest of the industry progressed.
Chen explains how frontier labs discovered that optimizing for academic benchmarks creates models that excel at narrow tasks but regress on real-world problems. He compares it to optimizing for the SAT - you get good at a specific test but not at complex real-world problem solving. Labs would see impressive benchmark scores while models actually got worse.
Chen breaks down the four critical components of high-quality evaluators: domain expertise (e.g., algebraic topology, PyTorch), sophistication/taste (well-designed code, great prose), creativity in prompt generation spanning the full distribution of use cases, and ability to follow complex instructions. He emphasizes that creating diverse, creative prompts is surprisingly difficult.
Chen describes Surge's work on RL environments over the past one to two years, particularly with Meta's agents team. Creating effective RL environments requires building rich simulated worlds populated with people, businesses, tools, and interactions (Slack, email, calendar), plus infrastructure for models to execute within these worlds, and deep measurement of model trajectories to understand failures.
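The three ingredients Chen names (a simulated world with tools, infrastructure for the model to act in it, and measurement of full trajectories) can be sketched as a toy environment. Everything below (class names, the `send_email`/`schedule` tools, the reward logic) is a hypothetical illustration for clarity, not Surge's or Meta's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Records every tool call so failures can be diagnosed step by step."""
    steps: list = field(default_factory=list)

    def log(self, tool, args, result):
        self.steps.append({"tool": tool, "args": args, "result": result})

class SimulatedWorkplace:
    """A minimal simulated world: an inbox, a calendar, and tools to act on them."""

    def __init__(self):
        self.inbox = []        # simulated email
        self.calendar = {}     # date -> event
        self.trajectory = Trajectory()

    # --- tools the model is allowed to call ---
    def send_email(self, to, body):
        self.inbox.append({"to": to, "body": body})
        self.trajectory.log("send_email", {"to": to}, "sent")
        return "sent"

    def schedule(self, date, event):
        self.calendar[date] = event
        self.trajectory.log("schedule", {"date": date, "event": event}, "ok")
        return "ok"

    def reward(self, goal_date):
        # "Deep measurement": score the trajectory against the task goal,
        # not just the surface text of the model's final message.
        return 1.0 if goal_date in self.calendar else 0.0

env = SimulatedWorkplace()
env.send_email("alice@example.com", "Can we meet Friday?")
env.schedule("2025-06-06", "Sync with Alice")
print(env.reward("2025-06-06"), len(env.trajectory.steps))  # → 1.0 2
```

The point of the sketch: the reward inspects the world state and the logged trajectory, which is what lets a lab detect regressions that a quick glance at the model's prose would miss.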
Chen argues against credential-based hiring, using Hemingway (no college degree) as an example. Surge uses a meritocratic platform measuring millions of signals daily on worker output rather than degrees. He notes that even MIT CS grads often can't code well, and credentialed workers frequently try to game the system rather than create quality data.
Chen reveals surprising divergence among frontier labs in both training approaches and objective functions. Key example: OpenAI optimizes for user engagement and session length, while Anthropic optimizes for productivity and GDP/time savings. Some labs completely ignore LMArena while others feel pressured to optimize for it, leading to different model personalities and capabilities.
Chen's biggest mind change: he previously believed in one superintelligent model that could context-switch for everything. Now believes every company needs a thesis on what AI should do, creating models with different personalities and biases. Compares to Google vs. Facebook - each would build fundamentally different social media or search products based on their beliefs.
Chen reveals that 50%+ of Surge's work is already in non-text domains (video, robotics, bio). For video, quality means going beyond robotic correctness to taste and sophistication. He uses the analogy of Scorsese vs. a high school graduate filming a fish - both can follow instructions, but one creates something that 'blows your mind.'
Chen proposes a novel objective function for AI: a month later, would users be happy they had this interaction? Did it change their life? Examples include serendipitous vacation discoveries or medical insights users wouldn't have found otherwise. He draws parallels between AI optimization challenges and long-standing societal problems like SAT limitations.
Ep 80: CEO of Surge AI Edwin Chen on Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste