Edwin Chen is the founder and CEO of Surge AI, the data infrastructure company behind nearly every major frontier model. Surge works with OpenAI, Anthropic, Meta, and Google, providing the high-qualit...
Edwin Chen, CEO of Surge AI, reveals critical insights from working with all major frontier labs (OpenAI, Anthropic, Meta, Google). He exposes how optimizing for popular benchmarks like LMArena is 'basically optimizing for clickbait,' shares how one lab's models regressed for 6-12 months without detection due to poor measurement, and explains why frontier labs are taking surprisingly divergent paths to AGI. Chen's evolved view: there won't be 'one model to rule them all' but rather a constellation of models with different theses and personalities.
Chen explains how optimizing for LMArena leads to models with more emojis, formatting, and verbosity rather than accuracy. Users spend 1-2 seconds evaluating responses, preferring longer, flashier outputs over correct ones. He provides concrete examples of mathematically wrong responses being preferred, and how this optimization creates models that hallucinate more because users don't fact-check.
Chen reveals a case study where a major lab's coding models actually got worse over 6-12 months because expert coders weren't executing code to verify correctness. The training data was full of 'flowery language and grandiose claims' but subtle bugs. Without proper measurements, the team had no quantitative evidence of regression while the rest of the industry progressed.
Chen explains how frontier labs discovered that optimizing for academic benchmarks creates models that excel at narrow tasks but regress on real-world problems. He compares it to optimizing for the SAT - you get good at a specific test but not at complex real-world problem solving. Labs would see impressive benchmark scores while models actually got worse.
Chen breaks down the four critical components of high-quality evaluators: domain expertise (e.g., algebraic topology, PyTorch), sophistication/taste (well-designed code, great prose), creativity in prompt generation spanning the full distribution of use cases, and ability to follow complex instructions. He emphasizes that creating diverse, creative prompts is surprisingly difficult.
Chen describes Surge's work on RL environments over the past one to two years, particularly with Meta's agents team. Creating effective RL environments requires building rich simulated worlds populated with people, businesses, tools, and interactions (Slack, email, calendar), plus infrastructure for models to execute within these worlds, and deep measurement of model trajectories to understand failures.
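The three ingredients Chen names (a simulated world with tools, infrastructure for the model to act in it, and measurement of full trajectories) can be sketched as a toy environment. Everything below (class names, the `send_email`/`schedule` tools, the reward logic) is a hypothetical illustration for clarity, not Surge's or Meta's actual design:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Records every tool call so failures can be diagnosed step by step."""
    steps: list = field(default_factory=list)

    def log(self, tool, args, result):
        self.steps.append({"tool": tool, "args": args, "result": result})

class SimulatedWorkplace:
    """A minimal simulated world: an inbox, a calendar, and tools to act on them."""

    def __init__(self):
        self.inbox = []        # simulated email
        self.calendar = {}     # date -> event
        self.trajectory = Trajectory()

    # --- tools the model is allowed to call ---
    def send_email(self, to, body):
        self.inbox.append({"to": to, "body": body})
        self.trajectory.log("send_email", {"to": to}, "sent")
        return "sent"

    def schedule(self, date, event):
        self.calendar[date] = event
        self.trajectory.log("schedule", {"date": date, "event": event}, "ok")
        return "ok"

    def reward(self, goal_date):
        # "Deep measurement": score the trajectory against the task goal,
        # not just the surface text of the model's final message.
        return 1.0 if goal_date in self.calendar else 0.0

env = SimulatedWorkplace()
env.send_email("alice@example.com", "Can we meet Friday?")
env.schedule("2025-06-06", "Sync with Alice")
print(env.reward("2025-06-06"), len(env.trajectory.steps))  # → 1.0 2
```

The point of the sketch: the reward inspects the world state and the logged trajectory, which is what lets a lab detect regressions that a quick glance at the model's prose would miss.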
Chen argues against credential-based hiring, using Hemingway (no college degree) as an example. Surge uses a meritocratic platform measuring millions of signals daily on worker output rather than degrees. He notes that even MIT CS grads often can't code well, and credentialed workers frequently try to game the system rather than create quality data.
Chen reveals surprising divergence among frontier labs in both training approaches and objective functions. Key example: OpenAI optimizes for user engagement and session length, while Anthropic optimizes for productivity and GDP/time savings. Some labs completely ignore LMArena while others feel pressured to optimize for it, leading to different model personalities and capabilities.
Chen's biggest mind change: he previously believed in one superintelligent model that could context-switch for everything. Now believes every company needs a thesis on what AI should do, creating models with different personalities and biases. Compares to Google vs. Facebook - each would build fundamentally different social media or search products based on their beliefs.
Chen reveals that 50%+ of Surge's work is already in non-text domains (video, robotics, bio). For video, quality means going beyond robotic correctness to taste and sophistication. He uses the analogy of Scorsese vs. a high school graduate filming a fish - both can follow instructions, but one creates something that 'blows your mind.'
Chen proposes a novel objective function for AI: a month later, would users be happy they had this interaction? Did it change their life? Examples include serendipitous vacation discoveries or medical insights users wouldn't have found otherwise. He draws parallels between AI optimization challenges and long-standing societal problems like SAT limitations.
Ep 80: CEO of Surge AI Edwin Chen on Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste