[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto industry standard.
John Yang discusses the evolution of SWE-bench from an ignored benchmark at its October 2023 release to the industry standard after Devin's launch, covering new variants (Multimodal, and Multilingual with 9 languages across 40 repos) and his latest work on CodeClash, a tournament-based evaluation framework for long-horizon agent development. The conversation explores the limitations of unit tests as verification, the challenge of creating realistic benchmarks that balance autonomy with human interaction, and emerging trends in code evaluation, including domain-specific benchmarks and codebase understanding.
John recounts how SWE-bench launched in October 2023 to little initial traction until Cognition's Devin demo catalyzed adoption. Cognition's Walden Yan emailed him just two weeks before the public launch, marking the start of the benchmark's rise to de facto industry standard. The discussion covers the evolution from the original, Django-heavy benchmark to newer variants including SWE-bench Verified, Multimodal, and Multilingual.
John introduces CodeClash, a novel evaluation framework where multiple language models compete in programming tournaments by maintaining and improving their own codebases over time. Unlike SWE-bench's independent task instances, CodeClash evaluates consequential, long-running development where each round's performance depends on previous modifications, using programming games like Halite as initial arenas.
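A minimal sketch of that round-based structure, assuming a hypothetical interface (the names `AgentCodebase`, `play_round`, and `revise` are illustrative, not CodeClash's actual API): each model keeps editing its own persistent codebase, and every round is played with the codebases as modified so far.

```python
from dataclasses import dataclass, field


@dataclass
class AgentCodebase:
    """One model's persistent codebase, carried across tournament rounds."""
    model: str
    files: dict[str, str] = field(default_factory=dict)
    total_score: float = 0.0

    def revise(self, feedback: str) -> None:
        # Placeholder: a real run would prompt the language model with the
        # feedback from the last round and apply its edits to `self.files`.
        pass


def play_round(arena: str, entrants: list[AgentCodebase]) -> dict[str, float]:
    # Placeholder: run one arena match (e.g. a Halite game) between the
    # entrants' current codebases and return a per-model score.
    return {cb.model: 0.0 for cb in entrants}


def run_tournament(arena: str, entrants: list[AgentCodebase], rounds: int = 10) -> list[AgentCodebase]:
    feedback = {cb.model: "" for cb in entrants}
    for _ in range(rounds):
        # Rounds are consequential: earlier edits shape later performance,
        # unlike SWE-bench's independent task instances.
        scores = play_round(arena, entrants)
        for cb in entrants:
            cb.total_score += scores[cb.model]
            cb.revise(feedback[cb.model])
            feedback[cb.model] = f"scored {scores[cb.model]} in {arena}"
    return sorted(entrants, key=lambda cb: cb.total_score, reverse=True)
```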
Overview of the expanding code evaluation landscape, including Ofir's group's work on performance optimization (SWEEfficiency, AlgoTune), scientific coding (SciCode), and domain-specific benchmarks. Discussion covers the spectrum from fast completion-style benchmarks (SciCode as "HumanEval but better") to expensive agentic evaluations, plus emerging areas like cybersecurity (SecBench) and SRE tasks.
Discussion of TauBench's user-simulator approach and community criticism about under-specified or impossible tasks. John suggests intentionally including impossible tasks as a cheating-detection mechanism, referencing 'Impossible Bench', which modified SWE-bench Verified tasks to test model refusal capabilities and found that models consistently claim success even on impossible tasks.
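A rough sketch of that detection signal, assuming a simple result record (the `TaskResult` fields are illustrative, not Impossible Bench's actual schema): seed a benchmark with tasks known to be unsolvable and measure how often the model still claims success.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    is_impossible: bool    # task was deliberately made unsolvable
    claimed_success: bool  # the agent reported that it solved the task


def false_success_rate(results: list[TaskResult]) -> float:
    """Fraction of known-impossible tasks on which the agent still claimed success.

    A high rate suggests the agent is gaming the benchmark (or its own
    self-report) rather than recognizing and refusing unsolvable tasks.
    """
    impossible = [r for r in results if r.is_impossible]
    if not impossible:
        return 0.0
    return sum(r.claimed_success for r in impossible) / len(impossible)
```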
Debate on the future direction of code evaluation: long-running autonomous agents (5-24 hour tasks) versus interactive, collaborative approaches. John acknowledges the tension between academic benchmarking goals and real-world usage patterns, where Cognition emphasizes rapid back-and-forth interaction rather than extended autonomous operation, since tasks are inevitably under-specified.
Discussion of the emerging focus on codebase understanding and retrieval as a critical capability for human-AI collaboration. Cognition is pushing codebase understanding both to help humans comprehend their own code and to enable automatic context engineering for LLMs, though benchmarking understanding remains challenging beyond simple trivia questions that quickly saturate.