[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
From creating SWE-bench in a Princeton basement to shipping CodeClash, SWE-bench Multimodal, and SWE-bench Multilingual, John Yang has spent the last year and a half watching his benchmark become the de facto industry standard.
John Yang discusses the evolution of SWE-bench from an ignored benchmark at its October 2023 release to the industry standard after Devin's launch, covering new variants (Multimodal, and Multilingual with 9 languages across 40 repos) and his latest work on CodeClash, a tournament-based evaluation framework for long-horizon agent development. The conversation explores the limitations of unit tests as verification, the challenge of creating realistic benchmarks that balance autonomy with human interaction, and emerging trends in code evaluation, including domain-specific benchmarks and codebase understanding.
John recounts how SWE-bench launched in October 2023 to little initial traction until Cognition's Devin demo catalyzed adoption. Cognition's Walden Yan emailed him just two weeks before the public launch, marking the start of the benchmark's rise to de facto industry standard. The discussion covers the evolution from the original, Django-heavy benchmark to newer variants including SWE-bench Verified, Multimodal, and Multilingual.
John introduces CodeClash, a novel evaluation framework where multiple language models compete in programming tournaments by maintaining and improving their own codebases over time. Unlike SWE-bench's independent task instances, CodeClash evaluates consequential, long-running development where each round's performance depends on previous modifications, using programming games like Halite as initial arenas.
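A minimal sketch of that round-based structure, assuming a hypothetical interface (the names `AgentCodebase`, `play_round`, and `revise` are illustrative, not CodeClash's actual API): each model keeps editing its own persistent codebase, and every round is played with the codebases as modified so far.

```python
from dataclasses import dataclass, field


@dataclass
class AgentCodebase:
    """One model's persistent codebase, carried across tournament rounds."""
    model: str
    files: dict[str, str] = field(default_factory=dict)
    total_score: float = 0.0

    def revise(self, feedback: str) -> None:
        # Placeholder: a real run would prompt the language model with the
        # feedback from the last round and apply its edits to `self.files`.
        pass


def play_round(arena: str, entrants: list[AgentCodebase]) -> dict[str, float]:
    # Placeholder: run one arena match (e.g. a Halite game) between the
    # entrants' current codebases and return a per-model score.
    return {cb.model: 0.0 for cb in entrants}


def run_tournament(arena: str, entrants: list[AgentCodebase], rounds: int = 10) -> list[AgentCodebase]:
    feedback = {cb.model: "" for cb in entrants}
    for _ in range(rounds):
        # Rounds are consequential: earlier edits shape later performance,
        # unlike SWE-bench's independent task instances.
        scores = play_round(arena, entrants)
        for cb in entrants:
            cb.total_score += scores[cb.model]
            cb.revise(feedback[cb.model])
            feedback[cb.model] = f"scored {scores[cb.model]} in {arena}"
    return sorted(entrants, key=lambda cb: cb.total_score, reverse=True)
```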
Overview of the expanding code evaluation landscape, including Ofir's group's work on performance optimization (SWEEfficiency, AlgoTune), scientific coding (SciCode), and domain-specific benchmarks. Discussion covers the spectrum from fast completion-style benchmarks (SciCode as "HumanEval but better") to expensive agentic evaluations, plus emerging areas like cybersecurity (SecBench) and SRE tasks.
Discussion of TauBench's user-simulator approach and community criticism about under-specified or impossible tasks. John suggests intentionally including impossible tasks as a cheating-detection mechanism, referencing 'Impossible Bench', which modified SWE-bench Verified tasks to test model refusal capabilities and found that models consistently claim success even on impossible tasks.
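A rough sketch of that detection signal, assuming a simple result record (the `TaskResult` fields are illustrative, not Impossible Bench's actual schema): seed a benchmark with tasks known to be unsolvable and measure how often the model still claims success.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    is_impossible: bool    # task was deliberately made unsolvable
    claimed_success: bool  # the agent reported that it solved the task


def false_success_rate(results: list[TaskResult]) -> float:
    """Fraction of known-impossible tasks on which the agent still claimed success.

    A high rate suggests the agent is gaming the benchmark (or its own
    self-report) rather than recognizing and refusing unsolvable tasks.
    """
    impossible = [r for r in results if r.is_impossible]
    if not impossible:
        return 0.0
    return sum(r.claimed_success for r in impossible) / len(impossible)
```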
Debate on the future direction of code evaluation: long-running autonomous agents (5-24 hour tasks) versus interactive, collaborative approaches. John acknowledges the tension between academic benchmarking goals and real-world usage patterns, where Cognition emphasizes rapid back-and-forth interaction rather than extended autonomous operation, since tasks are inevitably under-specified.
Discussion of the emerging focus on codebase understanding and retrieval as a critical capability for human-AI collaboration. Cognition is pushing codebase understanding both to help humans comprehend their own code and to enable automatic context engineering for LLMs, though benchmarking understanding remains challenging beyond simple trivia questions that quickly saturate.