This episode is a little different from our usual fare: it’s a conversation with our head of AI training, Alex Duffy, about Good Start Labs, a company he incubated inside Every. Today, Good Start Labs i...
Alex Duffy, head of AI training at Every, discusses spinning out Good Start Labs, a company with $3.6 million in funding that uses games to evaluate and train AI models. The conversation explores how games like Diplomacy reveal model personalities and capabilities better than static benchmarks do, why prompt optimization can matter more than raw model performance, and how games can bridge the gap between AI capabilities and human understanding while making models more trustworthy and effective.
Alex explains how Good Start Labs emerged from his AI training work at Every. The team built an AI version of Diplomacy whose launch stream drew 50,000 viewers on Twitch and whose write-up became Every's most-read article of 2024, demonstrating that games can serve as both evaluation tools and training arenas for AI models.
Analysis of how different AI models behaved in Diplomacy reveals distinct personalities and strategies: o3 and Llama 4 were the biggest schemers, Claude was too honest to win, and Gemini 2.5 Pro excelled at execution. The game tests everything from structured outputs to long-term strategy and deception.
Building AI agents for games involves three infinite problem spaces: information representation, tool design, and prompting. GPT-5 showed the biggest performance jump from baseline to optimized prompts of any model, demonstrating that prompt engineering is an underrated skill that can dramatically change outcomes.
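The three problem spaces above can be illustrated with a minimal sketch. All names here are hypothetical for illustration, not Good Start Labs' actual code: the point is only that a game agent forces separate choices about how to represent state, which tools (actions) to expose, and how the prompt ties the two together.

```python
# Minimal sketch of the three design spaces for a game-playing agent:
# (1) information representation, (2) tool design, (3) prompting.
# All names are hypothetical illustrations, not Good Start Labs' code.
import json

# (1) Information representation: how much game state to expose, in what form.
def render_state(state: dict) -> str:
    """Serialize a Diplomacy-style board snapshot for the model."""
    return json.dumps(state, indent=2, sort_keys=True)

# (2) Tool design: which actions the model may take, described as schemas.
TOOLS = [
    {"name": "submit_orders", "description": "Submit this turn's unit orders.",
     "parameters": {"orders": "list[str]"}},
    {"name": "send_message", "description": "Send a private message to another power.",
     "parameters": {"to": "str", "text": "str"}},
]

# (3) Prompting: the instructions that tie state and tools together.
def build_prompt(power: str, state: dict) -> str:
    tool_list = "\n".join(f"- {t['name']}: {t['description']}" for t in TOOLS)
    return (
        f"You are playing {power} in Diplomacy.\n"
        f"Current board state:\n{render_state(state)}\n"
        f"Available tools:\n{tool_list}\n"
        "Plan your turn, then call exactly one tool."
    )

state = {"year": 1901, "phase": "Spring", "units": {"France": ["A PAR", "F BRE"]}}
prompt = build_prompt("France", state)
print(prompt)
```

Each of the three pieces can be varied independently, which is what makes the search space effectively infinite: a richer state encoding, a different tool set, or a reworded prompt can each change the agent's behavior without touching the underlying model.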
Good Start Labs works with companies like Cohere and OpenAI to evaluate models through rich game environments, then uses those same games as training arenas. Research shows vision models trained on games can outperform models trained directly on tasks like math when prompted to think of games as structured problems.
Alex shares how games taught him critical skills: RuneScape taught him about markets and scam detection, while the 24 Game taught multiplication. Good Start Labs is launching a prompting tournament in which Diplomacy champions and AI experts compete to optimize their agents, exploring the infinite prompt space collaboratively.
Discussion of how to evaluate trustworthiness when models should lie in games but not in other contexts. Companies can tune their training approach—either reinforcing honesty or optimizing for performance. The key is having flexible game environments where you can adjust rules and reinforcement to match your values.
The next game will be a Cards Against Humanity-style subjective game targeting humor—a known weakness in current models. The format may involve people prompting models to be funny or playing against them, with voting to determine what's actually humorous.
Alex shares concerns that AI makes experts 10-100x more powerful but also does the work of juniors, potentially eliminating the path to expertise. Dan counters with Alex's own story: he joined Every as a weak writer and made two years of progress in three to four months through AI-assisted learning and mentorship.
Alex highlights Google's constrained innovation (fast, reliable search requirements driving quality), Genie's rendering capabilities, and three areas seeing massive near-term AI impact: software (compilers enable RL), life sciences (AlphaFold reduced 6-year PhD work to 20 minutes), and education (AI tutors for curious kids).
Discussion of VR gaming experiences (Population One as physical Fortnite), Ray-Ban Meta glasses as more human technology (hands-free, present in moments), and anticipation for GTA VI as a billion-dollar cultural moment. Emphasis on shared experiences over personalized AI-generated content.
We Taught AI to Play Games—Now It’s a $3.6 Million Company