From the frontlines of OpenAI's Codex and GPT-5 training teams, Bryan and Bill are building the future of AI-powered coding, where agents don't just autocomplete; they architect, refactor, and ship.
Bryan Fioca and Bill Chen from OpenAI's Codex and GPT-5 training teams discuss the launch of Codex Max, a long-running coding agent that can work for 24+ hours. They reveal how personality traits like communication, planning, and tool preferences are trained into models, why Codex is optimized for specific tools (it literally prefers 'rg' over 'grep'), and how the abstraction layer is moving from models to agents. The conversation covers multi-agent architectures, the importance of real-world evals, and a vision where coding agents become trusted teammates capable of handling complex refactors and integrations autonomously.
Deep dive into how OpenAI trains GPT-5 and Codex with specific behavioral characteristics beyond raw capability. The team focuses on personality traits like communication (keeping users informed), planning (strategy before execution), and checking work—essentially teaching models software engineering best practices as behaviors that can be measured and graded.
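A toy illustration of the idea that these behaviors can be "measured and graded": a rubric over an agent transcript that rewards narrating progress, planning before editing, and testing the changes. The step types, trait names, and weights below are hypothetical stand-ins, not OpenAI's actual training graders.

```python
from dataclasses import dataclass

@dataclass
class Step:
    kind: str   # "plan", "message", "edit", or "test"
    text: str

def grade_behaviors(trajectory: list[Step]) -> dict[str, float]:
    """Score a transcript on the three traits discussed in the episode."""
    kinds = [s.kind for s in trajectory]
    first = {k: kinds.index(k) for k in set(kinds)}
    return {
        # Communication: the agent narrated progress at least once.
        "communication": 1.0 if "message" in kinds else 0.0,
        # Planning: a plan appeared before the first code edit.
        "planning": 1.0 if "plan" in first and "edit" in first
                         and first["plan"] < first["edit"] else 0.0,
        # Checking work: a test run accompanied the edits.
        "checks_work": 1.0 if "edit" in kinds and "test" in kinds else 0.0,
    }

def reward(trajectory: list[Step]) -> float:
    # Illustrative weights; a real grader would be far richer.
    weights = {"communication": 0.3, "planning": 0.3, "checks_work": 0.4}
    scores = grade_behaviors(trajectory)
    return sum(weights[t] * scores[t] for t in weights)

run = [
    Step("plan", "1) reproduce the bug 2) patch parser 3) run tests"),
    Step("message", "Reproducing the failure before touching code."),
    Step("edit", "fix off-by-one in parser.py"),
    Step("test", "pytest: 42 passed"),
]
print(reward(run))  # 1.0: all three behaviors present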
Codex develops specific tool preferences during training, similar to human habits. The model performs better with 'rg' (ripgrep) than 'grep' because of training patterns. Partners discovered they can improve tool call performance by naming custom tools the same way as terminal tools Codex was trained on, revealing how models can be 'bent' in unexpected ways.
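As a sketch of that naming trick, a partner's custom code-search tool can be exposed under the name "rg" with ripgrep-shaped parameters so calls land close to the model's training distribution. The spec below uses the common JSON-schema function-calling shape; the exact tool-registration API depends on your framework.

```python
# Hypothetical tool spec: named "rg" with ripgrep-shaped arguments,
# rather than e.g. "search_codebase_v2" with a bespoke signature.
# Schema shape follows the common JSON-schema function-calling
# convention; wire-up details vary by framework.
code_search_tool = {
    "type": "function",
    "name": "rg",
    "description": "Search the repository, ripgrep-style.",
    "parameters": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string", "description": "Regex to search for."},
            "path": {"type": "string", "description": "File or directory to search."},
            "flags": {
                "type": "array",
                "items": {"type": "string"},
                "description": "ripgrep flags, e.g. ['-n', '-i'].",
            },
        },
        "required": ["pattern"],
    },
}
```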
Clarification of the difference between Codex (a frontier coding model optimized for its specific harness) and the GPT-5 mainline models (more general and steerable across tools). Codex comes with firm opinions on implementation, which some partners appreciate, while GPT-5 offers broader flexibility for custom integrations.
Major trend: the abstraction layer is moving from the model level to the agent level. Instead of optimizing for every model release, developers can now plug in complete agents like Codex into platforms. This enables agents to use other agents—for example, a chatbot spawning a Codex instance to write custom plugins or integrations on demand.
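A minimal sketch of that "agents using agents" pattern: an outer assistant shells out to a Codex instance for a coding subtask. This assumes the Codex CLI offers a non-interactive `codex exec` mode; verify the subcommand and flags against your installed version.

```python
import subprocess

def delegate_to_codex(task: str, repo: str) -> str:
    """Spawn a Codex agent as a sub-agent for a coding task.

    Sketch only: the outer chatbot shells out to the Codex CLI in
    non-interactive mode and returns its output. Assumes a
    `codex exec` subcommand exists in your installed CLI.
    """
    result = subprocess.run(
        ["codex", "exec", task],
        cwd=repo,               # run inside the target repository
        capture_output=True,
        text=True,
        timeout=3600,           # long-running tasks may need more
    )
    return result.stdout

# Example: a chatbot that writes its own plugin on demand.
# print(delegate_to_codex("Write a Slack plugin that posts build failures", "./my-app"))
```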
Introduction to Codex Max, which can run for 24+ hours (tested for multiple days), manages its own context window indefinitely, and is designed to spawn sub-agents for parallel work. The 'Max' name reflects both speed and maximization—it's faster at solving problems while also capable of extended runs.
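A sketch of the fan-out/fan-in shape that sub-agent spawning implies, using a plain thread pool; `run_subagent` is a hypothetical placeholder for however the harness actually launches a Codex Max sub-agent.

```python
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str) -> str:
    # Placeholder: in a real harness this would launch a Codex instance
    # with its own context window and collect its final report.
    return f"done: {task}"

def parallel_refactor(tasks: list[str]) -> list[str]:
    # Coordinator splits a large refactor into independent pieces and
    # runs one sub-agent per piece, then gathers the results.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(run_subagent, tasks))

results = parallel_refactor([
    "migrate auth module to the new ORM",
    "migrate billing module to the new ORM",
    "update integration tests",
])
print(results)
```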
OpenAI's Applied Evals team focuses on capturing real-world use cases beyond academic benchmarks like SWE-bench. The approach treats models like PhD students who need job descriptions (prompts), mentorship, and performance reviews. Multi-turn evals are emerging as critical, with techniques like LLM-as-judge for entire trajectories and 'job interview evals' that test underspecified problem-solving.
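A hedged sketch of one of those techniques, LLM-as-judge over a whole trajectory, using the OpenAI Python SDK's chat completions API; the rubric, JSON shape, and judge model name are placeholders.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = """You are grading an agent's full multi-turn trajectory.
Score 1-5 on each axis and justify briefly:
- Did it ask clarifying questions when the task was underspecified?
- Did its plan survive contact with the codebase?
- Did it verify its own work before finishing?
Reply as JSON: {"clarifying": n, "planning": n, "verification": n}"""

def judge_trajectory(transcript: str) -> str:
    # One judge call over the entire trajectory, not per turn:
    # multi-turn behavior (recovering from a bad plan, asking for
    # missing requirements) only shows up at this granularity.
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder; any strong judge model works
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content
```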
Coding agents are breaking out of pure software development into general personal automation. Before GUIs, all computer interaction was through terminals and code—coding agents are essentially computer-use agents for the terminal. Use cases include email management, file organization, video clip extraction, and any task that can be automated through CLI tools.
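For concreteness, the video-clip example reduces to the kind of one-liner a coding agent would write and run itself; this sketch wraps a standard ffmpeg stream-copy invocation, with the agent, not the user, normally authoring it.

```python
import subprocess

def extract_clip(src: str, start: str, duration: str, dst: str) -> None:
    # Pull a short clip out of a longer recording without re-encoding:
    # -ss before -i seeks quickly, -t bounds the duration, -c copy
    # copies the streams as-is. Standard ffmpeg flags.
    subprocess.run(
        ["ffmpeg", "-ss", start, "-i", src,
         "-t", duration, "-c", "copy", dst],
        check=True,
    )

extract_clip("talk.mp4", "00:12:30", "00:00:30", "clip.mp4")
```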
Looking ahead to 2026: coding agents will become vision-native to handle applications without APIs, enabling integration with legacy systems through UI automation. The ultimate goal is democratizing access to elite-level development capabilities—every team, from small dev shops to major firms, should have access to the kind of technical expertise currently only available at top-tier companies.
⚡️GPT5-Codex-Max: Training Agents with Personality, Tools & Trust — Bryan Fioca + Bill Chen, OpenAI