Olivier Godement, Head of Product for Enterprise at OpenAI, discusses the current state of AI adoption in enterprises, focusing on GPT-5.1 and Codex releases. He reveals that while complete job automation remains challenging, specific domains like coding, customer support, and life sciences are reaching tipping points. Companies like Amgen are using AI to compress drug development timelines from months to weeks through automated regulatory documentation. The conversation explores the critical importance of scaffolding, harnesses, and evaluation frameworks, with Olivier predicting that continuous learning capabilities and standardized agent architectures will be the next major unlock for enterprise AI adoption.
Discussion of the new GPT-5.1 and Codex models, focusing on how 5.1 addressed speed concerns while maintaining intelligence by compressing thinking tokens. Codex has achieved remarkable adoption internally at OpenAI, with engineers pushing 70% more PRs. The models represent iterative improvements based on user feedback from GPT-5.
Olivier identifies scientific research as a surprising breakthrough area, with researchers using LLMs to aggregate literature and accelerate hypothesis testing. A physicist used GPT-5 Pro to reproduce weeks of mathematical work in 30 minutes. Amgen is using AI to compress drug development timelines from months to weeks by automating regulatory documentation, representing a massive opportunity in pharma.
Response to Andrej Karpathy's comments about agents being a decade away. While complete job automation is hard, specific fields like coding are reaching automation tipping points. Success requires extensive scaffolding, harnesses, evaluation frameworks, and human-in-the-loop feedback systems. T-Mobile and other enterprises are achieving meaningful scale in customer support.
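The scaffolding described here can be pictured as a thin loop around the model: run each case, score the output, and route failures to a human. A minimal sketch, where `call_model` and the exact-match grader are hypothetical stand-ins for whatever model client and domain-specific checks a team actually uses:

```python
# Minimal evaluation harness with a human-in-the-loop gate.
# `call_model` and the grading rule are illustrative placeholders.

def call_model(prompt: str) -> str:
    # Placeholder: in practice this wraps an LLM API call.
    return "REFUND_APPROVED" if "refund" in prompt.lower() else "ESCALATE"

def grade(output: str, expected: str) -> float:
    # Simplest possible grader: exact match. Real harnesses often use
    # rubric-based or model-graded scoring instead.
    return 1.0 if output == expected else 0.0

def run_evals(cases, threshold=1.0):
    """Run every case; collect failing cases for human review."""
    needs_review = []
    passed = 0
    for prompt, expected in cases:
        output = call_model(prompt)
        if grade(output, expected) >= threshold:
            passed += 1
        else:
            needs_review.append((prompt, output, expected))
    return passed, needs_review

cases = [
    ("Customer asks for a refund on a late order", "REFUND_APPROVED"),
    ("Customer reports a billing dispute", "ESCALATE"),
]
passed, review_queue = run_evals(cases)
print(passed, len(review_queue))  # → 2 0
```

The point of the sketch is the structure, not the grader: the loop, the threshold, and the review queue are what turn a raw model into a deployable system.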
Discussion of when OpenAI works directly with enterprises versus enabling the ecosystem. The complexity and depth of enterprise problems is enormous - no single company can solve everything. OpenAI is building Apps in ChatGPT to enable third-party integrations, allowing startups to benefit from ChatGPT's adoption and memory while building specialized features.
Vision for ChatGPT becoming the first place enterprise workers check each morning. The Pulse feature has become transformative, preparing daily briefings with relevant emails, meetings, and papers. While ChatGPT won't replace every tool, it's becoming the central hub for productivity information and simple actions.
Olivier operates on three time horizons: current capabilities, near-term post-training improvements, and fundamental breakthroughs. Continuous learning is identified as the next critical frontier - enabling models to update weights based on inference-time human feedback, like hiring an intern who learns on the job rather than requiring everything to be documented upfront.
Core PMF categories remain coding, customer support, finance, and life sciences - gigantic markets whose total addressable market is hard to size because cheap software creation keeps expanding it. Strategy shifting from 'spray and pray' to doubling down on proven markets and going deeper. Customer support evolution from tier-one tickets to revenue generation through personalization.
Current scaffolding is highly bespoke across use cases, with teams trying various approaches (single/multiple agents, deterministic gates). No standard agent architecture has emerged yet, but OpenAI is working toward it. Code and coding capabilities are emerging as the most general-purpose capability, suggesting computer access via code execution will become standard.
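The 'deterministic gates' approach mentioned above can be sketched as a hard-coded check sitting between the model's proposed action and its execution - code, not the model, decides what runs. The tool names and policy here are invented for illustration:

```python
# Sketch of one agent step with a deterministic gate. The model may
# propose any tool call, but this fixed policy decides whether it
# executes, goes to a human, or is rejected. All names are illustrative.

ALLOWED_TOOLS = {"search_docs", "read_file"}
REQUIRES_APPROVAL = {"delete_record", "send_email"}

def gate(tool_name: str, args: dict) -> str:
    """Deterministic policy check, independent of model output quality."""
    if tool_name in ALLOWED_TOOLS:
        return "execute"
    if tool_name in REQUIRES_APPROVAL:
        return "ask_human"
    return "reject"

print(gate("search_docs", {}))  # → execute
print(gate("send_email", {}))   # → ask_human
print(gate("rm_rf", {}))        # → reject
```

Because the gate is plain code, its behavior is auditable and testable even when the model's behavior is not.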
OpenAI has reduced GPT-4 level query costs by 1-2 orders of magnitude in 2-3 years through model compression, better hardware, and GPU networking. High-stakes use cases like coding already have working economics, but many use cases (like personalized website content) are blocked by cost and latency. Every price cut reveals untapped demand larger than the revenue impact.
Reinforcement fine-tuning (RFT) not yet widely adopted - most enterprises still catching up to base model capabilities. Innovators at the frontier use RFT when blocked by base models. Example: accounting software achieved 20-30% improvement with a few dozen high-quality samples, crossing the threshold from 'not valuable' to 'valuable'. OpenAI released the first RFT API but it remains heavyweight and time-consuming.
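Conceptually, RFT optimizes the model against a grader that scores outputs on a small, high-quality sample set. The sketch below shows the shape of such a partial-credit grader; the accounting fields and scoring rule are invented, and OpenAI's actual RFT API defines its own grader formats:

```python
# Conceptual sketch of the reward signal reinforcement fine-tuning
# optimizes against: a grader scoring model output versus a reference.
# Fields and scoring are invented for illustration.

def grade_extraction(predicted: dict, reference: dict) -> float:
    """Partial-credit grader: fraction of reference fields matched."""
    if not reference:
        return 0.0
    correct = sum(1 for k, v in reference.items() if predicted.get(k) == v)
    return correct / len(reference)

reference = {"invoice_total": "1200.00", "tax_code": "VAT-20", "currency": "EUR"}
predicted = {"invoice_total": "1200.00", "tax_code": "VAT-20", "currency": "USD"}
score = grade_extraction(predicted, reference)
print(round(score, 2))  # → 0.67
```

A graded signal like this is why a few dozen samples can suffice: each sample contributes a dense score rather than a single right/wrong label.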
Three main factors drive model selection: capabilities/behavior, cost/latency, and 'vibes' (Twitter/influencer sentiment). Academic benchmarks provide limited value for specific use cases - industry-specific benchmarks like TauBench emerging. Best teams develop strong qualitative taste for model nuances, similar to expertise in writing or painting. Industry reinventing Gartner-style trust mechanisms.
Days of simple API parameter swaps are gone for non-trivial use cases. Model idiosyncrasies across providers are increasingly distinct - different instruction formats, tool signatures, context handling. Even sophisticated startups struggle with frequent updates. Enterprises want predictable release cadences with clear changelogs, similar to traditional software versioning.
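Because instruction formats and tool signatures diverge across providers, teams that switch models typically end up maintaining an adapter layer. A minimal sketch - both "provider" formats below are invented examples of the idiosyncrasies described above, not real API schemas:

```python
# Sketch of an adapter normalizing one internal message format into two
# hypothetical providers' chat formats. Neither format is a real schema.

def to_provider_a(messages):
    # Provider A: flat list of {"role", "content"} dicts.
    return [{"role": m["role"], "content": m["text"]} for m in messages]

def to_provider_b(messages):
    # Provider B: system prompt kept separate from the turn list.
    system = "\n".join(m["text"] for m in messages if m["role"] == "system")
    turns = [m["text"] for m in messages if m["role"] != "system"]
    return {"system": system, "turns": turns}

messages = [
    {"role": "system", "text": "You answer billing questions."},
    {"role": "user", "text": "Why was I charged twice?"},
]
a = to_provider_a(messages)
b = to_provider_b(messages)
print(len(a), b["system"])  # → 2 You answer billing questions.
```

The adapter isolates provider churn: when a model update changes a format, only one conversion function changes, not every call site.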
GPT-4o in May/June was the second major AGI breakthrough after ChatGPT, enabling tone and emotion understanding. However, voice hasn't passed the Turing test yet - interruptions and cadence still feel unnatural. Next frontier is achieving naturalness where users are equally comfortable with AI as with humans. Strong adoption in multilingual customer support where staffing every language is impossible.
Codex team exemplifies small, talented team singularly focused on use case. Current models excel at code generation, but software engineering involves much more: on-call, communication, scoping, architecture decisions, API duplication. Collaboration capabilities represent the next major unlock. Many enterprises still stuck on GitHub Copilot V1 due to security/compliance hurdles.
Increasingly difficult to separate model quality from harness quality - best agents have models trained for specific harnesses. OpenAI open-sourcing reference harnesses and tool definitions (like Codex) to enable ecosystem adoption. Industry evolving from pure model inference APIs to providing models + harnesses + reference UI designs as complete blueprints.
Most enterprises will buy harnesses/solutions rather than build, except for core business use cases. Building production-grade agents requires enormous effort. Critical success factors: clean data infrastructure, rigorous evaluation frameworks, and proper change management. Most enterprise knowledge exists in people's heads, not documentation, making eval creation an iterative people-finding process.
Sora API seeing strong adoption in ads/content generation and production studios. Production companies use it to quickly visualize concepts for brainstorming - showing 30-second visualizations of ideas accelerates creative collaboration. Video generation still in early innings due to cost and speed, but clear path to transforming creative workflows.
Scientific discovery and drug design identified as most underhyped opportunity. While software use cases feel natural to tech workers, accelerating scientific discovery is the substrate of all progress. Even 5% acceleration in discovery rate would have enormous compounding effects on economy and technology. Requires intersection of LLMs, lab infrastructure, data, and domain experts.
Ep 79: OpenAI's Head of Product on How the Best Teams Build, Ship and Scale AI Products