Open Source AI Strikes Back — Inside Ai2’s OLMo 3 ‘Thinking’
In this special release episode, Matt sits down with Nathan Lambert and Luca Soldaini from Ai2 (the Allen Institute for AI) to break down one of the biggest open-source AI drops of the year: OLMo 3. A...
Nathan Lambert and Luca Soldaini from Ai2 discuss the release of OLMo 3, a fully open-source AI model family that includes base models, instruction-tuned models, and reasoning models. Unlike typical 'open weights' releases, Ai2 publishes the complete training data, recipes, intermediate checkpoints, and evaluation frameworks. The conversation provides an unprecedented technical deep-dive into the six-stage pipeline from pre-training through reinforcement learning, while addressing the geopolitical shift in open-source AI as Chinese models like Qwen and DeepSeek dominate the landscape amid uncertainty around the future of Meta's Llama.
Introduction to the OLMo 3 model family release, including 7B and 32B base models, thinking models, and instruct models. Discussion of Ai2's commitment to full openness (releasing not just weights but all data, intermediate checkpoints, recipes, and evaluation frameworks), in contrast with typical 'open weights' releases from other labs.
Analysis of how DeepSeek's January release catalyzed an explosion of Chinese open-source models (Qwen, Kimi, DeepSeek) while the future of Meta's Llama became uncertain. Discussion of the strategic vacuum in US open-source AI and emerging responses, including the ATOM Project and increased investment from players like NVIDIA and Reflection.
Explanation of what thinking models are: models trained to spend more compute at inference time through long chains of thought, yielding step-change improvements on math, coding, and agentic tasks. Discussion of why they're becoming the industry standard despite being less 'fun' to build than regular instruct models.
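To make the thinking mechanic concrete, here is a minimal Python sketch of how a reasoning model's output is typically post-processed, separating the chain-of-thought span from the final answer. The `<think>...</think>` delimiters are an assumption borrowed from a common convention (popularized by models like DeepSeek R1); OLMo 3's exact format may differ.

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Separate the chain-of-thought span from the final answer.

    Assumes the model wraps its reasoning in <think>...</think> tags,
    a common convention; OLMo 3's actual delimiters may differ.
    """
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        return "", completion.strip()  # no trace found: treat it all as answer
    thought = match.group(1).strip()
    answer = completion[match.end():].strip()
    return thought, answer

completion = "<think>2 + 2: add the units... that gives 4.</think>The answer is 4."
thought, answer = split_reasoning(completion)
print(len(thought), "chars of reasoning ->", answer)
```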
Background on the Allen Institute for AI, founded by Paul Allen in 2014, and its evolution from science-focused AI to becoming a leader in open language models. Discussion of Ai2's grassroots initiative in November 2022 to build fully open models, securing initial compute from AMD, and the organization's ~100-person team structure.
Deep dive into pre-training methodology, including the constraint-driven approach of fixing the compute budget and duration (two months max), then optimizing data selection from a 30T-token pool down to 6T tokens. Discussion of intelligent token repetition, domain balancing, and the critical importance of avoiding loss spikes that would force training restarts.
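As an illustration of the budget-then-data framing, here is a toy sketch of domain balancing under a repetition cap: given per-domain token pools and target mixture weights, allocate a fixed 6T-token budget without repeating any domain too many times. All domain names, pool sizes, weights, and the 4-epoch cap are made up for illustration; Ai2's actual recipe is defined in their released data tooling.

```python
# Toy sketch of budget-constrained domain mixing with a repetition cap.
# Every constant below is illustrative, not Ai2's actual OLMo 3 mixture.

BUDGET = 6_000_000_000_000  # 6T training tokens
MAX_EPOCHS = 4.0            # assumed cap: never repeat a domain more than 4x

pool = {                    # unique tokens available per domain (made up)
    "web": 25_000_000_000_000,
    "code": 2_500_000_000_000,
    "math": 100_000_000_000,
    "papers": 1_200_000_000_000,
}
target_mix = {"web": 0.60, "code": 0.20, "math": 0.08, "papers": 0.12}

def plan_mixture(pool, target_mix, budget, max_epochs):
    """Allocate training tokens per domain, capping repetition and
    redistributing any shortfall to domains that still have headroom."""
    alloc = {}
    for name, frac in target_mix.items():
        want = frac * budget
        cap = pool[name] * max_epochs   # most we can use within the repeat cap
        alloc[name] = min(want, cap)
    shortfall = budget - sum(alloc.values())
    headroom = {n: pool[n] * max_epochs - alloc[n] for n in alloc}
    total = sum(headroom.values())
    if shortfall > 0 and total > 0:
        for n in alloc:  # hand the shortfall out proportionally to headroom
            alloc[n] += shortfall * headroom[n] / total
    return alloc

for name, tokens in plan_mixture(pool, target_mix, BUDGET, MAX_EPOCHS).items():
    print(f"{name:7s} {tokens / 1e12:5.2f}T tokens  ({tokens / pool[name]:.2f} epochs)")
```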
Explanation of mid-training (also called 'tail patching'), which adds capabilities the model didn't learn during pre-training, and the critical importance of long-context extension for reasoning models. Discussion of architectural decisions that matter more than data quality, and extending context windows from 4K to 65K+ tokens.
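The episode doesn't pin down the exact long-context mechanism, so as a hedged illustration here are the two most common RoPE-based ways to stretch a 4K window toward 64K: position interpolation and NTK-style base scaling. All constants are illustrative.

```python
# Two common ways to stretch a RoPE context window (position interpolation
# vs. NTK-style base scaling). Whether OLMo 3 uses either is not stated in
# these notes; all constants here are illustrative.

def rope_freqs(dim: int, base: float) -> list[float]:
    """Per-pair rotation frequencies for rotary position embeddings."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]

DIM = 128
OLD_LEN, NEW_LEN = 4_096, 65_536
scale = NEW_LEN / OLD_LEN            # 16x extension

slowest = rope_freqs(DIM, 10_000.0)[-1]
pos = 60_000                         # a position far beyond the old window

# Option 1: position interpolation -- map position p to p / scale so every
# rotation angle stays inside the range seen during pre-training.
interp_angle = (pos / scale) * slowest

# Option 2: NTK-style scaling -- raise the base so the slowest dimensions
# stretch to cover the longer window while fast dimensions barely change.
ntk_base = 10_000.0 * scale ** (DIM / (DIM - 2))
ntk_angle = pos * rope_freqs(DIM, ntk_base)[-1]

print(f"lowest-frequency angle at pos {pos}:")
print(f"  interpolation: {interp_angle:.3f} rad")
print(f"  NTK scaling:   {ntk_angle:.3f} rad")
```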
Detailed explanation of supervised fine-tuning (SFT) for small reasoning models through distillation from larger teachers (DeepSeek R1, Qwen's QwQ). Discussion of generating 2.5M reasoning traces, the practicality of using Chinese models as teachers due to licensing and quality, and how this stage provides the foundation for 90% of model performance.
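A minimal sketch of the distillation loop this stage describes: sample teacher reasoning traces, keep the ones that verify, and fine-tune the student with next-token loss on the completion only. `teacher_generate`, `verify`, and `sft_train` are hypothetical stand-ins, not Ai2's actual tooling; the -100 label masking follows the common Hugging Face convention.

```python
# Hypothetical stand-ins: `teacher_generate` calls a large reasoning model
# (e.g., DeepSeek R1 or QwQ), `verify` checks the final answer, and
# `sft_train` runs ordinary supervised fine-tuning. None of these are
# Ai2's actual tooling.

def distill(prompts, teacher_generate, verify, sft_train, tries_per_prompt=4):
    """Collect verified teacher reasoning traces, then fine-tune on them."""
    dataset = []
    for prompt in prompts:
        for _ in range(tries_per_prompt):
            trace = teacher_generate(prompt)   # long chain-of-thought + answer
            if verify(prompt, trace):          # e.g., answer matches reference
                dataset.append({"prompt": prompt, "completion": trace})
                break                          # keep one good trace per prompt
    return sft_train(dataset)

def build_labels(prompt_ids, completion_ids, ignore_index=-100):
    """Mask prompt tokens so next-token loss falls only on the teacher's
    trace; -100 is the common Hugging Face ignore-index convention."""
    return [ignore_index] * len(prompt_ids) + list(completion_ids)
```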
Explanation of DPO (Direct Preference Optimization) as a simpler alternative to RLHF that works surprisingly well on reasoning models. Discussion of the 'delta learning hypothesis' (that the contrast between chosen and rejected examples matters more than their absolute quality) and the challenge of finding sufficient variance as open models improve.
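The DPO objective itself is compact enough to show directly. A minimal PyTorch sketch, assuming you already have summed log-probabilities of each response under the trained policy and a frozen reference model; `beta` is the usual KL-strength knob. Note how only the contrast between chosen and rejected margins enters the loss, which is exactly the intuition behind the delta learning hypothesis.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * (chosen margin - rejected margin)).

    Each input is the summed log-probability of a full response under the
    policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage: only the *difference* between the two margins drives the loss,
# not how good either response is in absolute terms.
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-11.0]), torch.tensor([-11.5]))
print(loss.item())
```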
Deep dive into RLVR (Reinforcement Learning with Verifiable Rewards), which uses correctness-based rewards rather than human preference models. Discussion of the extreme infrastructure challenges of long-context RL, numerical stability issues between vLLM inference and the training framework, and why this stage is critical for future model development despite modest immediate gains.
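A minimal sketch of the verifiable-reward idea: score each sampled completion 1.0 if its final answer checks out and 0.0 otherwise, then convert group scores into advantages with a group-mean baseline (a GRPO-style choice; the episode doesn't specify OLMo 3's exact algorithm, so treat this as one common instantiation). The string-matching verifier is a deliberately crude stand-in for real answer parsing or unit tests.

```python
import statistics

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary correctness reward; real verifiers parse answers, run unit
    tests, or check math symbolically rather than matching strings."""
    return 1.0 if completion.strip().endswith(reference_answer) else 0.0

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-mean baseline with std normalization, as in GRPO-style methods."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero if all equal
    return [(r - mean) / std for r in rewards]

# Toy usage: several sampled completions for one prompt, scored and normalized.
samples = ["... so the answer is 4", "... the answer is 5", "... it is 4"]
rewards = [verifiable_reward(s, "4") for s in samples]
print(rewards, group_advantages(rewards))
```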
Discussion of AGI timelines and the tension between rapid progress and increasing system complexity. Both researchers express belief in transformative AI by 2030 but reject discontinuous 'singularity' scenarios, arguing that physical constraints, complexity tax, and messy co-evolution will result in smooth but dramatic progress rather than sudden jumps.