Physical Intelligence’s Karol Hausman and Tobi Springenberg believe that robotics has been held back not by hardware limitations, but by an intelligence bottleneck that foundation models can solve.
Physical Intelligence is building foundation models that enable any robot to perform any task through end-to-end learning. Their latest model, π*0.6, uses reinforcement learning from real-world experience to achieve deployment-ready performance, with robots running autonomously for 13+ hours. The team has moved beyond classical perception-planning-control pipelines to vision-language-action models that generalize across radically different embodiments and tasks, from coffee-making to surgical robots.
Physical Intelligence chose to build foundation models rather than vertically integrated robots because intelligence has always been the bottleneck in robotics, not hardware. Teleoperated robots have demonstrated capable hardware for over a decade, but lack autonomous intelligence to operate reliably.
The historical approach of decomposing robotics into perception, planning, and control components was fundamentally flawed. End-to-end learning from pixels to actions, combined with vision-language pre-training for common sense, represents the current state-of-the-art approach that Physical Intelligence employs.
Physical Intelligence's models use a vision-language-action architecture similar to VLMs, with images and text as input and actions as output. The architecture may evolve, but the fundamental approach of ingesting diverse data into a single large model appears stable, with scaling coming primarily from data diversity rather than just quantity.
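To make the architecture concrete, here is a minimal toy sketch of a vision-language-action interface: images and text in, a chunk of actions out. All function names, dimensions, and the linear "policy" are illustrative assumptions for this sketch, not Physical Intelligence's actual model.

```python
# Toy VLA interface sketch. Everything here (encoders, sizes, the linear
# policy) is a placeholder assumption; a real model would be a large
# pretrained vision-language transformer trained end-to-end.
import random

random.seed(0)

IMG_DIM, TXT_DIM, ACT_DIM, CHUNK = 8, 4, 3, 5  # illustrative sizes

def encode_image(pixels):
    """Stand-in image encoder: mean-pool pixels into a fixed vector."""
    m = sum(pixels) / len(pixels)
    return [m] * IMG_DIM

def encode_text(tokens):
    """Stand-in text encoder: hash tokens into a fixed-length vector."""
    vec = [(hash(t) % 100) / 100 for t in tokens][:TXT_DIM]
    return vec + [0.0] * (TXT_DIM - len(vec))

def vla_policy(pixels, tokens, weights):
    """Map concatenated [image ; text] features to a chunk of actions."""
    feats = encode_image(pixels) + encode_text(tokens)
    chunk = []
    for step in range(CHUNK):
        action = [sum(w * f for w, f in zip(weights[step][a], feats))
                  for a in range(ACT_DIM)]
        chunk.append(action)
    return chunk  # CHUNK timesteps x ACT_DIM joint targets

# Random placeholder weights; training would fit these end-to-end.
W = [[[random.uniform(-1, 1) for _ in range(IMG_DIM + TXT_DIM)]
      for _ in range(ACT_DIM)] for _ in range(CHUNK)]

actions = vla_policy(pixels=[0.2, 0.5, 0.9], tokens=["make", "coffee"], weights=W)
print(len(actions), len(actions[0]))  # prints "5 3"
```

The key structural point the sketch captures is that the same input interface (pixels plus a language instruction) can drive very different robots, since the output is just an action chunk whose dimensionality matches the embodiment.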
The π*0.6 release introduces RL from real-world robot experience, moving beyond pure imitation learning. The system collects autonomous deployment data with human corrections and reward signals, achieving 2x throughput improvements and enabling 13-hour autonomous coffee-making runs.
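One way to picture learning from deployment data is as a weighting problem over mixed episode types. The scheme below, upweighting human corrections and reward-weighting autonomous rollouts, is an assumption for illustration only, not the actual π*0.6 recipe.

```python
# Hedged sketch: weighting deployment episodes for a policy update.
# The episode format and the weighting rule are illustrative assumptions.

episodes = [
    {"source": "autonomous", "reward": 1.0},  # task succeeded on its own
    {"source": "autonomous", "reward": 0.0},  # failed, no correction given
    {"source": "correction", "reward": 1.0},  # human stepped in to fix a failure
]

def sample_weight(ep):
    """Upweight human corrections; weight autonomous data by reward."""
    if ep["source"] == "correction":
        return 2.0            # corrections demonstrate error recovery
    return ep["reward"]       # keep successful autonomous rollouts, drop failures

weights = [sample_weight(ep) for ep in episodes]
print(weights)  # prints "[1.0, 0.0, 2.0]"
```

The design intuition is that autonomous successes reinforce current behavior, while corrections supply exactly the recovery behavior the policy lacked, which is why both signals appear in the description above.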
Physical Intelligence prioritizes real-world RL over simulation because manipulation tasks require modeling how the entire world reacts to robot actions, not just the robot's own body dynamics. Simulation works well for locomotion but fails to capture the long tail of manipulation failures.
Physical Intelligence is in a 'bootstrap phase' where any data source that improves models is valuable - sim, internet video, teleoperation. However, the long-term vision relies on deployment generating vastly more data than bootstrap sources, creating a self-improving data flywheel.
The models demonstrate surprising generalization across radically different robot forms and tasks - from surgical robots to drones to manipulation - in ways not fully understood. Diversity of training data appears to be the key to zero-shot deployment in new environments.
Physical Intelligence deliberately avoids picking specific applications early to prevent becoming narrowly focused like historical robotics companies. They prioritize expanding the technology's aperture and deployment readiness before determining commercialization models.
Unlike self-driving's 15+ year timeline requiring near-perfect reliability, robotics can deploy at 95% reliability for many tasks. The foundation model era provides common sense understanding that wasn't available during early self-driving development, potentially accelerating progress.
The team reflects on how mind-blowing it is that general-purpose learning algorithms trained end-to-end can achieve intelligence across vision, language, robotics, and more. This represents a fundamental shift from decomposing problems into subcomponents to letting models learn holistically from data.
Training General Robots for Any Task: Physical Intelligence’s Karol Hausman and Tobi Springenberg