Physical Intelligence’s Karol Hausman and Tobi Springenberg believe that robotics has been held back not by hardware limitations, but by an intelligence bottleneck that foundation models can solve.
Physical Intelligence is building foundation models that enable any robot to perform any task through end-to-end learning. Their latest model, π*0.6, uses reinforcement learning from real-world experience to achieve deployment-ready performance, with robots running autonomously for 13+ hours. The team has moved beyond classical perception-planning-control pipelines to vision-language-action models that generalize across radically different embodiments and tasks, from coffee-making to surgical robots.
Physical Intelligence chose to build foundation models rather than vertically integrated robots because intelligence has always been the bottleneck in robotics, not hardware. Teleoperated robots have demonstrated capable hardware for over a decade, but lack autonomous intelligence to operate reliably.
The historical approach of decomposing robotics into perception, planning, and control components was fundamentally flawed. End-to-end learning from pixels to actions, combined with vision-language pre-training for common sense, represents the current state-of-the-art approach that Physical Intelligence employs.
Physical Intelligence's models use a vision-language-action architecture similar to VLMs, with images and text as input and actions as output. The architecture may evolve, but the fundamental approach of ingesting diverse data into a single large model appears stable, with scaling coming primarily from data diversity rather than just quantity.
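To make the architecture concrete, here is a minimal toy sketch of a vision-language-action interface: images and text in, a chunk of actions out. All function names, dimensions, and the linear "policy" are illustrative assumptions for this sketch, not Physical Intelligence's actual model.

```python
# Toy VLA interface sketch. Everything here (encoders, sizes, the linear
# policy) is a placeholder assumption; a real model would be a large
# pretrained vision-language transformer trained end-to-end.
import random

random.seed(0)

IMG_DIM, TXT_DIM, ACT_DIM, CHUNK = 8, 4, 3, 5  # illustrative sizes

def encode_image(pixels):
    """Stand-in image encoder: mean-pool pixels into a fixed vector."""
    m = sum(pixels) / len(pixels)
    return [m] * IMG_DIM

def encode_text(tokens):
    """Stand-in text encoder: hash tokens into a fixed-length vector."""
    vec = [(hash(t) % 100) / 100 for t in tokens][:TXT_DIM]
    return vec + [0.0] * (TXT_DIM - len(vec))

def vla_policy(pixels, tokens, weights):
    """Map concatenated [image ; text] features to a chunk of actions."""
    feats = encode_image(pixels) + encode_text(tokens)
    chunk = []
    for step in range(CHUNK):
        action = [sum(w * f for w, f in zip(weights[step][a], feats))
                  for a in range(ACT_DIM)]
        chunk.append(action)
    return chunk  # CHUNK timesteps x ACT_DIM joint targets

# Random placeholder weights; training would fit these end-to-end.
W = [[[random.uniform(-1, 1) for _ in range(IMG_DIM + TXT_DIM)]
      for _ in range(ACT_DIM)] for _ in range(CHUNK)]

actions = vla_policy(pixels=[0.2, 0.5, 0.9], tokens=["make", "coffee"], weights=W)
print(len(actions), len(actions[0]))  # prints "5 3"
```

The key structural point the sketch captures is that the same input interface (pixels plus a language instruction) can drive very different robots, since the output is just an action chunk whose dimensionality matches the embodiment.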
The π*0.6 release introduces RL from real-world robot experience, moving beyond pure imitation learning. The system collects autonomous deployment data with human corrections and reward signals, achieving 2x throughput improvements and enabling 13-hour autonomous coffee-making runs.
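One way to picture learning from deployment data is as a weighting problem over mixed episode types. The scheme below, upweighting human corrections and reward-weighting autonomous rollouts, is an assumption for illustration only, not the actual π*0.6 recipe.

```python
# Hedged sketch: weighting deployment episodes for a policy update.
# The episode format and the weighting rule are illustrative assumptions.

episodes = [
    {"source": "autonomous", "reward": 1.0},  # task succeeded on its own
    {"source": "autonomous", "reward": 0.0},  # failed, no correction given
    {"source": "correction", "reward": 1.0},  # human stepped in to fix a failure
]

def sample_weight(ep):
    """Upweight human corrections; weight autonomous data by reward."""
    if ep["source"] == "correction":
        return 2.0            # corrections demonstrate error recovery
    return ep["reward"]       # keep successful autonomous rollouts, drop failures

weights = [sample_weight(ep) for ep in episodes]
print(weights)  # prints "[1.0, 0.0, 2.0]"
```

The design intuition is that autonomous successes reinforce current behavior, while corrections supply exactly the recovery behavior the policy lacked, which is why both signals appear in the description above.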
Physical Intelligence prioritizes real-world RL over simulation because manipulation tasks require modeling how the entire world reacts to robot actions, not just the robot's own body dynamics. Simulation works well for locomotion but fails to capture the long tail of manipulation failures.
Physical Intelligence is in a 'bootstrap phase' where any data source that improves models is valuable - sim, internet video, teleoperation. However, the long-term vision relies on deployment generating vastly more data than bootstrap sources, creating a self-improving data flywheel.
The models demonstrate surprising generalization across radically different robot forms and tasks - from surgical robots to drones to manipulation - in ways not fully understood. Diversity of training data appears to be the key to zero-shot deployment in new environments.
Physical Intelligence deliberately avoids picking specific applications early to prevent becoming narrowly focused like historical robotics companies. They prioritize expanding the technology's aperture and deployment readiness before determining commercialization models.
Unlike self-driving's 15+ year timeline requiring near-perfect reliability, robotics can deploy at 95% reliability for many tasks. The foundation model era provides common sense understanding that wasn't available during early self-driving development, potentially accelerating progress.
The team reflects on how mind-blowing it is that general-purpose learning algorithms trained end-to-end can achieve intelligence across vision, language, robotics, and more. This represents a fundamental shift from decomposing problems into subcomponents to letting models learn holistically from data.
Training General Robots for Any Task: Physical Intelligence’s Karol Hausman and Tobi Springenberg