Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environ...
Fei-Fei Li and Justin Johnson discuss World Labs' Marble, a groundbreaking generative world model that creates editable 3D environments from text and images. They explore how spatial intelligence represents the next frontier beyond LLMs, drawing on their journey from ImageNet and early vision-language work to building production-ready 3D world models. The conversation covers technical architecture decisions around Gaussian splats, the role of physics in world models, and emerging use cases in gaming, VFX, robotics simulation, and design.
Fei-Fei and Justin trace their collaboration from Justin joining her lab in 2012 (the same quarter AlexNet launched) through his career at Michigan and Meta, to reuniting over shared interest in world models and spatial intelligence as the next frontier beyond language models.
Discussion of the evolving role of academia in AI, the importance of maintaining open science alongside commercial work, and the critical need for better resourcing of academic AI research through initiatives like the National AI Research Resource (NAIRR) bill.
Justin discusses the 'hardware lottery' concept and proposes exploring neural network architectures beyond matrix multiplication that could better match future distributed computing systems, as current GPU scaling shows diminishing returns in performance per watt.
The team recounts pioneering work combining CNNs with LSTMs for image captioning in 2014-2015, competing with Google, and advancing to dense captioning that described different parts of scenes with spatial awareness.
Fei-Fei defines spatial intelligence as complementary to linguistic intelligence, encompassing the ability to reason, understand, move, and interact in 3D space. She notes that evolution took roughly 540 million years to optimize these capabilities, versus roughly 500,000 years for language.
Deep dive into Marble as a generative 3D world model using Gaussian splats, offering multimodal inputs (text, images), interactive editing, precise camera control, and emerging applications in gaming, VFX, film, robotics simulation, and interior design.
Discussion of whether world models truly 'understand' physics or just fit patterns, the difference between generating plausible visuals versus modeling causal forces, and potential approaches to integrating physics through splat properties or classical engines.
Justin clarifies that transformers are natively set models, not sequence models: only positional embeddings impose order. This architectural insight suggests transformers can naturally extend beyond 1D sequences to spatial data.
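The "set model" point can be checked numerically. The following is a minimal sketch, not code from the episode or from World Labs: it assumes a single unmasked self-attention head with random weights, and every name in it (`self_attention`, `rand_matrix`, `add_positions`) is hypothetical. It shows that shuffling the input tokens simply shuffles the outputs the same way (permutation equivariance), and that this stops holding once slot-dependent positional embeddings are added.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, wq, wk, wv):
    """Single-head, unmasked self-attention with no positional information."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def add_positions(x, pos):
    """Add a slot-dependent embedding: row i of pos goes to slot i."""
    return [[a + b for a, b in zip(r, p)] for r, p in zip(x, pos)]

def all_close(a, b, tol=1e-9):
    """Elementwise comparison of two matrices up to a small tolerance."""
    return all(math.isclose(x, y, abs_tol=tol)
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

tokens = rand_matrix(4, 3)                      # 4 tokens, 3 features each
wq, wk, wv = rand_matrix(3, 3), rand_matrix(3, 3), rand_matrix(3, 3)
perm = [2, 0, 3, 1]                             # an arbitrary reordering
shuffled = [tokens[i] for i in perm]

# Without positions: shuffling the inputs only shuffles the outputs.
out = self_attention(tokens, wq, wk, wv)
out_shuffled = self_attention(shuffled, wq, wk, wv)
equivariant = all_close(out_shuffled, [out[i] for i in perm])

# With positions tied to slots: the same shuffle now changes the result.
pos = rand_matrix(4, 3)
out_pos = self_attention(add_positions(tokens, pos), wq, wk, wv)
out_pos_shuffled = self_attention(add_positions(shuffled, pos), wq, wk, wv)
order_matters = not all_close(out_pos_shuffled, [out_pos[i] for i in perm])

print("equivariant without positions:", equivariant)
print("order matters with positions:", order_matters)
```

Because nothing in the attention computation itself refers to a token's index, the same mechanism could in principle take position information derived from 2D or 3D coordinates instead of a 1D index, which is the extension to spatial data the summary points at. A causal mask, which this sketch omits, would be another source of order dependence.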
Exploration of how humans develop theories through interaction and hypothesis testing (like finding lost keys or discovering Newtonian physics) versus how current models learn from passive observation, and what new learning paradigms might bridge this gap.
World Labs is hiring researchers to work on world model training, engineers to build systems spanning training, inference, and product, and business and product talent to bring spatial intelligence to market.
After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs