Latent Space: The AI Engineer Podcast

After LLMs: Spatial Intelligence and World Models — Fei-Fei Li & Justin Johnson, World Labs

November 25, 2025 • 11,965 words

Description

Fei-Fei Li and Justin Johnson are cofounders of World Labs, who have recently launched Marble (https://marble.worldlabs.ai/), a new kind of generative “world model” that can create editable 3D environ...

Summary

Fei-Fei Li and Justin Johnson discuss World Labs' Marble, a groundbreaking generative world model that creates editable 3D environments from text and images. They explore how spatial intelligence represents the next frontier beyond LLMs, drawing on their journey from ImageNet and early vision-language work to building production-ready 3D world models. The conversation covers technical architecture decisions around Gaussian splats, the role of physics in world models, and emerging use cases in gaming, VFX, robotics simulation, and design.

Jump to Topic

From ImageNet to World Labs: The Origin Story

Fei-Fei and Justin trace their collaboration from Justin joining her lab in 2012 (the same quarter AlexNet launched) through his career at Michigan and Meta, to reuniting over shared interest in world models and spatial intelligence as the next frontier beyond language models.

•Justin joined Fei-Fei's lab in 2012, the same quarter AlexNet was released, marking the start of the deep learning revolution
•Both independently identified world models and spatial intelligence as the natural evolution beyond language models
•The founding team combined deep academic research backgrounds with industry experience to tackle this new frontier
•Compute has scaled 1 million-fold from AlexNet days, enabling new possibilities in visual and spatial data processing

Academia vs Industry: Open Science and Resource Allocation

Discussion of the evolving role of academia in AI, the importance of maintaining open science alongside commercial work, and the critical need for better resourcing of academic AI research through initiatives like the National AI Research Resource (NAIRR) bill.

•Academia should focus on 'wacky ideas,' theoretical foundations, and blue-sky research rather than competing on largest models
•Fei-Fei advocates for the National AI Research Resource (NAIRR) bill to provide a compute cloud and data repositories for public sector research
•Fei-Fei's Stanford lab continues to practice open science, releasing datasets such as BEHAVIOR for benchmarking robotic learning
•The ecosystem benefits from diversity: both open academic research and focused industry productization have important roles

Hardware Lottery and Future Architecture Primitives

Justin discusses the 'hardware lottery' concept and proposes exploring neural network architectures beyond matrix multiplication that could better match future distributed computing systems, as current GPU scaling shows diminishing returns in performance per watt.

•Current neural networks are built around matrix multiplication because it fits GPUs well, but this may not scale infinitely
•From Hopper to Blackwell, performance per watt remains similar - scaling comes from more transistors and power, not efficiency
•Future architectures should consider primitives optimized for distributed systems spanning thousands of devices
•This represents a long-term academic research opportunity requiring years of exploration, not quick startup cycles

Early Vision-Language Work: Image Captioning and Dense Captioning

The team recounts pioneering work combining CNNs with LSTMs for image captioning in 2014-2015, competing with Google, and advancing to dense captioning that described different parts of scenes with spatial awareness.

•Combined ConvNet image representations with LSTM language models to generate captions - one of the first systems to do this
•Independently developed alongside Google, with both featured in New York Times coverage
•Advanced to 'dense captioning' (CVPR 2016) that drew boxes around objects and generated descriptions for each region
•Built a real-time web demo that streamed a webcam feed through a Stanford server, achieving 1 FPS between California and the conference in Santiago

Defining Spatial Intelligence vs Linguistic Intelligence

Fei-Fei defines spatial intelligence as complementary to linguistic intelligence, encompassing the ability to reason, understand, move, and interact in 3D space - capabilities that took evolution 540 million years to optimize versus ~500,000 years for language.

•Spatial intelligence is the capability to reason, understand, move, and interact in space - distinct from but complementary to language
•Examples range from DNA double helix discovery requiring 3D molecular reasoning to everyday tasks like grasping a mug
•Evolution spent 540 million years optimizing perception and spatial intelligence vs ~500,000 years for language development
•Vision is underappreciated because it's effortless for humans, while language requires conscious effort to learn

Marble: Architecture, Capabilities, and Use Cases

Deep dive into Marble as a generative 3D world model using Gaussian splats, offering multimodal inputs (text, images), interactive editing, precise camera control, and emerging applications in gaming, VFX, film, robotics simulation, and interior design.

•Marble generates 3D worlds from text/images using Gaussian splats as atomic units, renderable in real-time on mobile devices
•Precise camera control and recording capabilities emerge naturally from true 3D spatial understanding
•Interactive editing allows users to modify scenes (change colors, remove objects, alter materials) and regenerate
•Use cases span creative industries (gaming, VFX, film), robotics training data generation, and interior/architectural design
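
To make the "Gaussian splats as atomic units" idea concrete, here is a minimal sketch of how a scene made of splats can be shaded and edited. It is not Marble's implementation: the `Splat` fields, the isotropic falloff, and the orthographic camera looking down the z axis are all simplifying assumptions chosen for illustration. It only shows the general technique of front-to-back alpha compositing over a list of Gaussians.

```python
import math
from dataclasses import dataclass, replace
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass(frozen=True)
class Splat:
    """Hypothetical, simplified splat: an isotropic 3D Gaussian with a flat color."""
    center: Vec3      # (x, y, z); larger z means farther from the camera
    sigma: float      # spatial extent of the Gaussian
    color: Vec3       # RGB in [0, 1]
    opacity: float    # peak opacity in [0, 1]

def shade_pixel(splats: List[Splat], x: float, y: float) -> Vec3:
    """Front-to-back alpha compositing at pixel (x, y), orthographic view along z."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for s in sorted(splats, key=lambda s: s.center[2]):  # nearest first
        cx, cy, _ = s.center
        dist_sq = (x - cx) ** 2 + (y - cy) ** 2
        alpha = s.opacity * math.exp(-0.5 * dist_sq / s.sigma ** 2)
        for ch in range(3):
            color[ch] += transmittance * alpha * s.color[ch]
        transmittance *= 1.0 - alpha
    return (color[0], color[1], color[2])

# A half-transparent red splat in front of an opaque blue one.
front = Splat(center=(0.0, 0.0, 1.0), sigma=0.5, color=(1.0, 0.0, 0.0), opacity=0.5)
back = Splat(center=(0.0, 0.0, 2.0), sigma=0.5, color=(0.0, 0.0, 1.0), opacity=1.0)

original = shade_pixel([back, front], 0.0, 0.0)             # (0.5, 0.0, 0.5)

# Edits become plain data operations on the splat list.
recolored = shade_pixel([back, replace(front, color=(0.0, 1.0, 0.0))], 0.0, 0.0)
removed = shade_pixel([back], 0.0, 0.0)                      # (0.0, 0.0, 1.0)
```

Because the scene is an explicit list of primitives, moving the camera or changing one object only changes the inputs to the same cheap compositing loop, which is one reason this kind of representation suits real-time rendering.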

Physics, Dynamics, and the Limits of Pattern Matching

Discussion of whether world models truly 'understand' physics or just fit patterns, the difference between generating plausible visuals versus modeling causal forces, and potential approaches to integrating physics through splat properties or classical engines.

•Current models fit patterns rather than learning causal laws - they may generate plausible orbits without understanding gravity
•For creative use cases (film backdrops), plausibility matters more than true physics; for architecture, accurate force modeling is critical
•Potential approaches: attach physical properties to splats and simulate, use classical physics engines, or regenerate scenes based on user actions
•Emergent capabilities at scale may eventually lead to implicit physics understanding, but this remains to be proven
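
The first approach listed above, attaching physical properties to splats and simulating them, can be sketched as follows. This is an illustrative toy, not a description of any World Labs system: the `PhysicalSplat` fields (`mass`, `restitution`), the flat ground plane, and the semi-implicit Euler integrator are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence

GRAVITY_Y = -9.81  # m/s^2, acting along the y axis

@dataclass
class PhysicalSplat:
    """Hypothetical splat carrying the state a simple simulator needs."""
    position: List[float]      # [x, y, z] in metres
    velocity: List[float]      # [vx, vy, vz] in m/s
    mass: float                # kg
    restitution: float = 0.5   # fraction of speed kept after a ground bounce

def step(splats: Sequence[PhysicalSplat], dt: float,
         force: Sequence[float] = (0.0, 0.0, 0.0), ground_y: float = 0.0) -> None:
    """Advance every splat by dt using semi-implicit Euler integration."""
    for s in splats:
        # Update velocity first: external force scaled by mass, plus gravity.
        for axis in range(3):
            s.velocity[axis] += (force[axis] / s.mass) * dt
        s.velocity[1] += GRAVITY_Y * dt
        # Then update position with the new velocity.
        for axis in range(3):
            s.position[axis] += s.velocity[axis] * dt
        # Crude collision response against a flat ground plane.
        if s.position[1] < ground_y:
            s.position[1] = ground_y
            s.velocity[1] = -s.velocity[1] * s.restitution

ball = PhysicalSplat(position=[0.0, 1.0, 0.0], velocity=[0.0, 0.0, 0.0], mass=1.0)
step([ball], dt=0.01)
first_height = ball.position[1]   # slightly below 1.0 after one step

for _ in range(200):              # keep simulating; the splat bounces and settles
    step([ball], dt=0.01)
```

Here the causal rule (gravity, force divided by mass) is written by hand rather than learned, which is exactly the contrast the discussion draws with a generative model that only produces plausible-looking motion.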

Transformers as Set Models and Beyond Sequence-to-Sequence

Justin clarifies that transformers are natively set models, not sequence models - only positional embeddings impose order. This architectural insight suggests transformers can naturally extend beyond 1D sequences to spatial data.

•Transformers are mathematically models of sets, not sequences - they're permutation equivariant
•Only positional embeddings inject sequential order; all other operations (attention, FFN, normalization) are set-based
•This means transformers can naturally handle non-sequential data like 3D spatial tokens without fundamental architecture changes
•World models may move beyond sequence-to-sequence paradigms while keeping proven components like attention
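
The permutation-equivariance claim can be checked numerically. The sketch below is a minimal single-head self-attention written in plain Python (no batching, no masking, random weights); all names are illustrative. It shows that shuffling the input tokens shuffles the outputs in the same way, and that adding per-position embeddings breaks this property.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head self-attention with no positional information."""
    Q = [matvec(Wq, t) for t in tokens]
    K = [matvec(Wk, t) for t in tokens]
    V = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(Q[0]))
    outputs = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

rng = random.Random(0)
dim, n_tokens = 4, 5
Wq, Wk, Wv = ([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
              for _ in range(3))
tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
perm = [3, 0, 4, 1, 2]

# Without positions: attention(shuffle(x)) equals shuffle(attention(x)).
out = self_attention(tokens, Wq, Wk, Wv)
out_shuffled = self_attention([tokens[i] for i in perm], Wq, Wk, Wv)
set_gap = max_abs_diff(out_shuffled, [out[i] for i in perm])

# With positional embeddings added by slot index, the same check fails.
pos = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
def add_pos(seq):
    return [[t + p for t, p in zip(tok, pos[i])] for i, tok in enumerate(seq)]

out_pos = self_attention(add_pos(tokens), Wq, Wk, Wv)
out_pos_shuffled = self_attention(add_pos([tokens[i] for i in perm]), Wq, Wk, Wv)
seq_gap = max_abs_diff(out_pos_shuffled, [out_pos[i] for i in perm])

print(f"gap without positions: {set_gap:.2e}, with positions: {seq_gap:.2e}")
```

The first gap is zero up to floating-point rounding, while the second is not, so the only place order enters is the embedding step. For spatial data, that step could encode 3D coordinates instead of a 1D index while leaving the attention computation unchanged.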

Learning Paradigms: Theory Building vs Pattern Fitting

Exploration of how humans develop theories through interaction and hypothesis testing (like finding lost keys or discovering Newtonian physics) versus how current models learn from passive observation, and what new learning paradigms might bridge this gap.

•Human learning involves building theories, testing them through world interaction, and updating based on mismatches with expectations
•Current models learn from passive data observation without the hypothesis-testing loop that drives human understanding
•Abstraction at the level of F = ma may be beyond current LLMs, even if they can accurately predict trajectories from celestial data
•More efficient learning could come from hypothesis-driven experiments that eliminate possible worlds rather than just pattern matching
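
The idea of experiments that "eliminate possible worlds" can be illustrated with a toy example. Everything here is hypothetical: the four candidate laws, the list of available experiments, and the selection rule (pick the experiment on which the surviving hypotheses disagree most) are assumptions chosen to show the contrast between passive observation and hypothesis-driven testing.

```python
# Candidate "worlds": competing laws relating force F, mass m, and acceleration a.
HYPOTHESES = {
    "a = F / m": lambda F, m: F / m,
    "a = F * m": lambda F, m: F * m,
    "a = F / m^2": lambda F, m: F / m ** 2,
    "a = F": lambda F, m: F,
}

def true_world(F, m):
    """The environment the learner can query (Newton's second law)."""
    return F / m

def eliminate(alive, experiment, observed, tol=1e-9):
    """Keep only hypotheses whose prediction matches the observation."""
    F, m = experiment
    return {name for name in alive
            if abs(HYPOTHESES[name](F, m) - observed) <= tol}

def most_informative(alive, experiments):
    """Choose the experiment where surviving hypotheses disagree the most."""
    def distinct_predictions(experiment):
        F, m = experiment
        return len({round(HYPOTHESES[name](F, m), 9) for name in alive})
    return max(experiments, key=distinct_predictions)

experiments = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (2.0, 2.0)]  # (F, m) pairs

# Passive observation: data that happens to have m = 1 never separates the laws.
passive = set(HYPOTHESES)
for experiment in experiments[:3]:
    passive = eliminate(passive, experiment, true_world(*experiment))

# Active testing: one deliberately chosen experiment rules out the rest.
chosen = most_informative(set(HYPOTHESES), experiments)
active = eliminate(set(HYPOTHESES), chosen, true_world(*chosen))

print(len(passive), chosen, active)
```

Three passive observations leave all four laws standing, while a single chosen experiment with m = 2 leaves only one. The toy hides the hard part, which is proposing the candidate laws in the first place.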

Hiring and Future Directions at World Labs

World Labs is hiring researchers working on world model training, engineers building systems from training to inference to product, and business and product talent to bring spatial intelligence to market.

•Seeking deep researchers to solve fundamental world model training and architecture problems
•Need engineers across the stack: training optimization, inference systems, and product development
•Looking for business, product, and go-to-market talent to deliver spatial intelligence products
•Intellectual fearlessness is a core principle - team is pioneering both model and product approaches with no prior playbook
