Fei-Fei Li is a Stanford professor, co-director of the Stanford Institute for Human-Centered Artificial Intelligence, and co-founder of World Labs. She created ImageNet, the dataset that sparked the deep learning revolution.
Fei-Fei Li and Justin Johnson of World Labs discuss their journey from ImageNet to building Marble, the first model generating explorable 3D worlds. They explore why spatial intelligence is fundamentally different from language models, requiring understanding of 3D structure, physics, and embodied interaction. Key insights include transformers being set models not sequence models, the role of Gaussian splats in real-time 3D rendering, and emerging use cases in gaming, VFX, robotics training, and interior design.
Overview of Fei-Fei Li's creation of ImageNet and her reunion with former PhD student Justin Johnson to build World Labs. They introduce Marble as the first model generating explorable 3D worlds from text or images, and discuss the evolution from the AlexNet era to today's compute scaling (1000x more performance per GPU, a million-fold more total compute).
Discussion of the changing role of academia in AI, the importance of open datasets like ImageNet, and current resource imbalances. Fei-Fei advocates for the National AI Research Resource (NAIRR) bill to provide compute and data infrastructure for public-sector research. Justin emphasizes that academia should focus on 'wacky ideas' and theoretical work rather than competing on scale.
Justin discusses how neural network architectures are constrained by hardware primitives, particularly how transformers are built around matrix multiplication because it maps well onto GPUs. He proposes exploring new primitives for distributed systems as hardware scales, noting that performance per watt is plateauing from Hopper to Blackwell.
The story of pioneering image captioning work combining convolutional neural networks (for image representation) with LSTM language models. Andrej Karpathy and Fei-Fei achieved this in 2015, concurrently with and independently of Google. Justin joined to extend this to 'dense captioning': describing different parts of an image with region-specific captions.
Deep dive into why spatial intelligence differs from language models. Fei-Fei argues that 3D/4D spatial worlds have structural properties fundamentally different from 1D generative signals like text. Discussion covers pixel maximalism, the information lost in tokenizing text (fonts, layout), and whether models can learn causal physical laws versus merely fitting patterns.
Technical explanation of Marble's architecture, which uses Gaussian splats as atomic units: semi-transparent particles with a position and orientation in 3D space. The representation enables real-time rendering on mobile devices and VR headsets. Discusses multimodal inputs (text, images), interactive editing capabilities, and export formats, plus emerging use cases in gaming, VFX, film, interior design, and robotics training.
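To make the splat idea concrete, here is a minimal sketch of a splat record and front-to-back alpha compositing along a viewing ray. The field names and the `composite` helper are illustrative assumptions; Marble's internal representation and renderer are not public.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    # Illustrative fields only; not World Labs' actual format.
    position: np.ndarray   # (3,) center in world space
    rotation: np.ndarray   # (4,) unit quaternion for orientation
    scale: np.ndarray      # (3,) per-axis extent of the Gaussian
    opacity: float         # 0..1; splats are semi-transparent
    color: np.ndarray      # (3,) RGB

def composite(splats_front_to_back):
    """Alpha-composite splat colors along a ray, front to back."""
    color = np.zeros(3)
    transmittance = 1.0  # fraction of light still passing through
    for s in splats_front_to_back:
        color += transmittance * s.opacity * s.color
        transmittance *= (1.0 - s.opacity)
    return color

# Usage: a half-opaque red splat in front of an opaque blue one.
near = GaussianSplat(np.zeros(3), np.array([1.0, 0, 0, 0]),
                     np.ones(3), 0.5, np.array([1.0, 0.0, 0.0]))
far = GaussianSplat(np.array([0, 0, 1.0]), np.array([1.0, 0, 0, 0]),
                    np.ones(3), 1.0, np.array([0.0, 0.0, 1.0]))
pixel = composite([near, far])  # blends red over blue
```

The per-splat compositing is what makes the representation cheap enough for real-time rendering: it is a sorted weighted sum rather than a per-ray neural network evaluation.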
Discussion of approaches to integrate physics into world models: explicit (measuring forces, running physics simulations) versus emergent (hoping physics emerges from end-to-end training). Options include attaching physical properties to Gaussian splats, using classical physics engines, or regenerating entire scenes in response to user actions.
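The 'explicit' option can be sketched as attaching physical state (mass, velocity) to splat positions and advancing them with a classical integrator. Everything here is a hypothetical illustration of that design choice, not Marble's API; a real system would use a full physics engine rather than bare Euler steps.

```python
import numpy as np

GRAVITY = np.array([0.0, -9.81, 0.0])  # m/s^2, y-up convention assumed

def euler_step(positions, velocities, dt):
    """One explicit-Euler step of free fall for an (n, 3) splat cloud."""
    velocities = velocities + GRAVITY * dt   # update velocity first
    positions = positions + velocities * dt  # then advance positions
    return positions, velocities

# Usage: drop four splats from rest and take one 10 ms step.
pos = np.zeros((4, 3))
vel = np.zeros((4, 3))
pos, vel = euler_step(pos, vel, 0.01)
```

The emergent alternative skips this machinery entirely and hopes end-to-end training makes generated frames obey physics; the trade-off discussed in the episode is controllability versus generality.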
Marble's potential for robotics training through synthetic data generation. Addresses the data-starvation problem in robotics: real-world data is scarce, and internet videos lack controllability. Marble provides a middle ground by generating diverse simulated environments for embodied-agent training, with precise control over states and interactions.
Fei-Fei defines spatial intelligence as the capability to reason, understand, move, and interact in space - complementary to linguistic intelligence, not 'traditional' intelligence. Uses examples from DNA structure discovery to everyday tasks like grasping a mug. Discusses Howard Gardner's theory of multiple intelligences and why vision is underappreciated despite being evolutionarily optimized over 540 million years versus ~500,000 years for language.
Justin clarifies that transformers are natively set models, not sequence models: only positional embeddings inject ordering. All core transformer operations (FFN, attention, normalization) are permutation equivariant. Discusses what needs to change for world models: attention and the core mechanisms can stay, but spatial modeling may require architectures beyond sequence-to-sequence.
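The permutation-equivariance claim is easy to check numerically: with no positional embeddings, permuting the rows of the input to self-attention just permutes the rows of the output the same way. A minimal NumPy sketch (single head, no masking, toy shapes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling input tokens shuffles outputs identically: a set model.
assert np.allclose(out[perm], out_perm)
```

Adding positional embeddings to `X` before the attention call breaks this symmetry, which is exactly how sequence order gets injected into an otherwise order-blind architecture.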
Call to action for talent across research (deep model training), engineering (systems, optimization, inference, product), and business (go-to-market, product thinking). Emphasis on intellectual fearlessness and exploring advanced Marble features. Vision of spatial intelligence as horizontal technology applicable across creative industries, robotics, design, and beyond.
What Comes After ChatGPT? The Mother of ImageNet Predicts The Future