Fei-Fei Li is a Stanford professor, co-director of the Stanford Institute for Human-Centered Artificial Intelligence, and co-founder of World Labs. She created ImageNet, the dataset that sparked the deep learning revolution.
Fei-Fei Li and Justin Johnson of World Labs discuss their journey from ImageNet to building Marble, the first model generating explorable 3D worlds. They explore why spatial intelligence is fundamentally different from language models, requiring understanding of 3D structure, physics, and embodied interaction. Key insights include transformers being set models not sequence models, the role of Gaussian splats in real-time 3D rendering, and emerging use cases in gaming, VFX, robotics training, and interior design.
Overview of Fei-Fei Li's creation of ImageNet and her reunion with former PhD student Justin Johnson to build World Labs. They introduce Marble as the first model generating explorable 3D worlds from text or images, and discuss the evolution from the AlexNet era to today's compute scaling (1000x more performance per GPU, a million-fold more total compute).
Discussion of the changing role of academia in AI, the importance of open datasets like ImageNet, and current resource imbalances. Fei-Fei advocates for the National AI Research Resource (NAIRR) bill to provide compute and data infrastructure for public-sector research. Justin emphasizes that academia should focus on 'wacky ideas' and theoretical work rather than competing on scale.
Justin discusses how neural network architectures are constrained by hardware primitives, particularly how transformers are built around matrix multiplication because it maps well onto GPUs. He proposes exploring new primitives for distributed systems as hardware scales, noting that performance per watt is plateauing from Hopper to Blackwell.
The story of pioneering image captioning work combining convolutional neural networks (for image representation) with LSTM language models. Andrej Karpathy and Fei-Fei achieved this in 2015, concurrently with and independently of Google. Justin joined to extend this to 'dense captioning': describing different parts of an image with region-specific captions.
Deep dive into why spatial intelligence differs from language models. Fei-Fei argues that 3D/4D spatial worlds have structural properties fundamentally different from 1D generative signals like text. Discussion covers pixel maximalism, the information lost in tokenizing text (fonts, layout), and whether models can learn causal physical laws versus merely fitting patterns.
Technical explanation of Marble's architecture, which uses Gaussian splats as atomic units: semi-transparent particles with a position and orientation in 3D space. The representation enables real-time rendering on mobile devices and VR headsets. Discusses multimodal inputs (text, images), interactive editing capabilities, and export formats, plus emerging use cases in gaming, VFX, film, interior design, and robotics training.
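To make the splat idea concrete, here is a minimal sketch of a splat record and front-to-back alpha compositing along a viewing ray. The field names and the `composite` helper are illustrative assumptions; Marble's internal representation and renderer are not public.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    # Illustrative fields only; not World Labs' actual format.
    position: np.ndarray   # (3,) center in world space
    rotation: np.ndarray   # (4,) unit quaternion for orientation
    scale: np.ndarray      # (3,) per-axis extent of the Gaussian
    opacity: float         # 0..1; splats are semi-transparent
    color: np.ndarray      # (3,) RGB

def composite(splats_front_to_back):
    """Alpha-composite splat colors along a ray, front to back."""
    color = np.zeros(3)
    transmittance = 1.0  # fraction of light still passing through
    for s in splats_front_to_back:
        color += transmittance * s.opacity * s.color
        transmittance *= (1.0 - s.opacity)
    return color

# Usage: a half-opaque red splat in front of an opaque blue one.
near = GaussianSplat(np.zeros(3), np.array([1.0, 0, 0, 0]),
                     np.ones(3), 0.5, np.array([1.0, 0.0, 0.0]))
far = GaussianSplat(np.array([0, 0, 1.0]), np.array([1.0, 0, 0, 0]),
                    np.ones(3), 1.0, np.array([0.0, 0.0, 1.0]))
pixel = composite([near, far])  # blends red over blue
```

The per-splat compositing is what makes the representation cheap enough for real-time rendering: it is a sorted weighted sum rather than a per-ray neural network evaluation.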
Discussion of approaches to integrate physics into world models: explicit (measuring forces, running physics simulations) versus emergent (hoping physics emerges from end-to-end training). Options include attaching physical properties to Gaussian splats, using classical physics engines, or regenerating entire scenes in response to user actions.
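The 'explicit' option can be sketched as attaching physical state (mass, velocity) to splat positions and advancing them with a classical integrator. Everything here is a hypothetical illustration of that design choice, not Marble's API; a real system would use a full physics engine rather than bare Euler steps.

```python
import numpy as np

GRAVITY = np.array([0.0, -9.81, 0.0])  # m/s^2, y-up convention assumed

def euler_step(positions, velocities, dt):
    """One explicit-Euler step of free fall for an (n, 3) splat cloud."""
    velocities = velocities + GRAVITY * dt   # update velocity first
    positions = positions + velocities * dt  # then advance positions
    return positions, velocities

# Usage: drop four splats from rest and take one 10 ms step.
pos = np.zeros((4, 3))
vel = np.zeros((4, 3))
pos, vel = euler_step(pos, vel, 0.01)
```

The emergent alternative skips this machinery entirely and hopes end-to-end training makes generated frames obey physics; the trade-off discussed in the episode is controllability versus generality.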
Marble's potential for robotics training through synthetic data generation. Addresses the data-starvation problem in robotics: real-world data is scarce, and internet videos lack controllability. Marble provides a middle ground by generating diverse simulated environments for embodied-agent training, with precise control over states and interactions.
Fei-Fei defines spatial intelligence as the capability to reason, understand, move, and interact in space - complementary to linguistic intelligence, not 'traditional' intelligence. Uses examples from DNA structure discovery to everyday tasks like grasping a mug. Discusses Howard Gardner's theory of multiple intelligences and why vision is underappreciated despite being evolutionarily optimized over 540 million years versus ~500,000 years for language.
Justin clarifies that transformers are natively set models, not sequence models: only positional embeddings inject ordering. All core transformer operations (FFN, attention, normalization) are permutation equivariant. Discusses what needs to change for world models: attention and the core mechanisms can stay, but spatial modeling may require architectures beyond sequence-to-sequence.
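The permutation-equivariance claim is easy to check numerically: with no positional embeddings, permuting the rows of the input to self-attention just permutes the rows of the output the same way. A minimal NumPy sketch (single head, no masking, toy shapes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling input tokens shuffles outputs identically: a set model.
assert np.allclose(out[perm], out_perm)
```

Adding positional embeddings to `X` before the attention call breaks this symmetry, which is exactly how sequence order gets injected into an otherwise order-blind architecture.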
Call to action for talent across research (deep model training), engineering (systems, optimization, inference, product), and business (go-to-market, product thinking). Emphasis on intellectual fearlessness and exploring advanced Marble features. Vision of spatial intelligence as horizontal technology applicable across creative industries, robotics, design, and beyond.
What Comes After ChatGPT? The Mother of ImageNet Predicts The Future