Pim de Witte, CEO of General Intuition (GI), discusses spinning out from Medal (a 12M-user game clipping platform) to build world models trained on 3.8B gameplay highlights. After turning down OpenAI's reported $500M offer for Medal's data, GI raised a $134M seed from Khosla Ventures—Vinod's largest bet since OpenAI. The conversation covers their vision-based agents achieving superhuman gameplay through imitation learning, their unique action-labeled dataset that preserves privacy, and their strategy to become the foundation model for spatial reasoning across gaming, simulation, and eventually robotics.
Pim demonstrates GI's vision-based agents playing Counter-Strike purely from pixels, showing progression from 4 months ago to current superhuman performance. The agents exhibit human-like behaviors (checking scoreboards, getting unstuck) while also displaying peak performance from training on highlight clips. He then shows world models with unique features like mouse sensitivity, spatial memory over 20-second generations, and the ability to handle partial observability (smoke grenades).
Pim explains Medal's 10-year evolution from a simple recorder to a 12M-user platform with more active creators than Twitch. The key innovation was retroactive clipping—running a background recorder that exports only the last 30 seconds when you hit a button, similar to Tesla's bug reporting. This approach captured authentic gameplay without changing player behavior, creating the foundation for GI's unique dataset.
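The retroactive-clipping mechanism described above can be sketched as a fixed-size rolling buffer: frames are continuously appended, old ones silently expire, and nothing is persisted until the player presses the clip hotkey. This is a minimal illustration, not Medal's implementation; the class name and 30-second/30-fps parameters are assumptions for the sketch.

```python
from collections import deque

class RetroactiveClipper:
    """Rolling window of recent frames, exported only on demand.

    Illustrative sketch of retroactive clipping: the recorder runs in
    the background, but a clip exists only when the player asks for it,
    so normal gameplay behavior is unchanged.
    """

    def __init__(self, fps=30, seconds=30):
        # deque with maxlen drops the oldest frame automatically
        self.buffer = deque(maxlen=fps * seconds)

    def on_frame(self, frame):
        self.buffer.append(frame)

    def export_clip(self):
        # Called when the player hits the clip button:
        # returns only the last `seconds` of footage.
        return list(self.buffer)
```

The design choice worth noting is that the buffer bounds memory use regardless of session length, which is what makes an always-on background recorder practical.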
GI's critical design decision was logging actions (jump, crouch, shoot) rather than raw keystrokes (W, A, S, D), solving privacy concerns while creating superior training data. They employed thousands of humans to label every possible action across games over 18 months. This approach converts inputs to semantic actions, making the data usable for training while preventing individual tracking.
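The input-to-action conversion might look like the following sketch: a per-game binding table maps raw key events to semantic labels, and anything unmapped (such as chat typing) is dropped. The table contents and function names here are hypothetical; per the episode, the real labels came from 18 months of human annotation across games.

```python
# Hypothetical per-game keybinding table (illustrative values only).
KEYBINDS = {
    "counter-strike": {
        "w": "move_forward",
        "a": "move_left",
        "s": "move_back",
        "d": "move_right",
        "space": "jump",
        "ctrl": "crouch",
        "mouse1": "shoot",
    },
}

def to_semantic_actions(game, keystrokes):
    """Convert raw key events into semantic action labels.

    Keys with no binding (e.g. chat text) are discarded, which is the
    privacy property described above: individual typing never enters
    the training data, only game-level actions do.
    """
    binds = KEYBINDS.get(game, {})
    return [binds[k] for k in keystrokes if k in binds]
```

Dropping unmapped keys, rather than logging them raw, is what makes the same design decision serve both privacy and data quality.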
After reading papers like Diamond, Genie, and SIMA, Pim cold-emailed research teams and chose to build independently rather than join a lab. Vinod Khosla's investment process involved defending a 2030 vision from first principles under intense technical questioning. The team includes Diamond paper authors and Anthony Hu from GAIA-2, with GI able to publish openly due to their data moat.
Pim explains why games provide superior training data compared to simulation or YouTube videos. Simulation complexity explodes with number of agents, degrees of freedom, and information revealed per action. Games already contain the stochasticity and edge cases, while YouTube requires solving pose estimation, inverse dynamics, and optical dynamics—three layers of information loss.
GI's initial customers are game developers and engine companies, replacing deterministic behavior trees with a single API: stream frames, get actions back. The goal is moving pre-training to post-training for robotics companies—if your robot uses game controller inputs, GI can provide the foundation model, requiring only 1-10% of typical training data for specialization.
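From the game developer's side, replacing a behavior tree with this kind of API reduces to a single loop: stream frames out, apply the actions that come back. The sketch below illustrates that shape only; the function names are assumptions, and `policy` stands in for whatever remote model answers the frame stream.

```python
def agent_loop(stream_frames, send_action, policy):
    """Drive an NPC from a frames-in/actions-out policy.

    A deterministic behavior tree is replaced by one loop: each frame
    is handed to `policy` (a stand-in for the remote model) and the
    returned action is applied to the game.
    """
    for frame in stream_frames():
        action = policy(frame)
        send_action(action)

# Usage with local stand-ins for the game and the model:
if __name__ == "__main__":
    frames = ["frame_0", "frame_1", "frame_2"]
    applied = []
    agent_loop(lambda: iter(frames), applied.append,
               lambda f: {"source": f, "action": "move_forward"})
    print(applied)
```

The point of the shape is that the game integration stays constant while the intelligence behind `policy` can improve independently.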
GI's research roadmap involves making all 3.8B clips on Medal playable inside world models, enabling the transition from imitation learning to reinforcement learning. Each clip represents episodic memory—the most out-of-distribution moments from hours of gameplay. By loading negative events (crashes, failures) into world models with ground-truth action labels, they can train reward models at unprecedented scale.
By 2030, GI aims to be the gold standard for spatial-temporal intelligence, powering 80% of atoms-to-atoms AI interactions (robotics, physical world) and 100x more in simulation. The bet is that supply chains will converge on gaming inputs as the standard interface because intelligence is the bottleneck, not hardware. Simulation will see faster adoption due to fewer safety constraints, with scientific use cases as a key focus.
World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI