From an undergraduate research seminar at Princeton to winning a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom.
Princeton researchers won a NeurIPS 2025 Best Paper award by scaling reinforcement learning networks to 1,000 layers, a feat previously thought impossible. The breakthrough came from shifting from traditional value-based RL to self-supervised contrastive learning, treating RL as a classification problem rather than regression. Combining this with architectural tricks like residual connections and layer normalization, they achieved state-of-the-art performance on goal-conditioned RL tasks using GPU-accelerated JAX environments, all trainable on a single H100 GPU.
The team shares how their award-winning paper originated in an undergraduate research seminar at Princeton. Despite advisor skepticism about deep networks in RL (historically limited to 2-4 layers), they pursued the high-risk bet at relatively low cost, since much of the infrastructure already existed.
The fundamental problem: while NLP and vision scaled to billions of parameters, RL remained stuck with shallow 2-layer MLPs. Traditional value-based RL doesn't scale due to noisy, biased TD error regression. The solution: shift to self-supervised RL, using contrastive learning over state-action and future-state representations.
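A minimal sketch of what that representation setup could look like in JAX/Flax: one encoder embeds the (state, action) pair, a second embeds a candidate future state, and an inner product scores how well they match. The module names, layer widths, and dimensions here are illustrative assumptions, not the paper's actual architecture.

```python
import jax
import jax.numpy as jnp
from flax import linen as nn

class SAEncoder(nn.Module):
    """Embeds a (state, action) pair into a representation vector."""
    repr_dim: int = 64  # illustrative size, not the paper's setting

    @nn.compact
    def __call__(self, state, action):
        x = jnp.concatenate([state, action], axis=-1)
        x = jax.nn.relu(nn.Dense(256)(x))
        return nn.Dense(self.repr_dim)(x)

class FutureStateEncoder(nn.Module):
    """Embeds a candidate future state (goal) into the same space."""
    repr_dim: int = 64

    @nn.compact
    def __call__(self, future_state):
        x = jax.nn.relu(nn.Dense(256)(future_state))
        return nn.Dense(self.repr_dim)(x)

def critic_score(sa_repr, fs_repr):
    # Inner product between the two embeddings: a large value means
    # "this future state looks reachable from this state-action pair".
    return jnp.einsum("...d,...d->...", sa_repr, fs_repr)
```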
Success required a non-obvious combination of factors. Scaling width or batch size alone didn't work. The breakthrough came from combining increased depth with residual connections and layer normalization, revealing critical depth thresholds where performance multiplied dramatically.
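Read literally, "depth plus residual connections plus layer normalization" suggests stacking residual blocks like the hedged sketch below; the block layout and default sizes are assumptions for illustration, not the exact design from the paper.

```python
import jax
from flax import linen as nn

class ResidualBlock(nn.Module):
    """Dense -> LayerNorm -> ReLU, then Dense -> LayerNorm, with a skip connection."""
    width: int

    @nn.compact
    def __call__(self, x):
        h = jax.nn.relu(nn.LayerNorm()(nn.Dense(self.width)(x)))
        h = nn.LayerNorm()(nn.Dense(self.width)(h))
        return x + h  # the skip connection keeps gradients usable at extreme depth

class DeepEncoder(nn.Module):
    """A stack of residual blocks; depth (num_blocks) is the axis being scaled."""
    width: int = 256
    num_blocks: int = 8  # illustrative default; the point is this can be pushed far deeper

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.width)(x)  # project the input to the block width once
        for _ in range(self.num_blocks):
            x = ResidualBlock(self.width)(x)
        return x
```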
The method challenges the definition of reinforcement learning itself: no code explicitly maximizes rewards. Instead, it shifts the learning burden from noisy TD regression to binary classification (is this future state on the same trajectory?), leveraging the proven scalability of cross-entropy loss and representation learning.
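Concretely, the "same trajectory or not?" question can be posed once per (state-action, future state) combination in a batch and trained with plain sigmoid cross-entropy. This is a hedged sketch under that reading; the function and argument names are hypothetical.

```python
import jax.numpy as jnp
import optax

def same_trajectory_loss(sa_reprs, future_reprs):
    """sa_reprs, future_reprs: [batch, dim] embeddings from the two encoders.
    Row i of each comes from the same trajectory, so positives sit on the diagonal."""
    logits = sa_reprs @ future_reprs.T   # score every state-action pair against every future state
    labels = jnp.eye(logits.shape[0])    # 1 = same trajectory, 0 = different trajectory
    # Binary cross-entropy: "is this future state on the same trajectory?"
    return optax.sigmoid_binary_cross_entropy(logits, labels).mean()
```

A softmax (InfoNCE-style) variant that picks the matching future state out of the batch is a common alternative formulation of the same idea.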
Training uses JAX GCRL environments to collect thousands of parallel trajectories on GPU, generating hundreds of millions of transitions in hours. This massive data generation (50M+ transitions needed for performance gains) mirrors the internet-scale data that enabled LLM scaling.
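A hedged sketch of what collecting thousands of parallel trajectories on a GPU can look like with `jax.vmap` and `jax.lax.scan`; `env_reset`, `env_step`, and `policy` are hypothetical placeholders, not the actual API of the JAX GCRL environments mentioned above.

```python
import jax

def collect_transitions(env_reset, env_step, policy, key, num_envs=4096, horizon=1000):
    """Roll out num_envs environments for `horizon` steps entirely on the accelerator."""
    reset_key, scan_key = jax.random.split(key)
    states = jax.vmap(env_reset)(jax.random.split(reset_key, num_envs))

    def step(states, step_key):
        actions = jax.vmap(policy)(states, jax.random.split(step_key, num_envs))
        next_states = jax.vmap(env_step)(states, actions)
        return next_states, (states, actions, next_states)

    _, transitions = jax.lax.scan(step, states, jax.random.split(scan_key, horizon))
    return transitions  # horizon * num_envs transitions per call (~4M with these defaults)
```

Repeating calls like this across training is how transition counts reach the tens of millions without any CPU-side simulation loop.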
The approach performs implicit next-state prediction through binary classification rather than explicit frame prediction. This creates a learned model of the environment without high-dimensional complexity, similar to how poker players classify opponent hand ranges rather than predicting exact cards.
The work offers an alternative to imitation learning for robotics. Instead of collecting massive human demonstration datasets, goal-conditioned RL with deep networks could train robots with zero human supervision, scaling architecture rather than manual data collection.
Key future work includes distilling deep teachers into shallow students for efficient deployment, scaling simultaneously across depth, width, and batch size with more compute, and exploring vision-language-action models for robotics. Deep networks also unlock batch-size scaling that was previously ineffective in traditional RL.
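For the distillation direction, one plausible (purely illustrative) objective is to regress a shallow student's embeddings onto the frozen deep teacher's; the `*_apply` signatures below are assumptions, not code from the paper.

```python
import jax
import jax.numpy as jnp

def distill_loss(student_apply, student_params, teacher_apply, teacher_params, states, actions):
    """Match the shallow student's embeddings to the frozen deep teacher's."""
    teacher_repr = jax.lax.stop_gradient(teacher_apply(teacher_params, states, actions))
    student_repr = student_apply(student_params, states, actions)
    return jnp.mean((student_repr - teacher_repr) ** 2)  # simple regression onto the teacher
```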
[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton