From an undergraduate research seminar at Princeton to winning a Best Paper award at NeurIPS 2025, Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Tomasz Trzcinski, and Benjamin Eysenbach defied conventional wisdom.
Princeton researchers won a NeurIPS 2025 Best Paper award by scaling reinforcement learning networks to 1,000 layers, a feat previously thought impossible. The breakthrough came from shifting from traditional value-based RL to self-supervised contrastive learning, treating RL as a classification problem rather than regression. Combining this with architectural tricks like residual connections and layer normalization, they achieved state-of-the-art performance on goal-conditioned RL tasks using GPU-accelerated JAX environments, all trainable on a single H100 GPU.
The team shares how their award-winning paper originated in an undergraduate research seminar at Princeton. Despite advisor skepticism about deep networks in RL (historically limited to 2-4 layers), they pursued the high-risk bet at relatively low cost, since much of the infrastructure already existed.
The fundamental problem: while NLP and vision scaled to billions of parameters, RL remained stuck with shallow 2-layer MLPs. Traditional value-based RL doesn't scale due to noisy, biased TD error regression. The solution: shift to self-supervised RL, using contrastive learning over state-action and future-state representations.
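A minimal sketch of what that representation setup could look like in JAX/Flax: one encoder embeds the (state, action) pair, a second embeds a candidate future state, and an inner product scores how well they match. The module names, layer widths, and dimensions here are illustrative assumptions, not the paper's actual architecture.

```python
import jax
import jax.numpy as jnp
from flax import linen as nn

class SAEncoder(nn.Module):
    """Embeds a (state, action) pair into a representation vector."""
    repr_dim: int = 64  # illustrative size, not the paper's setting

    @nn.compact
    def __call__(self, state, action):
        x = jnp.concatenate([state, action], axis=-1)
        x = jax.nn.relu(nn.Dense(256)(x))
        return nn.Dense(self.repr_dim)(x)

class FutureStateEncoder(nn.Module):
    """Embeds a candidate future state (goal) into the same space."""
    repr_dim: int = 64

    @nn.compact
    def __call__(self, future_state):
        x = jax.nn.relu(nn.Dense(256)(future_state))
        return nn.Dense(self.repr_dim)(x)

def critic_score(sa_repr, fs_repr):
    # Inner product between the two embeddings: a large value means
    # "this future state looks reachable from this state-action pair".
    return jnp.einsum("...d,...d->...", sa_repr, fs_repr)
```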
Success required a non-obvious combination of factors. Scaling width or batch size alone didn't work. The breakthrough came from combining increased depth with residual connections and layer normalization, revealing critical depth thresholds where performance multiplied dramatically.
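Read literally, "depth plus residual connections plus layer normalization" suggests stacking residual blocks like the hedged sketch below; the block layout and default sizes are assumptions for illustration, not the exact design from the paper.

```python
import jax
from flax import linen as nn

class ResidualBlock(nn.Module):
    """Dense -> LayerNorm -> ReLU, then Dense -> LayerNorm, with a skip connection."""
    width: int

    @nn.compact
    def __call__(self, x):
        h = jax.nn.relu(nn.LayerNorm()(nn.Dense(self.width)(x)))
        h = nn.LayerNorm()(nn.Dense(self.width)(h))
        return x + h  # the skip connection keeps gradients usable at extreme depth

class DeepEncoder(nn.Module):
    """A stack of residual blocks; depth (num_blocks) is the axis being scaled."""
    width: int = 256
    num_blocks: int = 8  # illustrative default; the point is this can be pushed far deeper

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.width)(x)  # project the input to the block width once
        for _ in range(self.num_blocks):
            x = ResidualBlock(self.width)(x)
        return x
```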
The method challenges the definition of reinforcement learning itself: no code explicitly maximizes rewards. Instead, it shifts the learning burden from noisy TD regression to binary classification (is this future state on the same trajectory?), leveraging the proven scalability of cross-entropy loss and representation learning.
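Concretely, the "same trajectory or not?" question can be posed once per (state-action, future state) combination in a batch and trained with plain sigmoid cross-entropy. This is a hedged sketch under that reading; the function and argument names are hypothetical.

```python
import jax.numpy as jnp
import optax

def same_trajectory_loss(sa_reprs, future_reprs):
    """sa_reprs, future_reprs: [batch, dim] embeddings from the two encoders.
    Row i of each comes from the same trajectory, so positives sit on the diagonal."""
    logits = sa_reprs @ future_reprs.T   # score every state-action pair against every future state
    labels = jnp.eye(logits.shape[0])    # 1 = same trajectory, 0 = different trajectory
    # Binary cross-entropy: "is this future state on the same trajectory?"
    return optax.sigmoid_binary_cross_entropy(logits, labels).mean()
```

A softmax (InfoNCE-style) variant that picks the matching future state out of the batch is a common alternative formulation of the same idea.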
Training uses JAX GCRL environments to collect thousands of parallel trajectories on GPU, generating hundreds of millions of transitions in hours. This massive data generation (50M+ transitions needed for performance gains) mirrors the internet-scale data that enabled LLM scaling.
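A hedged sketch of what collecting thousands of parallel trajectories on a GPU can look like with `jax.vmap` and `jax.lax.scan`; `env_reset`, `env_step`, and `policy` are hypothetical placeholders, not the actual API of the JAX GCRL environments mentioned above.

```python
import jax

def collect_transitions(env_reset, env_step, policy, key, num_envs=4096, horizon=1000):
    """Roll out num_envs environments for `horizon` steps entirely on the accelerator."""
    reset_key, scan_key = jax.random.split(key)
    states = jax.vmap(env_reset)(jax.random.split(reset_key, num_envs))

    def step(states, step_key):
        actions = jax.vmap(policy)(states, jax.random.split(step_key, num_envs))
        next_states = jax.vmap(env_step)(states, actions)
        return next_states, (states, actions, next_states)

    _, transitions = jax.lax.scan(step, states, jax.random.split(scan_key, horizon))
    return transitions  # horizon * num_envs transitions per call (~4M with these defaults)
```

Repeating calls like this across training is how transition counts reach the tens of millions without any CPU-side simulation loop.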
The approach performs implicit next-state prediction through binary classification rather than explicit frame prediction. This creates a learned model of the environment without high-dimensional complexity, similar to how poker players classify opponent hand ranges rather than predicting exact cards.
The work offers an alternative to imitation learning for robotics. Instead of collecting massive human demonstration datasets, goal-conditioned RL with deep networks could train robots with zero human supervision, scaling architecture rather than manual data collection.
Key future work includes distilling deep teachers into shallow students for efficient deployment, scaling simultaneously across depth, width, and batch size with more compute, and exploring vision-language-action models for robotics. Deep networks also unlock batch-size scaling that was previously ineffective in traditional RL.
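For the distillation direction, one plausible (purely illustrative) objective is to regress a shallow student's embeddings onto the frozen deep teacher's; the `*_apply` signatures below are assumptions, not code from the paper.

```python
import jax
import jax.numpy as jnp

def distill_loss(student_apply, student_params, teacher_apply, teacher_params, states, actions):
    """Match the shallow student's embeddings to the frozen deep teacher's."""
    teacher_repr = jax.lax.stop_gradient(teacher_apply(teacher_params, states, actions))
    student_repr = student_apply(student_params, states, actions)
    return jnp.mean((student_repr - teacher_repr) ** 2)  # simple regression onto the teacher
```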
[NeurIPS Best Paper] 1000 Layer Networks for Self-Supervised RL — Kevin Wang et al, Princeton