How Intelligent Is AI, Really?
ARC-AGI is redefining how to measure progress on the path to AGI, focusing on reasoning, generalization, and adaptability instead of memorization or scale. During this month's NeurIPS 2025 conference...
Greg Kamradt, President of the ARC Prize Foundation, explains how the ARC-AGI benchmarks measure an AI's ability to learn new skills rather than memorize solutions. Unlike traditional benchmarks that simply pose harder problems, ARC-AGI focuses on generalization: its tasks are ones ordinary humans can solve but AI still struggles with. The upcoming ARC-AGI 3 will introduce interactive, game-like environments with no instructions, measuring efficiency by comparing the actions an AI takes against human performance rather than accuracy alone.
The ARC Prize Foundation redefines intelligence as the ability to learn new things efficiently, not just the ability to solve hard problems. Based on François Chollet's 2019 paper "On the Measure of Intelligence," the benchmark tests whether a system can acquire new skills the way humans do, rather than measuring performance on increasingly difficult PhD-level problems, as benchmarks like MMLU do.
Base LLMs performed terribly on ARC-AGI (GPT-4 at 4-5%) until reasoning models emerged. OpenAI's o1 jumped to 21% performance, revealing that reasoning paradigms were transformational. Major labs including OpenAI, xAI, Google, and Anthropic now use ARC-AGI as a standard benchmark in their model releases.
Kamradt warns against relying on reinforcement learning environments as a measure of true progress toward AGI. While RL can deliver short-term benchmark gains, it's like 'whack-a-mole': you can't build an RL environment for every future task. True generalization means a system learns without needing custom training environments, just as humans do.
ARC-AGI 1 (2019) had 800 static tasks created by Chollet. ARC-AGI 2 (2025) deepened the benchmark with harder tasks in the same format. ARC-AGI 3 (upcoming) introduces 150 interactive, game-like environments with no instructions, requiring systems to discover each goal through trial and error, mirroring how humans interact with reality.
ARC Prize measures intelligence through data efficiency and energy consumption, not wall-clock time, which is arbitrary because it depends on how much compute is thrown at a task. ARC-AGI 3 counts the number of actions an AI takes versus a human to complete each task, penalizing brute-force approaches like the millions of frames old Atari RL agents required.
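The episode doesn't give a formula, but a minimal sketch of an action-efficiency score along these lines might look like the following. The function names and record fields (`solved`, `ai_actions`, `human_actions`) are illustrative assumptions, not the official ARC Prize scoring code.

```python
def action_efficiency(ai_actions: int, human_actions: int) -> float:
    """Ratio of human actions to AI actions for one solved task.

    1.0 means the AI matched (or beat) human efficiency; values near 0
    mean the AI brute-forced the task with far more actions than a
    person needed.
    """
    if ai_actions <= 0:
        raise ValueError("a solved task must use at least one action")
    return min(1.0, human_actions / ai_actions)


def benchmark_score(results: list[dict]) -> float:
    """Average efficiency over all tasks; unsolved tasks score 0."""
    scores = [
        action_efficiency(r["ai_actions"], r["human_actions"])
        if r["solved"] else 0.0
        for r in results
    ]
    return sum(scores) / len(scores)


# Example: solving in 40 actions where humans take 30 scores 0.75,
# while a million-action brute-force run scores nearly 0 even though
# both "pass" on accuracy alone.
print(benchmark_score([
    {"solved": True, "ai_actions": 40, "human_actions": 30},
    {"solved": True, "ai_actions": 1_000_000, "human_actions": 25},
    {"solved": False, "ai_actions": 500, "human_actions": 20},
]))
```

Under this kind of scoring, raw accuracy stops being enough: an agent is rewarded for reaching the goal in roughly as few actions as a human would, which is exactly the contrast with frame-hungry Atari-era RL drawn above.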
Solving ARC-AGI is necessary for AGI but not sufficient—it proves generalization capability, not full AGI. A system beating ARC-AGI 3 would provide the most authoritative evidence of generalization to date. ARC Prize positions itself to analyze such systems and ultimately declare when true AGI arrives.