ARC-AGI is redefining how to measure progress on the path to AGI - focusing on reasoning, generalization, and adaptability instead of memorization or scale. During this month's NeurIPS 2025 confere...
I'm excited today to welcome Greg Kamradt, who is the president of the ARC Prize Foundation.
That's right.
Thanks for coming here at NeurIPS twenty twenty five in beautiful San Diego.
Thank you, Diana.
So what does the ARC Prize Foundation do?
Yes. So the ARC Prize Foundation is a nonprofit, but it's a little bit of a different nonprofit because we are very tech forward. Our mission is to pull forward open progress toward systems that can generalize just like humans.
So according to François Chollet, he defines intelligence as the ability to learn new things a lot more efficiently. What does that mean for founders as they look at all these benchmarks for all these model releases that are chasing MMLU bench numbers?
Yes. Absolutely. Well, one of the cool things about ARC Prize is we have a very opinionated definition of intelligence. And this came from François Chollet's 2019 paper, "On the Measure of Intelligence." Normally you would think that intelligence is how well you can score on the SAT or how hard the math problems you can do are.
And he actually proposed an alternative theory, which is the foundation for what ARC Prize does. He defined intelligence as your ability to learn new things. So we already know that AI is really good at chess. It's superhuman. We know that AI is really good at Go.
It's superhuman. We know that it's really good at self-driving. But getting those same systems to learn something else, a different skill, that is actually the hard part. And so alongside that proposal of his definition of intelligence, François says, well, I don't just have a definition. I also have a benchmark, a test of whether or not you can learn new things.
Because generally, people are gonna learn new things over a long horizon: a couple hours, a couple days, or maybe over a lifetime. But he proposed a test called ARC-AGI, or at the time it was just called the ARC benchmark. And in it, he tests your ability to learn new things. So what's really cool is that not only can humans take this test, but machines can take it too. Whereas other benchmarks might try to do what I call PhD-plus-plus problems, harder and harder.
So we had MMLU, we had an MMLU-plus, and now we have Humanity's Last Exam. Those are going superhuman. Right? ARC benchmarks, normal people can do these. And so we actually test all of our benchmarks to make sure that normal people can do them.
And just a bit of context for the audience: this particular prize was famously one where a lot of LLMs with just pre-training, before RL came into the picture before 2024, all these large language models were doing terribly. Right?
Yes. Absolutely. Doing terribly. And, you know, it's kind of weird, but nowadays it's hard to come up with problems to stump AI. Back in 2012 with ImageNet, all you needed to do was show an image of a cat, and you could stump the computer.
But when François Chollet came out with his benchmark in 2019, fast-forward all the way to 2024, I think at the time it was GPT-4, the base model, no reasoning. I think it was getting 4%, four or five percent. So it clearly showed, hey, humans can do this, but base models are not doing anything. And what's really cool actually is right at o1, I remember testing o1 and o1-preview right when they first came out, I think performance jumped up to 21%.
So you look at that: after five years, scores were only at 4%, and then in such a short time, it goes to 21. That tells you something really interesting is going on. So actually, we used ARC to identify that the reasoning paradigm was huge. That was actually transformational for what was contributing to AI at the time.
So much so that now all the big labs, xAI, OpenAI, are actually now using ARC-AGI as part of their model releases and the numbers that they're hitting.
Uh-huh.
So it's become the standard now.
Yeah. Well, I tell you what, we're excited that the community is recognizing that ARC-AGI can tell you something. That's what we're excited about. And when public labs or frontier labs like to use us in terms of reporting their performance, it's really awesome that they too say, yes.
We just came out with this frontier model, and this is how we choose to measure our performance. And so in the past twelve months, you're right, we've had OpenAI. We've had xAI with Grok 4.
We've had Gemini with Gemini 3 Pro and Deep Think. And then just recently, Anthropic with Opus 4.5.
That's cool. So Yeah. What's going well with all these releases?
So it's going really well that they're adopting it. However, we're mindful of vanity metrics that come from there too. So just because they use us doesn't necessarily mean that our mission is done, or our job is done, or what we're trying to do here. Because, again, if we go back to the mission of ARC Prize, it's to pull forward open AGI progress. So we wanna inspire researchers, small teams, individual researchers, and having big labs give an endorsement, more or less, is really good for that mission, but it's also secondary to the overall mission.
So now that you've seen lots of teams trying to ship AI products, what are the most common false positives that you observe? Things that feel like progress but aren't quite progress, because it's easy to perhaps just hit a benchmark somewhere
Sure.
And call it done.
Sure.
But it doesn't quite work.
Yeah. So when I answer that question, I put on my almost-researcher hat. Because there's two hats that are very prominent within AI right now. There's the economically valuable, you know, we're-gonna-go-monetize-this-product hat, and then there's the, call it, romantic pursuit of general intelligence hat. And I'm wearing the latter hat.
So one thing that stands out to me, of course, because everybody talks about it, is all the RL environments. And there's been famous AI researchers that have said, hey, as long as we can make an RL environment, we can score well on this benchmark or this domain or whatever it may be. To me, that's kinda like whack-a-mole. You know?
You're not gonna be able to make RL environments for every single thing you're gonna end up wanting to do. And core to ARC-AGI is novelty, novel problems that end up coming in the future, which is one of the reasons why we have a hidden test set, by the way. So I think while that's cool and while you're gonna get short-term gains from it, I would rather see investment into systems that are actually generalizing, where you don't need the environment for it. Because if you compare it to humans, humans don't need the environment to go and train on that.
Perhaps walk us through a bit of the history of ARC-AGI versions. So it was ARC-AGI-1 Yeah. Two, and three is coming up soon
Yes.
Which is a whole new thing with game-like environments Yes. And interactive. So walk us through the history and then tell us what three is all about.
Yes. Absolutely. So ARC-AGI-1 came out in 2019. François Chollet proposed it. I think he made all 800 tasks himself within it, which is a huge feat in and of itself.
And that came with his paper "On the Measure of Intelligence." Now in 2025, just this year, earlier in March, we came out with ARC-AGI-2. And so think of that as a deeper version or an upgraded version of ARC-AGI-1. Now, what's interesting is those two are both static benchmarks. We're coming out with ARC-AGI-3 next year, and the big difference with ARC-AGI-3 is it's gonna be interactive.
So, if you think about reality and the world that we all live in, we are constantly making an action, getting feedback, and kind of going back and forth with our environment. And it is my belief that future AGI will be declared with an interactive benchmark, because that is really what reality is. And so v3 is gonna be about 150 video-game environments. Now, we say video game because that's an easy way to communicate it, but really it's an environment where you give an action and then you get some response. Now, the really cool part, and one of the things that jazzes me up about v3 the most, is we're not gonna give any instructions to the test taker on how to complete the environment.
So there's no English, there's no words, there's no symbols or anything like that. And in order to beat the benchmark, you need to go in, take a few actions, see how your environment responds, and try to figure out what the ultimate goal is in the first place.
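To make that action-and-feedback loop concrete, here is a minimal sketch of what an instruction-free, turn-based environment and a naive agent might look like. The `Environment` class, its action space, and the `play` helper are all hypothetical illustrations; the real ARC-AGI-3 interface has not been described here.

```python
# Hypothetical sketch of the action/feedback loop described above: the agent
# receives no instructions, only observations, and must infer the goal itself.
import random


class Environment:
    """Toy stand-in for an instruction-free, turn-based game environment."""

    ACTIONS = ["up", "down", "left", "right"]

    def __init__(self, goal_position=3):
        self.position = 0
        self.goal = goal_position  # never revealed to the agent

    def step(self, action):
        # The agent only sees an observation and a done flag -- no text,
        # no symbols, no explanation of what success means.
        if action == "right":
            self.position += 1
        elif action == "left":
            self.position -= 1
        done = self.position == self.goal
        return {"position": self.position}, done


def play(env, max_actions=100):
    """A naive agent: act, observe the response, repeat until done or budget runs out."""
    actions_taken = 0
    done = False
    while not done and actions_taken < max_actions:
        observation, done = env.step(random.choice(Environment.ACTIONS))
        actions_taken += 1
    return actions_taken, done


if __name__ == "__main__":
    print(play(Environment()))  # e.g. (17, True) -- actions used, goal reached
```

A random agent like this will eventually stumble onto the goal; the interesting question the benchmark poses is how few actions a system needs once it starts inferring the rules from feedback.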
I tried a bunch of those games. They're actually fun.
Yeah. They're cool. And much like ARC-AGI-1 and ARC-AGI-2, we're testing humans on every single v3 game. So we will recruit members of the general public, so accountants, Uber drivers, you know, that type of thing. We'll put 10 people in front of each game.
And if a game does not pass a minimum solvability threshold with regular humans, then we're gonna exclude it. Now, again, I just have to emphasize that that's in contrast to other benchmarks, where you try to go for harder and harder questions. But the fact that ARC-AGI-3 will be out there and regular people can do it, but AI cannot, tells you, well, there's something missing still. There's something clearly missing that we need new ideas and new research for.
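As a rough illustration of that gate, the sketch below keeps only games that enough of the recruited testers solve. The function name, the 50% threshold, and the sample data are assumptions for illustration, not ARC Prize's actual criteria.

```python
# Hedged sketch of a human-solvability gate: put N people in front of each
# candidate game and drop games that too few of them manage to solve.

def passes_solvability_gate(human_results, min_solve_rate=0.5):
    """human_results: list of booleans, one per tester (solved or not)."""
    solve_rate = sum(human_results) / len(human_results)
    return solve_rate >= min_solve_rate


# Example: 10 testers per game, as mentioned in the conversation.
candidate_games = {
    "game_a": [True] * 8 + [False] * 2,  # 80% solve rate -> kept
    "game_b": [True] * 2 + [False] * 8,  # 20% solve rate -> excluded
}
kept = {name for name, results in candidate_games.items()
        if passes_solvability_gate(results)}
print(kept)  # {'game_a'}
```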
So there's this big theme in terms of measuring intelligence with human capabilities.
Yes.
So there's this growing idea that accuracy is not the only metric that matters to models.
Yes.
But also the time and the amount of data that it takes to acquire new skills, which is what the whole spirit of ARC-AGI is.
Yes.
So, I guess the question is how close are we to evaluating models in human time?
Yes. So, with regards to human time, we actually see time as a little bit arbitrary. Because if you throw more compute at something, you're gonna reduce the time no matter what. So it's almost just a decision on how much compute you want, which is how much time it's gonna take, which tells you that wall clock may not be the important part of what we call intelligence here. But there's two other factors that go into the equation of intelligence. Number one is gonna be the amount of training data that you need, which is exactly what you said.
And then number two is actually the amount of energy that you need in order to execute upon that intelligence. And the reason why those are so fascinating is because we have benchmarks for humans on both of those. So we know how many data points a human needs in order to execute a task, and we know how much energy the human brain consumes to execute a task. So with ARC-AGI-3, the way we're actually gonna be measuring is efficiency, not just accuracy. I told you they're video games, and they're turn-based video games, so you might click up, left, right, down, or something like that.
And we're gonna count the number of actions that it takes a human to beat the game, and we're gonna compare that to the number of actions that it takes an AI to beat the game. So back in the old Atari days, in 2016, when they were making a run at video games, they would use brute-force solutions, and they would need millions and billions of video-game frames, and they would need millions of actions to basically spam and brute-force the space. We're not gonna let you do that on ARC-AGI-3. And so we're basically gonna normalize AI performance to the average human performance that we see.
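One way to picture that normalization is the small sketch below: score the AI relative to the average number of actions humans needed, so brute-force action spam is penalized. The formula, cap, and numbers are illustrative assumptions, not ARC Prize's published metric.

```python
# Hypothetical action-efficiency normalization against a human baseline.

def efficiency_score(ai_actions: int, human_action_counts: list[int]) -> float:
    """Return 1.0 when the AI matches the average human action count,
    and proportionally less when it needs more actions; capped at 1.0
    so needing fewer actions than humans can't inflate the score."""
    human_avg = sum(human_action_counts) / len(human_action_counts)
    return min(1.0, human_avg / ai_actions)


# Example: humans beat the game in ~40 actions on average; the AI needed 400.
print(efficiency_score(400, [35, 42, 43]))  # 0.1
```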
That's very cool. Yes. My last question.
Yes.
Let's wave a magic wand. And then there's a super amazing team that suddenly tomorrow launches a model that scores 100%
Yeah.
On the ARC-AGI benchmarks. What should the world update about its priors on what AGI is? Yeah. How would the world change?
Well, it's funny you ask that. The what-AGI-is question is such a deep topic that we could go much deeper on. But from the beginning, François has always said that the thing that solves ARC-AGI is necessary for AGI, but it's not sufficient. So what that means is, the thing that solves ARC-AGI-1 and -2 will not be AGI, but it will be an authoritative source of generalization. Now our claim for v3 is that, no,
the thing that beats it won't be AGI. However, it will be the most authoritative evidence that we have to date of a system that can generalize. If a team were to come out and beat it tomorrow, we would, of course, wanna analyze that system and figure out where the failure points still are. And like any good benchmark creator, we wanna continue to guide the world towards what we believe to be proper AGI. But ultimately, at ARC Prize, we wanna put ourselves in a position where we can fully understand and be ready to declare when we do actually have AGI.
So if that team were to do it tomorrow, we'd wanna have a conversation with them. I'll put it that way.
That's a good way to wrap. Awesome. Thank you so much for coming and chatting with us, Greg.
Thank you, Diana.