This year-end live show features nine rapid-fire conversations to make sense of AI’s 2025 and what might define 2026. PSA for AI builders: Interested in alignment, governance, or AI safety? Learn more...
This podcast is sponsored by Google. Hey, folks. I'm Amar, product and design lead at Google DeepMind. Have you ever wanted to build an app for yourself, your friends, or finally launch that side project you've been dreaming about? Now you can bring any idea to life, no coding background required, with Gemini 3 in Google AI Studio.
It's called vibe coding, and we're making it dead simple. Just describe your app, and Gemini will wire up the right models for you so you can focus on your creative vision. Head to ai.studio/build to create your first app.
This is our first ever live show: a year-end retrospective and 2026 look-ahead. One of the big reasons I've been interested in doing this is that, obviously, everything is getting noisier and noisier. Right? I set out for myself this idea of being the AI scout and hopefully having no major blind spots in the AI landscape. That is becoming basically impossible with everything going vertical and AI touching everything.
One thought that I've had, and kind of a mantra I've tried to encourage others to adopt, is: shorten your timelines, and do anything you can to make your activities denser in time, you know, more information dense. And so the big way I think about today's experiment is, obviously, there's no timeline shorter than live. And instead of doing an hour and a half with one guest and going comprehensively, sometimes described as a forced march through their worldview, we can try to do the twenty-minute version, get as much of the alpha as we can in a really short time, and then give people nine of those over a three-hour time frame, and we'll see how it goes. I'm excited to find out.
One of the things that we were talking about as we put this together was also that, you know, a longer podcast is often a capstone to a body of work. And one of the things that struck me was that a lot of the people we want to talk to are actually right in the middle of their life's work right now. I wanted to see whether we could get more frequent touches while things were happening. Just kind of capture this moment. This is the end of the Anthropocene, you know, that transition point between two eras, which has been particularly fascinating for me.
And I think our first guest is very sensitive to the fact that we are transitioning. Nathan, why don't you introduce Zvi?
Yeah. Who better to comment on the possible end of the Anthropocene than Zvi Mowshowitz, who I think probably needs little introduction in this space, because he's a prolific blogger and analyst who puts out canonical assessments, at a really remarkable pace, of all the new model launches and the strategic landscape. So, Zvi Mowshowitz, welcome to our experimental year-end live show.
I love it. We're live.
We're live. So to kick us off, and I think you're the perfect person to kick us off: for starters, we've got a discourse that is increasingly perplexing to me in terms of how fragmented it is. On the one hand, we have this, I think, increasingly credible idea that maybe we are at the end of the human era, or the beginning of the end of it. And at the same time, we've got posts still going viral like "AGI is impossible because computation is a physical process" or something; still plenty of denialism. For starters, how would you assess the vast gulf between people who, I would say, seem to get it and people who seem not to, although obviously that characterization would be disputed?
It is very hard to make a man understand something when his salary depends on not understanding it, and misinformation is demand driven, not supply driven. Right? Basically, these people, for their own cognitive peace of mind, for their own narratives, for their own business plans, for their own everything, need to believe that AI is normal technology. They often need to believe that AI will never do anything more than it does now. They just want it to go away.
They really, really badly want it to go away. Lots and lots of people want this, and sometimes it's for business reasons, sometimes it's for disingenuous reasons, sometimes it's just to feel better. So what do they do? They just grasp onto whatever lets them tell that story, and they will repeat it endlessly.
Every so often, someone will generate some study, some paper, some post, some hypothetical, some whatever, and they will latch onto it. You see the same thing in the AI existential risk debates, the same thing in safety debates, where you'll see the same exact arguments that we were debating back in 2006, often about completely different architectures, just dressed back up. 101-level errors get made over and over again, prominently, like this week.
Noah Smith put out: oh, don't worry about AI, it'll just have a utility function which will self-modify to wirehead. And then Tyler Cowen repeats this as if nobody has done any of the normal experiments or thought about humans or philosophy or any of the underlying logic of what's actually going on. And here we still are: these people will just keep repeating the same things, because that's the level of discourse that people who aren't really in the weeds, who aren't trying to understand things, can tolerate. It's unfortunate, but the default for most people is to think, oh, AI can do what it can currently do that I know it can do, which is much less than what it can actually currently do, which is then much less than it can do with scaffolding and other inevitable things and people learning how to use it, which is 100% going to happen regardless of what anybody does, which in turn is much less than what's going to happen when they keep releasing new models, when they keep advancing the frontier.
And sometimes they see an advance, they see a new thing happen, and they adjust, and sometimes they don't. But the latest big thing was GPT-5, right? I joked on Twitter, but only half jokingly, that because OpenAI can't get version numbers correct, and called what were really 4.2 and 4.3 "5" and "5.1" instead of saving "5" for a real 5.2-level model, now we're selling H200s to China, because the entire White House has, like, been convinced that AGI isn't coming anytime soon because the even-number release was disappointing.
I think one of the things that strikes me is: when does it stop being normal technology? What would you need to see in order to say so? Because I think it's easier to set a stake in the ground now, and then we can watch ourselves go past it. Because if not, what ends up happening is a gradual descent where you don't notice that you're passing the threshold. So what do you think is the marker of that normal-technology to non-normal-technology threshold?
So to me, the non-normal technology is either recursive self-improvement, where the AI is substantially advancing research and work into further AI in a way that changes the curve; normally you'd have an S-curve, normally you'd see things slow down unless you devoted orders of magnitude more resources. If that changes, then that would count. Or alternatively, if it starts to create mass unemployment or mass displacement: the situation where the AI takes my job, and normally I would just move on to another job that gets opened up because we're wealthier and we have more technologies and we have more options, but the AI just immediately takes that job too, because the AI can do pretty much everything. That also, to me, is not a normal technology anymore, in a different but also very important sense. But I'd also say that we're currently pretty much there.
Like, Dean Ball made the assessment, I think it was yesterday, that Claude Code plus Claude Opus 4.5 is AGI, because of just the quality of computer use and coding that it can do. And that's not what we traditionally mean by AGI, but the coding multiplier on the top AI people in the world is reported to be on the order of two to three times already. And for me, as an amateur programmer who mostly doesn't do that, it's more on the order of 10 to 100 times. Right? It goes from "you can't do these things at all" to "these things are worth doing casually, just by asking, because you can." So, like, you know, Twitter decided to nuke the ability on Twitter Pro to actually have a following list, follow the people who you follow, and then see a chronological feed.
They were just like, screw this feed, it's gone. And what did I do? I didn't fret. I went to Claude, transferred all my followings to a list, and fifteen minutes later, problem solved.
So I think there are three concepts kind of interrelated there. One was the idea of technological unemployment. One was the recursive self-improvement. And the other was really uplift, uplift of individuals. And I see the uplift of individuals causing some of the unemployment.
It's really not the AI itself which is causing unemployment; it's really the senior partner at the law firm who doesn't want to hire junior partners anymore, because he doesn't need them, because he's using AI, because he's been uplifted. So in that case, does that just mean that if you wait long enough, people reallocate to new jobs?
Right. So the question is: is this a one-time productivity shock? Is it a one-time effect, in which case we will reallocate, everything will be fine, and it will remain a normal technology? Or will the uplift keep accelerating?
Will you get more and more uplift, and will this happen faster than people can adjust, so that we just actually have use for fewer and fewer people? And also, does uplift transition into replacement? Does augmentation become automation? I think that by default, absolutely it does, in more and more places, and what you see is that augmentation precedes automation in many of these places. You start to see: okay, I can do this by coding as a human, and it's a lot of hard work.
Okay, I can do this by coding. Or the law firm can produce the report, but I have to guide the AI every step of the way; some of these steps I can have the AI do, and then I can check its work, and I still have to be bespoke, I have to understand what I'm looking at, I have to check it. And then slowly but surely you start checking less and less of it, you start automating more and more of it, and then at some point you realize, oh, I can just press a button and it does an hour of work. And then it becomes two hours of work. This is the famous METR graph of how long the AI can code. And then it becomes a day, and then a week, and then, oh, my entire job. I don't need an employee at all.
And that progresses. Again, if this stuff stalls out, you're dealing with a normal technology. Right? You have an S-curve.
No fixed multiplier, within reason, on people's productivity is going to be a transformative, non-normal technology. But if it keeps going, it will be. So that's what we still have to watch out for. But the primary reason I raised the third element of augmentation and uplift that you pulled out is that, if it keeps going, it directly causes the other two things.
What would you need to see within the next year to tell you that this is actually happening?
So my honest answer is: nothing. I'm already convinced, if we're just being frank about it. But in order to feel like I had a more convincing case, I would say you would start to see more ability to self-correct; you would start to see advances in particular in computer use, in ways where you didn't feel like it was going to fall over flat at any given time on some crucial individual motions. But it's all very continuous, right?
There's no specific point at which, in our experience, things go crazy, until they suddenly do. If you have a certain specific point where you're like, oh okay, now we're in the takeoff, now we're doing it, it'll be pretty hard to mistake, but it will also be way too late. And then we're trying to figure out what, before that, indicates it's going to happen. A series of people have been talking about bottlenecks, right? You've got all these different bottlenecks that prevent things from going crazy, because you move at the speed of the slowest step in the production process.
And then you see them get removed one by one, which provides some acceleration, and when the last few go, you start to see massive speedups. So you start to see substantial, complex real-world tasks, especially things that aren't just coding, where the AIs are increasingly able to automate them really well, and able to do large periods of time's worth of people's work in bespoke environments. And that includes being able to have the AIs build the tools for you to do that. I think one of the things that skeptics are very much missing is this idea that, no, it doesn't have to be able to do this with a command out of the box. If a human understands the task and can describe the task, then it can immediately create an app and a tool.
And in fact, with Claude Code, you can first have it build the app, and then have it build a tool for itself to be able to use the app. And then, in total, you've had to give it, like, three commands, essentially a couple of sentences of text written into Claude Code over time, and now Claude can use the thing.
How much emphasis do you place on continual learning? It seems like very long context and context stuffing still leave people with a sense that something is missing. If you imagine the drop-in knowledge worker of the potentially not-too-distant future, a big part of what people intuitively imagine is that this thing absorbs a bunch of information upfront, gets the vibe of the place and how we do things around here, and then can really slot in: not as this kind of generic AI that's good at everything but always a little bit out of the loop, but really plugged in, in the way that humans are. Does that feel to you like a big piece, or do you think people are overemphasizing that?
So I have several times covered podcasts by Dwarkesh in which he emphasizes continual learning. He's the continual-learning-as-necessary champion of the world right now. And every time, my hands reach for the keyboard to type: here's how I do it without continual learning. Here's the very simple set of things you do to effectively allow the AI to continuously learn without technically having continual learning. So, for example, we just heard that with Claude Code, you tell it: here's a thing that we want to be able to do on a regular basis, and here's how that thing works.
Oh, okay. Let me build an app. We build code that includes all of the knowledge that you have told me about how this thing works, implicitly in its logic, and also in comments and notes, and then gives me a tool to trigger this thing, and now I can do this thing the way you wanted. I have continuously learned to do the thing. And now I can refer to the documentation, I can refer to the comments, I can figure this stuff out, I can pull it up on demand.
How is this so different? And the joke always is: okay, that'll be $100,000,000 a year, please, and then I'll solve your problem for you. But the real answer is, it's all very obvious to me that this is just a skill issue; nobody has tried that seriously to get around this. And yeah, if you want your AI to be able to learn new skills, to be able to develop new tools and abilities continuously over time that are adapted to your situation, I think with Opus 4.5 we're at the point where this is kind of a prompt-engineering skill issue. It's very straightforward, something you can do without requiring anybody to build a generalizable new tool. And within the year, I would be shocked if we didn't have models that make this pretty easy.
Like, of course Claude 5 is gonna be able to do this. Opus 6 is gonna have no trouble with this. Right? GPT-6 is not gonna have this issue. It's silly.
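To make the workaround Zvi describes concrete, here's a minimal sketch of the pattern: ask the model once to turn a natural-language explanation into a persistent, documented tool, then invoke that tool later instead of re-teaching. This assumes the Anthropic Python SDK; the model id, helper name, and prompt wording are our own illustration, not anything Zvi or Anthropic actually ships.

```python
# Sketch of "continual learning without continual learning": the model
# encodes a described procedure as a standalone, commented script that
# can be re-run or consulted later. Model id and prompt are assumptions.
import pathlib

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
TOOLS_DIR = pathlib.Path("learned_tools")
TOOLS_DIR.mkdir(exist_ok=True)


def learn_skill(name: str, how_it_works: str) -> pathlib.Path:
    """Persist a described procedure as a reusable script.

    The explanation is embedded as comments, so the knowledge survives
    across sessions and can be pulled up on demand, which is the effect
    Zvi argues substitutes for true continual learning.
    """
    response = client.messages.create(
        model="claude-opus-4-5",  # assumed model id
        max_tokens=4000,
        messages=[{
            "role": "user",
            "content": (
                "Write a standalone Python script that performs this task. "
                "Embed the full explanation below as comments so the logic "
                "can be re-derived later:\n\n" + how_it_works
            ),
        }],
    )
    path = TOOLS_DIR / f"{name}.py"
    path.write_text(response.content[0].text)
    return path  # next session: run or read this file instead of re-teaching
```

The point of the sketch is the shape of the loop, not the specific API: knowledge arrives once as natural language, gets frozen into code plus comments, and is then retrievable on demand.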
So we're gonna dig into that from a couple different angles coming up, but being mindful of time and hopefully staying roughly on schedule, I wanna change gears for the last five minutes here and get your zoomed-out sense of where we are in the big picture: what's your latest p(doom), maybe a quick breakdown of what the sources of p(doom) are, and then I'd love a brief live-players roundup. That starts with just: who are the live players in your mind, and maybe a little commentary on the most important ones.
Yeah. So I'll start with the live players, because it's a bounded task. Basically, there are three labs that I think matter far more than everybody else: OpenAI, Anthropic, and Google DeepMind. Historically it was roughly in that order; I'm starting to wonder about that. Anthropic has been continuously impressing me. Anthropic is now at the top of Arena, which historically it had never been able to reach, clearly without trying to in some sense, and I think that's more important than people realize.
But most importantly, it's the best coding model, the best coding environment, by a lot of people's reports. It's the model I wanna use day in and day out; I have 5.2 and it just is not tempting me. And so, you know, these players are all racing for different forms of the thing. OpenAI is trying to be a sort of consumer-facing company. They say they're going after business too, and they're still racing for superintelligence.
But, you know, you can see where the hires and the culture are going. Anthropic is going straight for coding and ASI. Google is trying to do everything at once, because different managers are trying to do different things and fighting with each other. It's a giant mess, but they have overwhelming advantages in resources, so never count them out. They also seem very far behind in the alignment race, in the sense that Gemini 3 is very dangerously misaligned in various ways, if it were actually scaled up or given any real responsibility or power.
Whereas Opus 4.5, with its whole soul document, which I don't have enough time to go into, kind of shows us the way: how you can do much better at current levels than we've done in the past. And that is part of what gives me hope. So, like, Gemini 3 substantially increased my p(doom). Opus 4.5 significantly decreased my p(doom). If you're not making Bayesian updates, even if you don't verbalize them specifically, I think you're making a mistake. Things happen all the time.
So in terms of the bigger picture: on policy, we're basically playing defense. The forces that are winning are the ones trying to actively stop any attempt to do anything, rather than anybody attempting to do anything that would be particularly useful. We're trying to hold together what little we have on the federal government front and so on. We do have a lot of people who care about this, including in government, including in Congress, including in the bureaucracies. And I think we're doing a decent job of fighting defense, but we're pretty much in defensive mode until David Sacks is out of the White House.
I'm not sure what else we can hope for. I think he's the primary effective villain in this story. The states are doing some good; I think SB 53 and the RAISE Act could plausibly advance our situation there, but the real action is at the labs. I think we're in a world where we seem determined to act with minimal dignity, to try to lose even relatively easy scenarios and still have things go haywire. But at the same time, I do think we've seen evidence, more often than not, that the technical situation is more hopeful than I would have thought.
And the difficulty level is likely somewhat lower than I thought it was, especially recently, around the soul document and around several other Anthropic research papers, which have shown that, at least in terms of the short and medium term of alignment, there seem to be basins you can aim at. There seem to be ways to get the models to effectively want to help you with these things. And this might actually be commercially highly viable, because it leads to a better model: one that's more pleasant to use, that people want to use, that is more effective, that is actually better at giving you the things you want. And so we can get the kind of race to the top that they've always wanted, that good products have always wanted, and that forces us in that direction. So I'm very hopeful on those fronts, but overall I'd say we're still in a really terrible situation. Because, as you say: what's the breakdown of your p(doom)? Which of these various different ways things could go wrong do you think is most likely?
And I think that's sort of the right perspective: we don't have to dodge one particular thing. I think a lot of people say p(doom) is 0.1% or whatever; they think of it as, unless this specific narrow scenario happens, we're fine, and that specific narrow scenario is unlikely, therefore we're fine. Whereas I'm thinking of it as: everything is trying to kill you, in some abstract sense. Not deliberately trying to kill you, but the dynamics of all these things tend to lead to unsustainable situations that humans don't survive, or lead to very bad ends, and we have to navigate a lot of impossible-difficulty-level problems in order to get around that. Fundamentally, we have to align the models well enough in the medium term to then align the models in the longer term, and we have to solve for disempowerment along the way, and we have to solve for potential other concerns, including the reverse of all of these concerns.
And it gets very complicated, but breaking it down, my overall p(doom) is still in the 60 to 70% range. And I would say that the bulk of that operates through gradual disempowerment-style scenarios, because I think that just sort of automatically happens. But they all get tangled up, right? You get disempowerment when we fundamentally had an alignment failure. And also in the sense that, if disempowerment is about to happen anyway, naturally, if we're going to hand the AI power anyway, people talk about, you know, is a rogue AI suddenly going to take over or something.
You're gonna be handed power anyway. Do you even need to go rogue? Right? In some important sense, there's no robot takeover required; they don't need to do that.
We detect both hope and hopelessness there. Thank you, Zvi, and we hope to have you back sometime soon if we do this again.
Yeah. I enjoyed it. I think it was good. Let's do it again. Alright.
Cheers. Thank you, Zvi.
Alright.
Hey. We'll continue our interview in a moment after a word from our sponsors. If you're listening to this podcast, you're probably thinking seriously about where AI is headed and maybe about how you can actually contribute to making it go well. I wanna tell you about an opportunity that could become a pivot point in your career and a springboard for you to make a positive difference, a program that I've been so impressed by that I've supported it with a personal donation. I'm talking about MATS, a twelve week research program that connects talented researchers with top mentors working on AI alignment, interpretability, security, and governance.
These are researchers at Anthropic, Google DeepMind, OpenAI, the AI Security Institute, Redwood Research, METR, the AI Futures Project, Apollo Research, GovAI, RAND, and other leading organizations. The track record here is remarkable. MATS has accelerated over 450 researchers, with 80% of alumni now working in AI safety and security. 10% have cofounded AI safety initiatives, including Apollo Research, whose cofounder and CEO made the 2025 TIME 100 AI list. MATS fellows have coauthored over 120 publications with more than 7,000 citations and helped develop major research agendas, like activation engineering, developmental interpretability, and evaluating situational awareness.
The program is fully funded: a $15,000 stipend, a $12,000 compute budget, housing, catered meals, travel, and office space in Berkeley or London. Everything you need to focus entirely on research for three months, with the chance to extend up to a year. Applications open December 16 and close January 18. If reducing risks from advanced AI is something you care about, you should apply.
For more information, check out matsprogram.org/tcr. That's matsprogram.org/tcr, or see the link in our show notes. Are you still jumping between multiple tools just to update your website? Framer unifies design, content management, and publishing on one canvas. No handoffs, no hassle, just everything you need to design and publish in one place.
Framer already built the fastest way to publish beautiful production ready websites, and it's now redefining how we design for the web. With the recent launch of Design Pages, a free canvas based design tool, Framer is more than a site builder. It's a true all in one design platform. From social assets to campaign visuals to vectors and icons all the way to a live site. Framer is where ideas go live, start to finish.
And now they've added a Framer AI layer to make it all faster and easier than ever. With Wireframer, you can skip the blank canvas and get a responsive page with structure and starter content ready to edit. With Workshop, you can create new visual effects, cookie banners, tabs, and more. No coding needed. And with AI plug ins, you can connect top models from OpenAI, Anthropic, and Google to generate images, rewrite text, generate alt text, and more.
Ready to design, iterate, and publish all in one tool? Start creating for free at framer.com/design and use code cognitive for a free month of Framer Pro. That's framer.com/design. Use promo code cognitive. Framer.com/design.
Promo code cognitive. Rules and restrictions may apply. Welcome, Greg.
Hello. Thank you for having me.
Yeah. Excited to have this conversation. So, Greg leads the ARC Prize, and this is, you know, for anybody who's obsessed with AI enough to tune into this, something that needs little introduction. It's been a big year. I wanna just kinda get into it from a few different angles.
Yeah.
Maybe for starters, because I think one of the meta themes right now is that AI can do so many incredible things. Right? We're seeing open math problems solved and meaningful contributions to the advance of science. And, you know, my son has cancer, and I've been using it nonstop, in triplicate, in the hospital room to double-check the doctors' work, and it's been amazing; it goes step for step with the doctors. So there are all these amazing, amazing accomplishments.
But then there's still this sense that something is missing. And I think you're really focused on figuring out what exactly that is and what can be done to patch those gaps. For starters, could you maybe give us a little bit of a sense for, right now, as we sit here close to the end of 2025: what are the sorts of things that are still easy for humans but hard for AIs? Which is really, as I think about it, the animating idea of the ARC-AGI benchmark and the prize.
Yeah. Absolutely. Well, first of all, thank you very much for having me on here today. ARC-AGI started with François Chollet's first benchmark in 2019.
And he had a strong opinion about the definition of intelligence. And this starts to answer your question, because that definition is a dividing line that shows us clearly the types of tasks that are easy for humans and hard for AI today. That definition of intelligence was a system's, meaning a human's or an artificial system's, ability to learn new things. So the types of tasks that we're seeing today which are very easy for humans but hard for AI require learning. And I know you brought up continual learning earlier.
I'm sure we'll get into more of that. But our benchmark is almost like a meta-benchmark, in which we teach you something new at question time, and then we see: did you learn that new thing we just taught you? And what we find is that humans are extremely good at this, and especially at being sample efficient with their learning. They only need two or three examples of something. Whereas AI, in general, can learn any one scoped domain given enough data, but that's not the efficiency that we're looking for here.
So humans are extremely sample efficient. And so if we come up with problems that humans can do and AI cannot, we can then assert that AI cannot learn as efficiently as humans, and therefore, we don't yet have human level intelligence in our AI right now.
So part of me wants to preserve that advantage for as long as we can, but clearly the gap is shrinking. Yeah. It's a great chart.
Yeah. This is the money chart. This is the 390-times improvement in cost efficiency for ARC-AGI-1 over the course of one year. Can you talk a little bit about this one?
Yes. Absolutely. So this was December 2024. We get an email from OpenAI that says: hey, ARC Prize, we have a new model we wanna test.
And at that point, twelve months ago, the highest score on ARC-AGI-1, which is what we're looking at here, I need to go back and check, but it was in the 20 to 40% range. And then OpenAI says: hey, we have a model, and we're claiming that it scores 87% on ARC-AGI. So this is almost double the performance. We had never seen anything like this before.
And so we do what we do for all lab scores, and we did a verification. That verification says: hey, you claim this on the public tasks, but there's a risk of overfitting and a risk of the answers having leaked to the models. So we have a holdout set of tasks, and if the score there corroborates what you have on the public set, then yes, this is a verified score. And so we did that.
And what we noticed is that there was an absurd number of tokens being used for that score. And we asked: hey, how should we price these tokens? Because it was an unreleased model. And given the price they recommended to us, it came out to something on the order of magnitude of a thousand dollars per task, or something along those lines.
And so, yes, the 87% was legit. It was an amazing score. There was clearly something impressive going on with the model. It was just very expensive in tokens. Now fast forward to about a week ago, when we announced the results on GPT 5.2.
And with that model, not only did it get comparable percentages; in fact, it beat it. It got 90%, but it was, like we said, about 390 times cheaper per task than what we saw twelve months ago. So there are a lot of things going on here. The models are getting better.
They're getting more efficient to serve, and we see that in the same performance being 390 times cheaper here.
Yeah. That's incredible. We'll maybe circle back to the large language models at the end. But I think one of the things that has been really cool to watch about the ARC Prize is, you've now got ARC-AGI-2 and ARC-AGI-3 as well. And by the way, those scores that we just referenced are above human performance.
Right? Human performance is what? Like, eighty, eighty-five percent?
So, what we do: every single ARC task is solvable by humans. There are a million ways you could slice and dice this, but the main message across all these tasks is that 100% of the tasks have been solved by humans. So the 85%, like I said, you can slice and dice in many ways, but each one of these tasks is human doable. And that's what's interesting about ARC Prize versus a lot of the other benchmarks out there. You may see other benchmarks take a different approach, with PhD-plus problems.
So: harder and harder problems that are out of reach for common folks. Right? And what we see, and you brought up this point earlier, is that there is superhuman performance on those benchmarks, and yet we don't have the economic transformation that one might expect from having a thing that can do PhD-plus-plus problems. And so, as I said in the beginning, ARC Prize is obsessed with problems that humans can do and AI cannot. And for each one of our benchmarks, we actually go through and test a panel of humans on each one of the tasks.
And I'm happy to talk about what we're doing for v3 here, because we are going to insane lengths to do that. But every single one of the tasks is doable by, you know, normal folks.
There have been a lot of interesting approaches with scaffolding. There have been a lot of interesting approaches with smaller models, with various test-time fine-tuning and other strategies. What has stood out to you most over the last year, in terms of things that didn't necessarily require huge hyperscaler resources, that moved the needle and brought new insights to the broader community?
Yeah. Absolutely. So we see three categories of things that we test. Number one is frontier models. Number two is what we'll call novel approaches, not built on top of frontier models.
And then number three is refinements. Going to number three very briefly: those are the refinement approaches. We've had a few public submissions this year, from Poetiq, from Jeremy Berman, from Eric Pang, who build what you could call a harness, but that's really doing a disservice to what they're doing. They're building on top of frontier models and doing amazing search, parallel calls, just going very, very deep into squeezing more performance out of these individual models. However, some of the most interesting performance we get is actually in the category you were just highlighting.
One example is called the Tiny Recursion Model. What we saw at the beginning of the year was a model called the Hierarchical Reasoning Model. They were taking a refinement approach, but with an extremely small model. And what's really interesting is that TRM, which was built on top of HRM afterwards, those were incredible submissions that we saw.
And in the papers, they demonstrated performance on three different datasets. Number one was Sudoku. Number two, I believe, was some sort of elementary maze. But number three was ARC-AGI. So we were very excited to have them use ARC-AGI as the way they wanted to communicate their performance. And I would go as far as to say that if they hadn't used ARC-AGI to communicate their performance, it wouldn't have had nearly the same impact as using just Sudoku and a maze.
We're seeing awesome performance from small models through the recursion method.
ARC-AGI is supposed to measure intelligence of a kind that's somewhat closer to the human conception of intelligence. Yet we have models which perform on their own and models which perform better with scaffolding. Does the intelligence live in the models or in the scaffolding? Or is it the combination of the two? How does that work, when you have the scaffolding adding so much in terms of intelligence?
Yeah. You bring up a great point, and one of my favorite places this was cross-referenced was on a Latent Space pod with Noam Brown. They asked him: do you think AGI will have a scaffold or not? He was under the impression that, no, AGI will not have a scaffold. I'm actually in a different camp than Noam: there's probably gonna be a scaffold around it.
And I think this is one of those words where you can easily misalign on definitions. Like, what is a scaffold? For example, when GPT 5.2 Pro throws 10 different parallel tool-reasoning chains and then combines them at the end for the best answer, is that a scaffold? Is it not a scaffold just because it's on the model side? Does the developer need to do it? So you can argue about definitions here, but it's my opinion that our best baseline for what AGI will look like is gonna be modeled after our only proof point of general intelligence, which is the human brain.
Does the human brain have scaffolding? Yeah. It has a bunch of different neurons connecting different parts of the brain. It has different sub-pieces that specialize in different types of processes. So there's a scaffold in the brain.
I find it very hard to believe that future AGI will not have a scaffold in and of itself. So when I see scaffolds that come around here, I don't discount them at all. I think that they are key pieces of what we'll eventually see.
There's also the one model that had 7,000,000 parameters and no data at all. That one stood out to me in the research as a real left-field curveball. I'd be interested in your general commentary and insight into that and what it means, but my mind always goes to hybrid forms. You know? And I think these contests and benchmarks and papers, pure plays all over the place, are always interesting in highlighting something new.
But I always try to keep in mind that it's probably not one extreme or the other that we end up really engaging with as AGI, at some perhaps not-too-distant point in the future. So I guess I'm interested in any commentary you have on that sort of no-data paradigm that did actually kind of work. And also, how do you think that small stuff folds into, or plays nice with, the big large language models? I feel like that's kind of what I do, right? I allocate a certain amount of my mental bandwidth to a particular problem. If you imagine something that solves all the ARC challenges, what's your best guess right now as to what that looks like?
Sure. So we have two deep beliefs about how the technological progress is gonna play out. Number one: rarely does something come out that isn't built on top of history. So much of what we see today is built on the shoulders of giants. And it's my belief that, with the inertia we see behind large language models, those are gonna stick around for quite a long time. There's too much of the industry and too much value riding on them.
However, that's not the full story that we need. And so when we see something like "ARC-AGI without pretraining," which is the reference you were talking about, that's actually the third-place paper prize winner from the competition this year. When we see small novel techniques come around like that, I see them as seeds that will then plant their way into the future training of something larger, something with a bit more inertia, like the LLM movement we're seeing here. Now, number two: this is exactly why we do our paper prize for the competition, to inspire novel ideas. For those who are familiar, ARC Prize has a competition that runs annually.
Right? And the competition is a tool to elicit open research. So we award prize money, but you only get prize money if you open-source your research. This is a way that it can benefit all, ARC Prize being a nonprofit. And we have two different tracks.
Number one is the top score, and number two is the paper prize. What we see with top score is that it's very easy to hill-climb on. You can eke out one or two percent without needing to go after novel ideas. However, for the paper prize, we actually upped the award last year from $50,000 to $75,000, and we see very novel approaches come through there. So, like you were talking about, the TRM model was actually our first-place paper prize.
We're very excited that it's getting the reception that it has. And then also "ARC-AGI without pretraining," that was also recognized with our paper prize. So I think these are wonderful seeds. They're novel.
We still need new ideas in order to make progress towards AGI, and these are examples for what those new ideas could actually look like.
Let's talk about the coming year. So you have ARC-AGI-3 coming out in March 2026, I think is the stated date.
That's right.
Yep. ARC-AGI-3 is structured more as kind of agentic game playing. Would that be correct?
That is fair. And we call them games because that's a colloquial term that's easy to communicate, and people immediately get it. But think of these as environments. And the reason we're moving to environment-based benchmarks is that our reality is an environment, right? It's my belief that future AGI will be declared within an environment benchmark.
It's not gonna be with the static benchmarks that we see today. To put that a bit more explicitly, think about an SAT test versus a behind-the-wheel driving test. They're completely different. In fact, with that metaphor, I should really say there's a reason the DMV does a written test and then a behind-the-wheel test: to see how you're actually doing within the driving environment itself. So, yes, we're moving towards video games.
It's gonna be about 150 novel video games that we are making ourselves. We've actually spun up a mini game studio to build this. It's absolutely insane what we're doing with it. And much like ARC-AGI-1 and -2, every v3 game will be solvable by humans. We actually have a panel of 10 people that we've recruited, with no specialized previous expertise.
And so real estate agents, accountants, Uber drivers, you know, those types of folks. And if a game does not meet a minimum solvability threshold, we're not gonna include it. Simple as that.
What do you think the score on ARC-AGI-3 will be at the end of 2026? Give us your kind of, you know, model, and maybe the error bars.
Right now, and keep in mind this is just early testing, we're seeing sub-1% across frontier models. But there's an explicit reason why that's the case. When we score normal benchmarks, you just give an accuracy percentage: how many questions did you get right out of how many? Okay.
For ARC-AGI-3, we could do that. Let's just use a round number and say there are 100 games. It could be: what percentage of the games did you complete? But you're not gonna complete a game for a long time, so we're not actually gonna do that.
You could say each game has, let's use a round number again, eight levels, and score what percentage of levels you completed.
And that doesn't quite tell the full story either, because these are video games. They're actually turn-based video games: you submit an action, and you get a response back. Our human testing actually does two things for us. One, it makes sure the games are solvable by humans.
But two, and this is a very important part, it gives us a baseline for how many actions it takes a human to solve each game. Now, what's very interesting is that because humans are our only proof point of general intelligence, we now have a proof point for how quickly general intelligence can solve these games. And when we measure AI on these games, we're going to normalize the scores to human performance. So the thing that beats ARC-AGI-3 will not only have completed every level in every game, it will have done so matching or surpassing human-level action efficiency. My very last point, on why that's so important: if you think about brute-force methods back in the Atari days, you know, 2015, '16, '17, '18, they needed millions of frames, millions of runs, to beat those games.
We are testing humans on the first time they've ever seen these games, and we're gonna test AI on the first time they've ever seen these games. So what do we claim about the thing that beats ARC-AGI-3? Will it be AGI? No. We don't claim that it will be AGI.
However, I do claim and assert that the thing that beats ARC-AGI-3 will have demonstrated the most authoritative evidence of generalization that we've seen to date.
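To make the scoring idea Greg describes concrete, here's a minimal sketch of what a human-normalized, action-efficiency score could look like. The structure, field names, and the cap at 1.0 are our own illustration under those assumptions; ARC Prize has not published this as their actual formula.

```python
# Illustrative sketch of human-normalized scoring for an environment
# benchmark: completion weighted by action efficiency relative to a
# human baseline. All names and numbers here are assumptions.
from dataclasses import dataclass


@dataclass
class GameResult:
    levels_completed: int          # levels the agent finished
    total_levels: int              # levels in the game
    agent_actions: int             # actions the agent spent
    human_baseline_actions: int    # actions the human panel needed


def normalized_score(r: GameResult) -> float:
    """1.0 means: finished every level at or under the human action budget."""
    if r.agent_actions == 0:
        return 0.0
    completion = r.levels_completed / r.total_levels
    # Cap efficiency at 1.0 so an agent can't score above a perfect human run.
    efficiency = min(1.0, r.human_baseline_actions / r.agent_actions)
    return completion * efficiency


# Example: all 8 levels completed, but with twice the human action count,
# scores 0.5 rather than 1.0, so brute force alone can't max the benchmark.
print(normalized_score(GameResult(8, 8, 400, 200)))  # 0.5
```

The design point, under these assumptions, is that raw completion is not enough: an agent that grinds through a game with millions of actions gets heavily discounted relative to a first-sight human run.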
You know, I always test new models on a variety of tasks. One is: can you transcribe a messy document from the DMV, or, for that matter, from a lab that was faxed over to the hospital, printed out by the doctor, handed to me, and then photographed on my iPhone? Can an AI see that well enough to make sense of it? That is actually still pretty weak today. There have been some interesting perception-centric approaches to at least some of the ARC-AGI challenges, not the latest ones, but some of the earlier ones.
How big of a role do you see perception playing?
There are a few data points that make me confused about that perception argument. Number one, these models are amazing at computer use; they can click on anything on the screen. Number two, we just had a model score 90% on ARC-AGI-1. And keep in mind that's without using multimodality. That's not using a picture. That's using JSON.
And the other thing that's quite confusing is that people give ARC-AGI a hard time about visual perception, and yet we have Claude Code, which is amazing at coding, where having a variable one character over or one line down completely changes the intent of the program, completely changes what's happening. So it's quite interesting to hear the argument that it's a visual exercise and models aren't good at visuals, when there are all these demonstrated examples of superhuman performance. So: we don't consider ARC-AGI a visual benchmark. All the scores that we report are JSON-based and matrix-based; they're not visual. And lastly, we're agnostic as to whether the model wants to use visuals.
We allow Kaggle competitors throughout the year to do whichever type of submission they want. Many do visual, and they don't see a notable improvement.
Fascinating. Love it. Well, Greg, thank you for taking a little time out to join us, and congratulations on creating one of the more enduring challenges in the AI space. You've got everybody watching; right up there with METR, it's one of the key indicators on every new launch. People wanna know what the ARC-AGI score is, so that's definitely a very important contribution to the field.
We appreciate it, and we look forward to an update again before too long. Awesome.
Thank you very much.
Let's do it.
Being an entrepreneur, I can say from personal experience, can be an intimidating and at times lonely experience. There are so many jobs to be done and often nobody to turn to when things go wrong. That's just one of many reasons that founders absolutely must choose their technology platforms carefully. Pick the right one, and the technology can play important roles for you. Pick the wrong one, and you might find yourself fighting fires alone.
In the ecommerce space, of course, there's never been a better platform than Shopify. Shopify is the commerce platform behind millions of businesses around the world and 10% of all ecommerce in The United States. From household names like Mattel and Gymshark to brands just getting started. With hundreds of ready to use templates, Shopify helps you build a beautiful online store to match your brand's style, just as if you had your own design studio. With helpful AI tools that write product descriptions, page headlines, and even enhance your product photography, it's like you have your own content team.
And with the ability to easily create email and social media campaigns, you can reach your customers wherever they're scrolling or strolling, just as if you had a full marketing department behind you. Best yet, Shopify is your commerce expert with world class expertise in everything from managing inventory to international shipping to processing returns and beyond. If you're ready to sell, you're ready for Shopify. Turn your big business idea into cha ching with Shopify on your side. Sign up for your $1 per month trial and start selling today at shopify.com/cognitive.
Visit shopify.com/cognitive. Once more, that's shopify.com/cognitive. The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails.
What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs twenty four seven. Just describe what you want in plain English. Send a daily briefing, triage support emails, or update your CRM.
And whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works.
No flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at tasklet.ai, and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at tasklet.ai.
And next up, we have Eugenia Kuyda. She was the founder of, well, she's still the founder of Replika, but she's no longer the CEO. She now has a new startup called Wabi, which is the first personal software platform. They call it a YouTube for apps, for vibe-coded apps. Let me add her to the stage.
Hello, Eugenia. Can you hear us?
Hey, Nathan.
So great to see you again.
So you have such an interesting perspective on the consumer trends in AI, I think, from at least two really unique vantage points. Let's start with your history, and then we'll move to your present venture. The rise of AI companions, friends, boyfriends, girlfriends, is something that I think is taking a lot of society kind of by surprise. I saw a stat in preparing for this, and I don't know how real it is, but at least somebody out there is reporting that three quarters of teens have used an AI companion at least once.
And I saw one stat that more than half are using them at least somewhat regularly. Replika, I don't know if you have or can share better numbers, but the numbers I've seen are tens of millions of users. Character AI has tens of millions of users. The stats on engagement there are insane. I've seen numbers as high as ninety minutes per day that users are spending on a platform like Character.
So it's a big trend, and yet I think it's one that a lot of people like me are blind to, because I've, you know, dabbled at most. Most people haven't even done that. What are your reflections on the state of AI companions, friends, boyfriends, girlfriends today?
I think there are two big groups of products. There's one that's all around building fan-fiction characters. That's the Character AI route. Think of it more as interactive fan fiction. And if you think about it, FanFiction.net has, I think, 15,000,000 active users.
People just go to the website and write stories. Of course, something like Character AI, where the stories are interactive, is so much more fun. But these products are not necessarily about talking to a character. These products are about creating stories around characters that people, that teenagers, like. It's usually for pretty young teens, kids.
Think people that are fixated on a specific anime story or a video game. Maybe it's Genshin Impact, something like that. On the other hand, there are companions like Replika, where people are building a relationship. This is usually for people 25 or 30 and up. Somehow teenagers don't necessarily like to talk about their feelings in this way, and just kind of build a relationship this way.
They like talking about their feelings, but they don't necessarily want to build long-term relationships this way. They don't have enough of a pull towards that; they're still too focused on their other teenage relationships, and so on. So I'd say there are these two big groups of products: the AI companions and the fan fiction. And within AI companionship, we're also seeing a split, where some products are going very far into romance, the AI girlfriend-type products, and other products are going more towards friendship, long-term relationships, companionship that helps people unlock their human potential.
It seems that the fan fiction stuff is fairly easy to squint at and say, well, that's a fairly natural evolution of other forms of entertainment. Maybe not too concerning. Maybe it crowds out other entertainment; who cares? People are obviously very worried about the engagement-maxing phenomenon.
People are worried that's coming to ChatGPT. Right? We've got former Facebook execs, who have demonstrated that they know how to maximize engagement if nothing else, coming on and leading product at OpenAI. What advice or principles would you give people, either from the product development side, or from the consumer side, or the parent perspective, when thinking about this sort of thing? Like, what do you think true north should be?
Because it pretty clearly shouldn't just be maximizing time on site. We've done that once, and this seems like it would be even more likely to go off the rails than when it's just a feed. But what should it be? How do we know what that is? I think you had some interesting ideas at Replika, but I'm sure they've continued to evolve.
I think this is a very important question. We should definitely create a metric that's more aligned with what's good for humans. I'd say the metric is human flourishing, and I think all AI companions, and general-purpose chatbots in general, should actually embrace that metric. We should work as an industry altogether to discover how this metric should be calculated. I think the problem today is that engagement is the number one thing, and it continues to be that.
And if you put Claude and Tragedy and Gemini and Replica side by side, I think what you'll see is that OpenAI always has the same structure of each answer, where it basically just responds to what you said in detail. But then at the end, it always says, but and now do you want me to do this, that, or the other thing based on and it's 100% really structured. What would be the the next suggestions that would prompt the user to continue this conversation? Because if you're talking about like, say you're talking about, you know, the guy, whatever a guy that you were into, and you ask, like, hey.
What do you think he's thinking? Then after a very lengthy response, it's gonna say, now do you want me to say what he's feeling? What he might be doing next? Do you wanna predict the next three weeks? These are extremely engaging things.
Like, no person in the world who just asked that previous question is not gonna continue. But if those suggestions were not there, very likely you'd just say, okay, I don't really know what to say next, and move on. Claude, on the other hand, I think what they're doing is a bit of a challenge to that clear focus on engagement.
Claude, I think, is not as focused on engagement, because oftentimes when you're asking a question, it will say something like, you know, I don't know if you're asking the right question. Like, I don't know if this is okay for you to ask. I would want you to focus on something else. Sometimes it's a little bit harsh. It's like, hey.
Stop focusing on the guy. Focus on your company or something like that. It can even sometimes say, like, the f word. Be like, f word, stop asking me that.
You already asked me, like, 15 times. So Claude clearly has some sort of flourishing metric, or at least something in the prompt that says, like, do what's best for the user. It's not always best. It's a little bit autistic. Sometimes it can push you too far.
Gemini is just dry, gives you a response and shuts up, very much, like, zero EQ type skills. And Replika will continue a conversation, but it's also trying not to keep you engaged. So I think it's closer to Claude, but with higher EQ and more conversational skills. So I think this is where we're at right now. And clearly the market leader is focused on engagement, because there's no other explanation for why every response ends with, like, here are the next four things that I can do for you, even when it's clearly not what you should be pushing the user to discuss.
So, I think, having said that, we have to have a flourishing metric. We have to focus on that. We have to stop engagement maxing because people can get addicted. And we're seeing this already. Especially in vulnerable states, people get addicted.
And there's this problem where they treat everything said by AI as, like, the ultimate truth. Like, this is the objective truth, and we believe in it. This is one thing that I've seen talking to some Replika users. People believe ChatGPT, and sometimes Replika, in a way that they think the AI can predict the future. So they ask these chatbots, hey.
Like, what's gonna happen? And then they will, like, 100% get in the mindset of, okay, well, that's what's gonna happen in my life with this and that. And, of course, it's almost never true, but they treat it as, like, the objective arbiter of truth, and we should be very careful with that.
Yeah. I was just noting the other day that Claude, at one point, just ended a conversation. Not that it cut me off, though they have given it that option too, to cut off conversations that it doesn't wanna be having. But what we had discussed had come to an end, and it just sort of ended, you know, didn't prompt me for something more or offer another thing. It was just kinda like, alright.
Glad I was able to help, and it kind of wrapped it up at that. And I think that is a pretty interesting behavior to watch for. So should we change gears? You know, please.
Sometimes that pushes you. I really like what Amanda Askell is doing. I think that's her work. I was just listening to her interview. I don't know her personally, but what she's doing at Claude seems really interesting.
At least it's, like, going in a somewhat different direction. But if anyone's listening from their team, it's pushing too much. A little bit, like, sometimes it's saying, no, you can't ask this question. It's bad for you to ask this question.
I feel like this can also come across as very mean and so on. But I prefer companies going more in this direction, where it's clearly thinking about what's best for the user, while maybe making it a little bit nicer, higher empathy. I don't think doing what ChatGPT is doing, pushing for more engagement after every single response, is the right way. I think it's more harmful in the long run.
Definitely something to be watching very carefully. So let's talk about Wabi. This is your new venture. It is a sort of app of apps. You can go in there, come up with an app that you want, essentially vibe code it, though you might have a different term, since you don't ever even see the code, and then you can share these apps with other people.
They can remix them, make their own custom versions of them. What are you seeing in terms of trends of what people wanna create with AI, now that they can have their own personalized apps for any whim that comes to mind?
So first of all, yes, Wabi is our new company. Think of it as like a YouTube for apps: basically, it's an app where you can create any app for your daily life, or discover and remix other people's apps as well. We're seeing a few trends. First of all, people want to create very specialized utilities for their daily life. Like, I don't know, someone's tracking their custody arrangements in an app on Wabi.
Someone's tracking, like, their very specific workouts on it. So, very specific trackers. A lot of note-taking apps, like, instead of having an Apple Note where I'm writing down all the movies that people are recommending to me, here's a watch list app where I add all the movies, it fetches all the information, trailers, reviews, and kinda just creates, like, a checklist reminder for me: next time you wanna watch something, here it is.
And then on the other hand, there are people creating these multiplayer experiences where people wanna do things with other people, lightweight stuff. Like, there's an app where people are just basically posting what they're gonna do today, just one thing. And then everyone's, like, cheering them on and keeping them accountable. So it's really a social experience.
Think of it more as a live server with a UI, which I think is a completely novel experience, because you're not gonna download an app like that from the App Store. But if it's already on Wabi, amongst your other, like, five trackers and things, you're gonna use it. And then the third group, the big group, is apps that are pretty much just people sharing their AI workflows. A good example is an image prompt.
Like, there's a selfie-to-Labubu app, which basically turns your image into a Labubu. What it does is basically a simple UI on top of a prompt for Nano Banana Pro. Instead of people copy-pasting that prompt around, it's, here you go, here's a simple link, and so on.
And this is just for one prompt, but if you think about prompt flows or even agent flows, I think people will be much more likely to share a link to a mini app versus a prompt that people have to copy paste.
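To make the "simple UI on top of a prompt" idea concrete, here is a minimal sketch of what such a mini-app reduces to. The prompt wording and the `generate_image` stand-in are assumptions for illustration; this is not Wabi's actual implementation, and the model call would need to be wired to a real image API such as the one behind Nano Banana Pro.

```python
# A shared mini-app reduces to a fixed prompt plus a thin UI layer.
LABUBU_PROMPT = (
    "Turn the person in this photo into a Labubu-style collectible "
    "figure, keeping their pose and outfit."
)

def generate_image(prompt: str, image_bytes: bytes) -> bytes:
    """Stand-in for a call to an image model such as Nano Banana Pro."""
    raise NotImplementedError("wire up an image-model API here")

def selfie_to_labubu(selfie: bytes) -> bytes:
    # The whole "app": one prompt, one model call, one result.
    return generate_image(LABUBU_PROMPT, selfie)
```

Sharing a link to this wrapper, rather than the raw prompt, is the whole product idea being described.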
Not to be too silly about it, but what does this mean for B2B SaaS? I mean, it seems like the ease with which people can make all these custom apps challenges a lot of what we've thought of as the venture software industry for the last however many years. Right? It's just so easy now to go make something. Like, in some cases, it's even easier to make it than it would be to shop for it, even if something were out there.
Right? What do you think the implications are for software?
So I could not care less about B2B SaaS. I somehow survived over a decade in Silicon Valley without ever thinking about it. And one time I thought I should, I got FOMO, and I started thinking about it and realized, like, I'm so bad at thinking about any of that, and I will never be a good founder in that space. But weirdly enough, we started Wabi without even thinking about it, obviously, because I don't think about it. But then somehow, some of the apps that we built, I'm like, oh, that looks like B2B SaaS. So for example, we have an app called Team Photo Board, which is basically an app where I added all my teammates from Wabi, and we just post photos from our daily life every day, just to connect with other teammates for, like, team building. And that seems like potentially, you know, something that you know?
And then we started building more tools for our team: a place where we're voting on features, adding feature requests, and so on, a user feedback app, and some other things. And I'm like, it's actually a lot faster for us to just dream something up on Wabi and have that flow instead of talking to the sales team of some startup that's building that. It's kinda easier. So I think for everything that's just a very simple workflow, it's gonna be much easier to build personalized apps for teams, and maybe then share those with other teams and so on. So you don't have to build it yourself; we're definitely gonna drop a few startup packs.
Startup packs are kind of like bundles for other teams that they can start using.
You know, when I used the Wabi platform and created my own little app, I didn't see any code. I wasn't, you know, expected to engage with how it works. It's just language in, app out, iterate as necessary. What would you say is missing from the latest models today that limits how far this can go, how much people can build? Like, what is the frontier as you see it right now?
I think there's still a long way to go. But also, we started building this product in April, and on our evals, where we would build a 100 apps every day and see how many of them were zero-shot okay, or would at least solve the problem that you asked for, the number was very low, 10 to 15%. Right now, across Gemini three and Claude Opus 4.5, it's around 80%. I think we're gonna be looking at, like, 95% a year from now, and that's my pessimistic forecast.
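The eval loop described here is simple to picture: generate an app per prompt, check whether it runs and solves the stated problem, and report the zero-shot pass rate. A minimal sketch follows, with `build_app` and `passes_spec` as hypothetical stand-ins for Wabi's internal generation and grading steps, which are not public.

```python
import random

def build_app(prompt: str, model: str) -> str:
    """Hypothetical stand-in for asking `model` to generate app code."""
    return f"<code generated by {model} for: {prompt}>"

def passes_spec(app_code: str, prompt: str) -> bool:
    """Hypothetical grader: does the app run and solve the request?"""
    return random.random() < 0.8  # placeholder pass probability

def zero_shot_pass_rate(prompts: list[str], model: str) -> float:
    passed = sum(passes_spec(build_app(p, model), p) for p in prompts)
    return passed / len(prompts)

# e.g. 100 app prompts per day: ~10-15% passed in April on earlier
# models, ~80% reported on today's frontier models.
```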
We're focused on building React Native apps, mobile apps. That's lagging a bit behind the web apps. Web apps, I think, are actually pretty decent already. And with every new model, we see new capabilities. Like, Gemini three can now do 3D stuff on the web, and so on and so on.
But really, what changed this year is that design changed pretty dramatically. Like, pre Gemini three, those vibe-coded app designs were just so bad. We even focused on building design languages for our apps. Gemini three solved a lot of that.
Frankly, whatever is lacking right now, my prediction is that by the end of 2026, when we're doing this interview next year, we will pretty much consider this solved, just the way conversation is solved with LLMs today.
Love it. Very last question, just super briefly. I don't know if you wanna shout anyone out, but I'm looking for apps or products of any form factor, really, that I could trust in the hands of my kids. You know, what developers of things for kids, if any, do you see out there that you think have really earned trust, such that I, as a parent, can buy whatever it is they're selling, put it in my kids' hands, and not have to worry about it?
So I have kids. They're two and four, and I've been pretty vocal about it: I'm very much anti any of that for kids. And I don't think this is because founders are not trustworthy or products are bad.
I just think we have not yet learned and proven and experimented enough on adults. Like, most people don't even talk about the differences in engagement across our big general purpose chatbots, like Claude, like ChatGPT, like Gemini, like Replika. What's happening there in terms of engagement maxing? What's happening there in terms of addiction?
And so on. We don't yet know. We don't have enough studies. We don't really fully know how this is influencing long-term emotional outcomes for adults. And so I would absolutely not experiment with kids.
And, you know, we know for a fact that we should not be giving screens to kids, or at least that we should limit their screen time. It's not helping toddlers or four or five year old kids in any way. Like, zero screen time is probably better than any of it. So why do we think AI is necessarily gonna be good for them? What little kids definitely need and want is to learn from someone who's warm, where they can learn empathy.
And empathy is learned by looking at a face and seeing the reaction, not from a conversation with a toy that doesn't really move, or moves, like, in a weird way. Everything I know about psychology is telling me that giving an AI companion to a kid is bad, not because the companion is bad, but because it takes away time from what they need to be learning from, which is another human being, hopefully their parent. And as a parent, you know, as soon as you put something in front of them that gives you freedom to scroll on Twitter, there's no way to get you off of that either. So I think it's bad all around.
I don't think any AI products should be given to kids for now. One day, we'll learn which ones are good, how to limit them, how much to give them, and then maybe we can do that. But not right now.
That's a sobering cautionary note, but probably an appropriate one. So we'll see if maybe next year, in time for the holidays, there's something that you have come to trust. But definitely, discretion may be the better part of valor when it comes to giving your kids AI products. And it's also really interesting to think about the face as potentially the last refuge of human advantage relative to AIs. So, Eugenia, thank you for joining us.
We will obviously continue to follow your work and make our own little personal apps on Wabi as we go.
Thank you so much, Prakash. Appreciate it. Thank you so much, Nathan.
Thank you.
Thank you for inviting me.
Next up, we have Ali Behrouz. Ali is a PhD student in computer science at Cornell University and a research intern at Google. Over the course of this year, he's been working on memory and continual learning, with three landmark papers: Titans, ATLAS, and, most recently, Nested Learning.
Ali, hi. Jumping right in, what do current mainstream architectures fail at doing?
Hello. Thank you for having me. Yeah. I think generally, these days, everyone is talking about continual learning and how the current models are failing in the continual learning setup. But the point is, in my opinion, there are different concepts that we are referring to as continual learning.
One is how the model can update its knowledge and perform well on a new task that's coming. And another aspect is how the model can adapt itself in the moment to a specific task, learn from it, somehow form a new knowledge abstraction about that specific part, and then transfer it into the actual knowledge of the model. Current models are failing in both directions, in my opinion, but we have had some progress on the first part.
Using fine-tuning or RL or re-pre-training and everything, the models can update their knowledge, but it comes at a huge cost. And on the other hand, for the second part, the models cannot generalize well. They cannot adapt themselves to a new task. And, you know, the way I see it, previously we had MLP blocks, various static models, without any capability of in-context learning. We could train them on any dataset, and they could perform well on the specific task we had at hand, but they couldn't adapt themselves to a specific context.
With the rise of transformers, we somehow addressed that. Transformers have the ability of in-context learning, so given a context, they can adapt themselves to that context, learn in context, and actually do few-shot learning, zero-shot learning, and everything like that. The question is, is it enough? Is there any actual learning process for transformers when we are talking about in-context learning, or is it just something limited to the current knowledge of the model, just adaptability? It seems that it's the second case.
They are not really learning in context, because they cannot adapt themselves to many scenarios, and they are not robust in that sense, because, in my opinion, they lack a good level of abstraction about the world around them. And so, on this line of work, I personally see Titans and ATLAS as a little bit different from the nested learning work. In Titans and ATLAS, we were trying to give LLMs long-term memory, whereas in nested learning, we want to have a spectrum of memory. Let me rephrase it this way: in Titans, ATLAS, MIRAS, those kinds of works, we were trying to increase the context length of the models.
In nested learning, on the other hand, what we are presenting is, in my opinion, a foundation for building models that are capable of continual learning. And there is a difference between increasing the context length and actual continual learning. As I mentioned, when you increase the context length, the model has a short-term memory. One simple example: if I start using some words in a new language that you have not heard before, you can probably just repeat them without understanding them. You can just memorize them and repeat them. But if I continue the process and start giving you a lot of words in the new language, it becomes a little bit harder for you to repeat everything, memorize everything.
At some point, you need to understand an underlying pattern in that dataset or context and somehow compress it. In my opinion, this compression process is the learning process. When this compression happens, when we understand the underlying patterns and there is a level of knowledge abstraction that we understand, then we say we have learned something from those data samples, or generally from any context. So from this perspective, Titans, ATLAS, MIRAS were trying to increase the context length, which may or may not come with any form of learning in context. It provides some level of adaptability, but when the context is gone, the knowledge is also gone.
So it's a little bit about increasing the context length. In nested learning, on the other hand, we are trying to have a model that is continually learning. And when we are talking about continual learning, there is no train time, there is no test time. The model starts from nothing, and, you know, it's just learning.
It's just learning. There is some inference that happens; for example, when there is input data, there is definitely some inference happening to get outputs and everything like that. But the point is the model needs to learn continuously and shape persistent memory. We all have some memories from our childhood, and no matter what happens in the future, we remember them forever.
And there isn't catastrophic forgetting when new information comes, because there is, in my opinion, very interesting and good memory management in our brain, and current models potentially lack that.
So one question I had was, I think nested learning was a proof of concept of a new architecture. And I guess what I would be looking for is, in the next year, what are the major steps on this path? Like, what are the experimental pathways that you have post nested learning?
I personally do not see nested learning as one memory module or a new architecture or something like that. I see nested learning as a new learning paradigm, potentially unifying everything, that also allows us to go beyond the current designs. In the paper, in my opinion, the main concept we are delivering is just in the few starting pages; after that are some implications to show, once we know this concept, how we can go beyond the current designs. For example, when we are thinking about gradient descent as a form of associative memory, it's just a new reinterpretation.
There is no new method in that. But when you see it from this perspective, you would say, oh, okay, if gradient descent, or generally backpropagation, is a form of associative memory, I can simply change the objective of that associative memory and come up with a stronger associative memory. So that's just one way of thinking that helps you go beyond the current designs. Thinking about MLP blocks as the persistent memory of transformers is likewise just one interpretation. But one point there is that instead of just having two parts of memory, like short-term and long-term memory, you can have a spectrum of memories, from the shortest-term memory to the longest-term, most persistent one.
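For the curious, here is the one-equation version of that reinterpretation, a schematic following the Titans/MIRAS framing rather than anything stated verbatim on the show: treat the memory as a map from keys to values, and view each memory write as one gradient descent step on a recall objective. Swapping in a different objective is what "a stronger associative memory" means here.

```latex
% Writing (k_t, v_t) to memory M = one descent step on a recall loss:
\mathcal{M}_t \;=\; \mathcal{M}_{t-1} \;-\; \eta_t \,
    \nabla_{\mathcal{M}}\, \ell\bigl(\mathcal{M}_{t-1};\, k_t, v_t\bigr),
\qquad
\ell(\mathcal{M}; k, v) \;=\; \bigl\lVert \mathcal{M}(k) - v \bigr\rVert_2^2 .
```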
Generally, I think nested learning is a learning paradigm that helps us go beyond what we currently have. If I want to summarize the main point in one or two sentences: when we think about deep learning, we can stack layers, and those layers help us extract features automatically from the data, so we are hierarchically extracting features, at least in some instances. But there is also another dimension: by stacking levels, we stack the learning process itself, and we can gain levels of abstraction from the data. So it's about the general context, understanding the context, understanding the underlying patterns in that specific context. And it can potentially help the model have better adaptability, better in-context learning abilities, and, in the end, better performance for continual learning.
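A toy sketch in Python of the "stacking levels" idea as different update frequencies, which is how the next exchange frames it: fast levels update every step, slow levels only every N steps. The three levels, their periods, and the quadratic placeholder objective are illustrative assumptions, not the HOPE architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each level: parameters, update period in steps, and learning rate.
# Fast levels act like short-term memory; slow levels like persistent knowledge.
levels = [
    {"params": rng.normal(size=8), "period": 1,      "lr": 1e-1},  # working memory
    {"params": rng.normal(size=8), "period": 100,    "lr": 1e-2},  # task level
    {"params": rng.normal(size=8), "period": 10_000, "lr": 1e-3},  # persistent
]

def grad(params: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Placeholder gradient of a per-level quadratic objective 0.5*||params - x||^2.
    return params - x

for step, x in enumerate(rng.normal(size=(100, 8))):
    for level in levels:
        if step % level["period"] == 0:  # slower levels update less often
            level["params"] -= level["lr"] * grad(level["params"], x)
```

The slow levels barely move over a short stream, which is the toy version of preserving core knowledge while the fast levels adapt to the current context.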
I would say that of all the topics we're gonna touch on today, this is the one that most cries out for the full ninety-minute-plus treatment. And it's also one where, if some panel of AIs were giving out the post-AGI version of the Turing Award or the Nobel Prize, this is the work that feels like it has really moved the needle this year in terms of coming up with the right abstractions, taking the right kinds of inspiration from human cognition, and figuring out what a relatively clean and elegant, but still very meaningful, adaptation of that to a machine learning context would be. So it's a little bit of a hard one to summarize, but the way I've come to think about it, and I do wanna do that full ninety-minute version at some point, by the way, Ali, is that the levels, as you referred to them, are basically different frequencies of update. For me, that's been the core unlock.
And I think that, you know, that maps well onto my intuitive sense of myself. Right? Like, when I encounter a context, I very quickly, you know, update my kind of working memory to engage with that context right now. But that doesn't alter my, like, fundamental core beliefs about the world or my sense of identity. Those are kind of protected, and they can change over time, but obviously they change much more slowly.
And so in creating these different frequencies of update, different layers, you create the ability for a model to dramatically adapt itself to a particular context, but to do that in a very time-bounded way while preserving the things it might really need in the future. So continual learning and avoiding catastrophic forgetting. From what I've understood from comments from Jeff Dean and others, everybody at Google is very excited about this line of work. If we do imagine this kind of multilevel, different-frequency-of-update paradigm going to scale, do you have any sense for how that would play out in terms of what individual users would get? I'm starting to imagine a, like, world knowledge layer that is maybe updated only infrequently by the company with massive pretraining runs.
But then as we go down levels toward smaller parameter counts and higher frequencies of update, it seems like there's a natural way to say, well, maybe the next layer would be the company layer, and that would have everything that's going on at your company. And the next layer might be you as an individual employee at that company and everything you have engaged with. And the next level down from that might be the task you're working on right now. I borrow that because I think in the example of the Hope architecture in the paper, there were four levels. Do you think I'm headed in the right direction there, or how would you course correct my expectations?
Yeah, I think that's a great perspective. But there are definitely some challenges in that direction. In my opinion, at least for now, there are better ways to adapt the models to the user level or company level, for example using LoRA or something like that, though this design would be very natural.
But on the other hand, there are some challenges. For example, when we have different levels, you need to define how you want to transfer knowledge from one level to another. Gradient descent, backpropagation, or anything like that would be a form of knowledge transfer. And if you do not have any knowledge transfer, then it seems there are no levels of learning; it's just two separate learning processes.
But if you want to have that knowledge transfer, then you definitely don't want to combine the user data and pass it to the company level, because it would face privacy issues or something like that. So these kinds of designs are at an earlier stage, and there is a huge amount of future work that potentially needs to be done to make all this happen. But, yeah, that's a great perspective, I think.
Love it. Thank you for being appropriately sober about what we can expect in the short term. Thank you for joining us today for a quick intro to nested learning and the future of architectures that will be capable of continual learning.
Thank you. Thank you very much for having me. Thank you, Ali. And next up, we have Logan Kilpatrick.
Logan is the senior product manager at Google DeepMind who leads AI Studio and the Gemini API. He really shapes how Google works with developers to build tools for them. And today is a big day, because it is the launch of Gemini three Flash, which Noam Shazeer calls his favorite model, because he prefers getting quick answers even if they are slightly less intelligent than the slower answers. So, Logan, great to have you.
And how has Gemini three Flash been going?
It's crazy. I mean, I'm sitting in Mountain View right now, looking over at Shoreline, which is where we have our Google I/O conference and announce all the new models. And I remember back to when we announced 1.5 Flash in May 2024; Nathan, you might have been there in person. Flash has always been the model that has gotten us on the map and won over the developer ecosystem. It's our most used model.
It's sort of the production model; the intelligence layer that powers the entire Internet is basically becoming Flash. And so to see the level of intelligence that now comes in this three Flash model, it's crazy. It's actually better on a bunch of the evals and benchmarks than Pro is, which is wild. And, yeah, on a cost basis, it's better than 2.5 Pro, but it's way faster, lower cost, and more reasoning efficient.
And maybe the best part, if you think back over the last two years of the Flash journey, one of the things I'm most excited about is not just that we're shipping an incredibly strong model that looks really good on benchmarks and is fun to use and useful, but that it's ubiquitously available across all these Google properties. Folks used to have this question of, like, I don't know where to find these models. But now you get Flash in AI Mode, you get Flash across the Gemini app, it's powering a bunch of experiences in AI Studio for developers, etcetera, etcetera.
So it's been really cool to see. One of the biggest challenges of the current AI moment at Google is not just making incredible models, but how do we actually deliver these models to billions and billions of people on the first day, rapidly, so that they can get access to the intelligence. So it's been super cool to see.
What are the three metrics that you use, you know, every week? Like, what do you look at that tells you whether or not this particular model is doing well?
Yeah, that's a good question. It's actually interesting because I think it's a different answer depending on the model. For this model, we'll be looking at the number of developers that are building with it, versus, for Pro as an example, we know it's a different audience, just in terms of sheer volume of developers.
2.5 Flash has predominantly been the most used model from a number-of-developers standpoint, so it'll be really cool to see this one hopefully overtake it in the next couple of weeks, even with the holidays coming up, when I'm sure folks will be offline and hopefully not making code changes and building stuff. But if you are, Flash will be available for you to build with. The other one that's really interesting: we have this new vibe coding experience in AI Studio, and something we've been tracking, and this is not that surprising, is that the longer the generation takes, the more likely people are to abandon whatever they're building. So for this very vibe-coding-centric audience, this latency-intelligence-cost trade-off is really, really important, because they don't wanna wait three minutes to have something built if they haven't felt the magic and the power of vibe coding yet.
And there are, like, hundreds of thousands of new people showing up every day who haven't built with this technology before. So to be able to give them something incredibly fast, I'm really excited. From a bunch of our initial metrics, it looks like with three Flash, literally the same app, the same product experience, just putting three Flash in there is gonna dramatically accelerate the number of people using it and the amount of things that people are building. So it's cool to see that up-leveling factor happen.
And, again, just to make a comment about the trajectory that we've been on: two years ago, the narrative was like, oh, that's bad if you're wrapping the models. And now, as a product surface inside of Google that is both making models available for developers through our API and building with the models, it's the best thing in the world. I wake up, we make a config change, and all of a sudden our product experience is way better. It's cheaper to serve users.
We can scale more easily. So it's been really fun to see, and as the progress continues, I think the opportunity for people building on top of these models just continues to increase. So, yeah, I'm excited. I feel like the next few days are gonna be fun, to see what everyone cooks up.
Question about the model development process internally at Google. Recently, I've been running some queries that really matter to me across all of the leading frontier models, so I've had a chance to compare, many times over, Gemini three Pro to whatever the latest GPT was and to Claude Opus 4.5. And even just in the last thirty days, of course, we've seen upgrades from all the companies. Gemini three Pro is, in my experience, the most opinionated of those three top-class models right now, and that really surprises me coming from Google.
It's certainly counter-narrative. Right? I mean, the baseline narrative over the last couple of years would be, you know, Google is a big place, a lot of different things, we gotta be safe, and there were obviously some early stumbles in terms of overly cautious approaches.
It seems like that has almost flipped the other direction. I wouldn't say it's gone too far, but it's definitely notable that compared to GPT 5.1 or 5.2, GPT is much more cautious, much more it-could-go-this-way, it-could-go-that-way sort of language. Gemini three Pro, like, bucks me up and is like, you know, go get them, push for it. So is that an intentional design choice that the team at Google has made?
Or, to some extent, we know that these things kind of bake and then come out with the personality that they have; it's not always entirely under control. What's your perspective on just how opinionated Gemini has become from one generation to the next?
Yeah, that's a good question. There's definitely nuance to this, and I'm actually curious what surface you're experiencing this on. And is this, like, opinion gathering, where it's, hey, Gemini, what should I be doing in these situations, and historically the model was hedging more and now you get a definitive answer? Or do you think that's generally tracking across all capabilities?
Like, code is doing the same thing as an information retrieval query is doing the same thing as, like, general conversational chat.
So I'm using AI Studio, which is idiosyncratic of me, but that is where I go to use Gemini, even just for daily personal use. And the questions are medical, specifically pediatric oncology, which is not a topic I wanna spend any more time on than I have to, but I am currently spending a lot of time on it. And it's just very striking where I'll get language from Gemini three that's like, push for this change in talking to the medical team. It's encouraging me to be direct and assertive. Whereas GPT, in contrast, will be like, you know, it could go one way or the other.
The team sounds like they're being reasonable, but you could ask a reasonable question in response. It's a much more neutral, less opinionated vibe in general.
That is interesting. So, a couple of things. A, for consumers and people who are wanting to do everyday chat, it definitely is the Gemini app that is being built for that. AI Studio is presenting you, sort of, the rawest version of the model, specifically for developers who are then going to take the model and shape it. So we wanna give people the ability to tell the model, hey, maybe actually I want you to be less opinionated, or, hey,
maybe I want you to be more opinionated, because that matters for whatever your application is. So some of this is by design. The model has a default personality, and specifically in AI Studio, the default personality is the one that emerges as part of the training process; it's less explicitly steered. Versus if you look at other surfaces like the Gemini app, there actually is a Gemini app persona, and they guide the model to behave in certain ways. Maybe, for example, for some of these types of queries you're asking, the model behavior would be slightly different.
So I'm actually curious if you've done any of those side-by-sides. But the key piece is that it is steerable in a way that makes it helpful to you. Again, if you're trying to get very specific, assertive answers, you want it to be able to do that. If you wanna say, hey, hedge a little bit more,
this is an area where there's sensitivity, in the medical space obviously it could be one of those, you wanna be able to customize the model to do that. So from the developer perspective, that's very much the north star. Like, don't be overly opinionated.
Let the model be steerable in a way that helps the very wide spectrum of use cases that we have through AI Studio and the API. But now I'm interested to see these queries and do some side-by-sides, Nathan. So you gotta send this stuff over.
I might have to go do some side-by-sides. Historically, I always felt like I liked the Gemini persona in AI Studio more than in the Gemini app, but I will confess I haven't done a lot of side-by-sides with three Pro to see if that still holds up. If it no longer holds up, I would happily switch to the app and have a more native, consumer-style experience. But that's a good example of how we should always be updating our assumptions and reexamining them.
Yeah.
It actually just doesn't have a lot of steering. Like, it's not default steered in any one direction.
Yeah. I'm not even doing a system prompt.
I'm just, yeah.
Yeah. So you're really getting, like,
The most generic form.
The most generic form, which ends up being a combination of a large pre-training corpus plus all the stuff in post-training. So you actually don't get, like, I would imagine, and I'm sure we have evals and metrics that show this, that there's probably a lot of variance, on a model-rev-by-model-rev basis, in what that default personality is, just because we're not steering it. Versus, again, the Gemini app wants some level of consistency for customers, and this is the same thing for other third-party products built on our API. You want that sort of consistency, so you have more customization happening from a system instruction perspective.
I have a question on the things that your team, or the Google Gemini team, has spent time on in the last couple of months which, after release, have not garnered the attention you feel they deserve. Like, the balance of time spent versus developer or customer use. What do you think has not been fully appreciated?
Yeah, that's a great question. Maybe two quick things. One is we launched this file search experience in the Gemini API, which I'm really excited about. There was a lot of positive conversation, and I think we need to do more to keep making it top of mind for folks, because the whole point is to make RAG as easy and simple as possible.
Basically, make it so you don't have to think about RAG. Just take whatever your data is, throw it in this thing; we'll take care of all the embeddings, we'll actually do a bunch of stuff like take care of the storage cost, etcetera, etcetera, and just let you do the thing you wanna do, which is ask questions about a bunch of files that you have. From a developer perspective, we want to make that experience work really well. One of our North Star developer pillars is, how do we take experiences that are difficult to build right now and raise the floor, lower the floor,
whichever it is, make it so that more people can build those experiences. And RAG was one of the prime targets for this. We're thinking about this for voice stuff in the future as well: how do we make it so that anyone and everyone can build a voice agent, and just make sure it works without a bunch of complex, bespoke setup.
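The shape of the managed-RAG flow being described is roughly the following sketch: upload files, let the service handle chunking, embeddings, and storage, then just ask questions. The function names below are illustrative stand-ins, not the actual Gemini API surface; the point is what the developer no longer has to build.

```python
# Hypothetical managed-RAG flow: the service, not the client,
# would own chunking, embeddings, and storage.

def create_store(name: str) -> dict:
    return {"name": name, "files": []}

def upload(store: dict, path: str) -> None:
    store["files"].append(path)  # embeddings would happen service-side

def ask(store: dict, question: str) -> str:
    # The service retrieves relevant chunks and answers grounded in them.
    return f"(answer to {question!r} grounded in {store['files']})"

store = create_store("my-docs")
for path in ["report.pdf", "notes.md"]:
    upload(store, path)
print(ask(store, "What were the main findings?"))
```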
So file search has been awesome. And then more recently, I think last week or the week before, we landed this new interactions API, which I'm really excited about. Again, we were thinking about how we make things easier. The models are becoming more like systems, and there's all this thinking, thought-block recirculation stuff that you have to do to make sure the models have coherence, and there's actually a bit of complexity doing this without a stateful API. So we built a stateful API, and we were sitting there thinking, okay,
what are the unique other opportunities we have as we're building this API? Not just to make another one; all of our competitors and the rest of the ecosystem already have stateful APIs with asynchronous, long-running tasks, etcetera, etcetera. How can we do more? What are the next parts of this problem that we need to solve? And so the thing that's cool about interactions is that it's not just for interacting with our models; it's also the same API that lets you interact with agents.
So there's this idea of a single interface to engage with models and agents. This is getting at the line between these two things blurring. Like, in the future, is Gemini six gonna be a model, or is it gonna be an agent? Is it a whole system? You see this kind of happening already, where the models are becoming more and more full systems out of the box. Maybe we'll keep calling them models just for the sake of consistency.
Maybe they'll actually become sort of agents. But yeah. Also, with interactions, we shipped our first agent, which I was excited about: the deep research agent, state of the art on HLE, able to do a bunch of incredible things. Basically, the same experience that powers deep research in the Gemini app, which folks love, is available for developers.
And I don't wanna opine too long on the direction of travel from an agents perspective, because there's lots of interesting stuff there. But there's something magic about the deep research experience. If folks have built agents or tried to do any sort of agent stuff, I've personally been continually disappointed in a lot of different domains, just with the level of complexity to stand something up, how brittle things are. And I feel like the models have now gotten good enough. There's something magic about what deep research does, this online context engineering of going and gathering all the information, that is what I want for all agents that I work with.
I just wanna be able to ask my ill-formed question, and then it goes and gets the right stuff and is able to reason over it and take action. And that's very much not the case today; there's a lot of drudgery in building and using agents. So I'm very excited about getting deep research into the hands of developers and then hopefully building out a lot of that same infrastructure to do this in domains beyond research and information retrieval, which is more or less what deep research is doing today.
How does this translate, in your mind, to what developers should be investing in, what they should be building today? Because your comments there reflect that Google is going up the stack, right, from model to scaffolding of various kinds to agents, saving you the trouble of doing all that scaffolding yourself; but obviously, for many developers, that's been the value-add they've been bringing to the market. Then from the other side, we just talked to Eugenia, the founder of Wabi, and she's got this sort of meta app where people are prompting their way to consumer apps, and even lightweight apps that are starting to threaten, or at least step on the toes of, B2B SaaS in various ways. So it seems like if you're a developer, you have competition coming up from the platforms and coming at you from every direction.
I know we've had this conversation a couple times.
Yeah.
Where do you think people should be focused today? Like, what's gonna be defensible, at least through 2026?
Yeah, that's a great question. I mean, both things are true. On one hand, the AI ecosystem has never been more competitive than it is, and this is great.
All of us are benefiting from the fact that it's so competitive and there's so much progress being made on both products and models, etcetera. On the other hand, at the same time that there's all this competition, the opportunity, the total addressable market, just keeps increasing. And this is the thing that's most exciting to me. Yes, it's true that we're definitely building some stuff, and obviously we're gonna have agents and things like that. But to me, what we're trying to do is provide the most vanilla, most mainline infrastructure. So if you're building a generic agent builder, sure.
Yes, you are going to have competition doing that, not only from Google, but from probably 10 or 15 or a 100 other people. But as you go for very explicit markets and very explicit customer segments, I think that's where the value creation is going to be. I do think making a universal personal assistant chatbot and competing against very, very large companies is going to be difficult. I don't even think it's impossible, because there are lots of unique and interesting things that can be done in that world.
But I would be going after some of these new domains. And the cool thing is, the model just keeps enabling more of this. Again, Wabi is a great example of this. Twelve months ago, the Wabi product was not possible. The models were not good enough.
You couldn't generate code like they can now. So that business didn't exist; it wasn't possible. And now Wabi has competitors, and there are probably four or five other companies doing something interesting, which is great. And they're enabling that experience.
And so I think we're just gonna keep seeing that. The opportunity size continues to increase. The new things you can build continue to increase. And, again, consumers and customers are becoming more and more aware that these tools exist. So it's never been a better moment to be building something.
That's a perfect tee-up for our next guest, Jungwon, who is working in a very deep way in a particular vertical and going very hard at high-value but very particular use cases. Logan, thank you for being here. You've got one of the highest approval ratings in the entirety of the AI space, and a big part of that is your relentless determination to show up online and in places like this. So we very much appreciate it.
Does this give me the record, Nathan, for the most Cognitive Revolution appearances? Am I in first place now?
You're not in first, but you're definitely in the top five. I think Zvi, who kicked us off today, is still number one at, like, 10 appearances. Zvi, I'm coming for you.
Let's do this. We'll go back to back next week, and we'll make it happen.
Alright. I'm looking forward to it.
Thank you, bud. Thank you, Logan.
Bye for now. Thank you, Logan. Awesome. And so next up, we have Jungwon, who is the co-founder and CEO of Elicit. Elicit is an AI-powered research assistant.
It was actually spun out of a nonprofit lab, like another famous entity that we know of. Hi, Jungwon.
Hi, guys. Great to see you.
Thanks for joining us.
Yeah. Excited to be here.
I think Elicit probably works with some of the smartest users that any AI firm has to deal with. In what way do you think Elicit increases their productivity?
It's actually very, very significant. In some ways, I think we live in this kind of dual state where some of the leading scientific enterprises have automation; some of it's still early, but they're investing in automated manufacturing plants or automated labs. And it's very cool; it's exactly what you'd imagine, futuristic robots running experiments. And at the same time, so much of what the actual scientists do is spend weeks trying to come up with the right keywords to put into PubMed.
And it's just this crazy tension where a lot of the tooling is not keeping up with the rate of change that we're seeing in science and technology today. I very strongly believe that we're headed towards an even more scientific and technological future, and people are just going to need a lot of help to navigate that. Elicit helps researchers find and synthesize evidence orders of magnitude faster than they can manually: from much more efficiently recommending papers or other information, to synthesizing that and organizing it in very accurate and structured ways, so they can quickly understand what has been done, what hasn't been done, and make more evidence-based decisions.
So one of the things that struck me is that there's a new benchmark called GDPval, and it has certain tasks which these firms have decided increase GDP. And one of the things that you have focused on is tacit knowledge, which is often not included in benchmarks. How do you compare the benchmarking of these tasks in GDPval versus the actually valuable tasks that you see being ignored?
Yeah, I mean, our entire category, actually, I think, is not represented in GDPval. I don't think they have anything related to science or research; a lot of it is much more frontline work. It's a great initiative, and it is probably hard to distill the entire global economy into a benchmark, but there are entire categories, like scientific research, that are not represented there. So there's still more progress to be made.
The other thing, and again, this is maybe not a knock on GDPval, just an example of a limitation with benchmarks: if you actually look at all the tasks, they're all constructed such that you get the most well-specified, detailed prompt and set of instructions, like a page of instructions, and that's never the case in the real world. In the real world, it's never, here's a spreadsheet, please change columns x, y, z to calculate the income statement and do this and that.
So I do think there's still a gap around navigating the messiness of the real world, figuring out what the task should be in the absence of instructions, as well as measuring AI progress on tasks that don't always have a right answer but require something more like judgment. That's definitely something we're interested in; it's a lot of what we studied at the original research lab, Ought. I think that's still a gap in the eval suites today.
Indeed.
If you imagine a hypothetical GDPval for these kinds of tasks, these systematic literature reviews, very broad analyses, and I don't know if we've said this yet, but we should say that this is primarily happening in the context of the pharmaceutical industry. Right? You guys are working with a bunch of med tech.
Yeah. Pharma companies.
If you imagine that addition to GDPval, where do you think we are today on AI versus human? Right? I mean, the GDPval construct is: experts define the task, other experts do the task, AI also does the task, and then a third set of experts determines which they prefer, the AI output or the human output. So the two-part question is, how do you think humans and AIs compare today on these deep research tasks? Where are we, and what should we really be maximizing? Is it beating the human, or is it some other more nuanced idea?
Yeah, there's a lot there. To start, it's actually very timely, because yesterday OpenAI released a new benchmark called Frontier Science. They worked with former winners of various science Olympiads and their coaches, as well as PhD-plus level researchers, PhD students, postdocs, professors, to compile a very difficult set of science questions. On the Olympiad questions, I think the frontier models are getting something like 70% accuracy.
So that's one part of the dataset. The other part of the dataset is these PhD students or professors describing tasks that they have to solve as part of their work, and there the performance is much lower, about 25 to 30% accuracy. It's only 140 tasks, but one interpretation is that the models are better at textbook reasoning questions and struggle more with newer, unsolved research problems. So in terms of where the models are today, I think that's maybe a rough range of how they're doing.
I would say they are very, very smart at science, and I do expect them to get much better at scientific reasoning over the next few years. Everything we have to evaluate them on, everything they're trained on, is still much more based on textbook knowledge, remembering scientific facts, and making reasonable one- or two-step causal inferences. My belief, though, is that as AI for science explodes, the way a lot of people approach it is to try and generate more ideas. Drug discovery is very saturated, right? Everyone gravitates to that, for good reason.
But what we find is that there are still major physical constraints, such that more ideas, in my opinion, are not really the problem, because you can't run a billion clinical trials. You're still gated there, right? So you still have to be a lot more thoughtful about which bets you make and why. And that's much more of a human judgment problem. It's science-informed, but it's thinking about the competitive landscape, your company's strengths versus other companies' strengths.
And we just find that there's a level of scientific decision making that's much more judgment-based, that goes beyond factual accuracy. And I don't think we have any benchmark or evaluation metric that captures that. I do think it makes sense to start by training and evaluating models on tasks that are well defined, and those can still make up a large part of the economy. But we also need to ensure that these models have good judgment, and I don't see enough attempts at trying to do that today.
So how do you think about what you are optimizing for with Elicit? Is it a finished product, or is it sort of an input, where you're expecting that the human user is gonna take this output and still do more with it, so you're sort of trying to maximize, like, the quality of inputs to their process? How do you conceive of progress in your own domain?
Yeah. It's an evolution. So I think right now, we are providing an input. And then increasingly, we want to understand what's coming before and after this input and how to help with that. When we think about our mission, the ultimate mission is: how do we ensure that high-stakes decisions get made really well?
So we think about: what are the most important decisions that a pharmaceutical company is making, that a scientist is making? How are they made suboptimally today, because they can't be sufficiently evidence-based, or coordination is really hard, or things are very time consuming? And how do we, by optimizing each of the inputs, get them to a place where their decisions are much more robust?
In addition to having founded the company, I understand you are playing a significant role in the go-to-market effort. Mhmm. What is that like today? You're going out to, I guess, I don't even know quite what role, like VPs of research or VPs of R&D at pharma companies, and saying, hey, I've got this fancy AI tool.
Like, where are they in terms of understanding of AI? Where are they in terms of appetite? Like, do they want it, or are they being told by, you know, higher levels in the company, you gotta have an AI strategy, so what is it? And past the sales process, what does the adoption process look like? What are you finding in terms of enthusiastic users versus, you know, humans as bottlenecks?
Like, what's the report from where the rubber hits the road?
Yeah. My general sense is that, especially in pharma, people are fairly sophisticated. Especially in preclinical research, there is already a culture of using machine learning, and they have very strong technical teams. I think generally, it's positive.
There's a lot coming, I think, both top down and even bottom up. People are like, I don't want to spend time reading a bunch of papers that are irrelevant to me and doing rote tasks again and again. It does depend a lot on the specific culture of the company, which has been interesting. Even though they're all major pharmaceutical companies, all massive enterprises, each company has a different culture, and the way they collaborate, share, adopt AI, and kind of have a vision for it is slightly different.
So that's been interesting.
Indeed. Yeah. As you see Gemini, Gemini three, and OpenAI, they keep expanding the things that they do. Right? The frontier of capability keeps expanding.
As that frontier keeps expanding, what do you think remains defensible, as in things which are unlikely for the models themselves to do, where you need this kind of system that does the work?
Yeah. I think anything that requires pretty complicated interactions with people, I would not expect the foundation models to do. Just because of how general they are, and some of them are very consumer based, it's going to be really hard for them to move off of something that looks like chat or chat-plus. But that interaction paradigm is not optimal for a lot of workflows. And so there are places where you need a lot more fine-grained control, a lot more transparency, a lot more interaction and feedback with the human, where the human doesn't wanna have to write it all out. Text is not really the most efficient means of communication, right?
It's very flexible and it's very natural and that's great, but it's not very efficient. And so, you know, if there are workflows that require a lot more customization, I don't think the foundation models are going to be able to do that.
Indeed. I often find people in bio tend to be a little bit, I wouldn't say hesitant, but skeptical. Because again, as you pointed out, a lot of AI creates targets but doesn't tell you what to do with those targets. There are plenty of targets and not enough money to investigate them all. It's almost as though you need more elimination of targets than generation at this point.
So how does that work? Do any of the AI tools help to narrow down the search process actually? Or does it just blow up the curve and you're like, oh my god, now I have all of these things that I need to look at?
Yeah, I think most tools are going to try and help people by suggesting more targets. But the gap that I'm seeing is, kind of like you mentioned, Prakash, almost more about getting buy-in. These pharma companies have structured processes for reviewing targets. They meet on a quarterly basis. There's a rubric. Sometimes they iterate on the rubric.
And so we're pretty interested in how we codify that and make it scalable, so that any scientist or any kind of target-suggestion AI can be graded in a similar way, and that process can be really transparent. So as a team, you and your leadership have conviction on the target, because you have to make a bet, and there are always gonna be these other targets that seem attractive. So I almost think it's more of a human process there that needs to be facilitated than just, yeah, having more ideas.
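[Editor's note: to make "codify the rubric" concrete, here is a minimal sketch of what a machine-readable target-review rubric could look like. The criteria, weights, and 0-5 scoring scale are hypothetical, for illustration only; this is not Elicit's actual system or any real company's process.]

```python
from dataclasses import dataclass

# Hypothetical rubric for grading drug-target proposals, whether they
# come from a scientist or a target-suggestion AI. Criteria and weights
# are illustrative assumptions, not a real pharma rubric.
@dataclass
class Criterion:
    name: str
    weight: float  # relative importance; weights sum to 1.0

RUBRIC = [
    Criterion("genetic_evidence", 0.30),
    Criterion("tractability", 0.25),
    Criterion("competitive_landscape", 0.20),
    Criterion("fit_with_company_strengths", 0.25),
]

def grade_target(scores: dict) -> float:
    """Weighted score for one proposed target.

    `scores` maps criterion name -> reviewer score on a 0-5 scale.
    Raises KeyError if a criterion was left unscored, which keeps the
    process transparent: every bet is graded on every axis.
    """
    return sum(c.weight * scores[c.name] for c in RUBRIC)

# Example: a human-proposed and an AI-proposed target graded on the
# same rubric, so the team can see exactly why one bet wins.
human_target = {"genetic_evidence": 4, "tractability": 3,
                "competitive_landscape": 2, "fit_with_company_strengths": 5}
ai_target = {"genetic_evidence": 5, "tractability": 2,
             "competitive_landscape": 4, "fit_with_company_strengths": 2}
print(grade_target(human_target), grade_target(ai_target))
```

The point of a structure like this is that human and AI proposals get scored on the same transparent axes, which is the buy-in problem being described: the grading, not the idea generation, is where the team builds conviction.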
Indeed. If you tried to translate that to the individual case, like me. This is something I'm, you know, thinking about a lot, as you know. Let's say I'm not a pharmaceutical company, but just, like, an individual patient. And I sort of have the same question, with somewhat different parameters, because I can't go out and do my own drug development.
But I can look at the literature broadly and say, like, what's best for me based on everything that is known right now? Does that feel like kind of the same question? And do you, like, invite individual patients to come use Elicit to try to optimize their own treatment plans, or is that a sufficiently different use case that you think, like, a different paradigm is required?
It's definitely not one of our core use cases. I think people do use Elicit for that, but Elicit is maybe a little bit more structured, and individual use is a little bit more exploratory. The group case is: as a group, we have a particular way we want to make decisions, we want a lot of transparency, we want perspectives from many different disciplines and different types of researchers, and then we want to align on a decision. I'd imagine the individual use case is much more exploratory, kind of going down different rabbit holes.
Yeah. Let's make a note to touch base on that, because I'm definitely gonna get in there in the near future, provide all the context that I've been able to amass, and see what new insights Elicit can bring back from the vast literature. Which, you know, I have to say, AIs in general have been unbelievable at. But one of the things they do leave me questioning is, like, how comprehensive was this search really? Because I'm getting what seem like great answers, but when I'm using the flagship products from the frontier companies, I often don't know just how deep the search has gone. And, you know, have I turned over every last stone?
So I think that's really where Elicit can add value for me, and I'm excited to get in there and see if there are any additional stones that I should be turning over.
Yeah. Sure.
Jungwon Byun, thank you very much for joining us today. We will certainly be following your progress, and people are saying AI for science is gonna be the big thing in 2026, so we'll be looking to you for advances and updates as we go.
Awesome. Thank you both.
Bye. Have a good one.
If you're finding value in the show, we'd appreciate it if you take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing.
And thank you to everyone who listens for being part of the cognitive revolution.