From building Medal into a 12M-user game clipping platform with 3.8B highlight moments to turning down a reported $500M offer from OpenAI (https://www.theinformation.com/articles/openai-offered-pay-50...
Hi, listeners. As you may know, I recently wrapped up the AIE Code conference in New York. And while I'm traveling, I like to visit top AI startups in person to bring you interviews that you won't find on any other podcast that just does a Zoom call. General Intuition, or GI for short, is a spinout of a ten-year-old game clipping company called Medal, which has 12 million users. By comparison, Twitch only has 7 million monthly active streamers.
Medal collects this data by building the best retroactive clipping software in the world. In other words, you don't need to be consciously recording. You just have Medal on in the background while you're playing, and you hit a button to clip the last thirty seconds after something interesting happens. It's very similar to how Tesla does bug reporting for self-driving, if you have ever filed a self-driving bug report in a Tesla. The result is that Medal has accumulated 3.8 billion clips of the best moments and actions in games, resulting in one of the most unique and diverse datasets of peak human behavior, actively mined for the interesting moments.
They were also very prescient in navigating privacy and data collection concerns by mapping actions to these visual inputs and game outcomes. As you saw in our Fei-Fei Li and Justin Johnson episode with World Labs, and with the recent departure of Yann LeCun from Meta, there's a lot of interest in world models as the next frontier after LLMs, to improve on spatial intelligence and to work on embodied robotics use cases. DeepMind has been working on this with Genie 1, 2, and 3, and SIMA 1 and 2. And this year, OpenAI seems to finally agree: they have been betting heavily on LLMs, and they made news by reportedly offering $500 million for Medal's video game clip data. Our guest today, Pim, turned down that money and chose to build an independent world model lab instead.
Khosla Ventures led the $134 million seed round, which is Vinod Khosla's largest single seed bet since OpenAI. We were able to get an exclusive preview of GI's models, which unfortunately we cannot show you directly, but I can confirm they were incredibly human-like, and we chose to include the first eleven minutes of the demo discussion even though I couldn't show it to you. It may be hard to follow, but I tried to call out what was noteworthy, as your likely reaction if you were watching along with us. Now enjoy the world's first look at my first look at General Intuition.
So what I'm about to show you is a completely vision-based agent that's just seeing pixels and predicting actions the exact same way a human would. And so, yeah, what I'll show you here is what this looked like four months ago. So, again, this is just an agent that's receiving frames and predicting actions. So, you can see it has a decent sense of being able to navigate around. It tabs the scoreboard, just like gamers always tab the scoreboard.
So these are purely — this is pure imitation learning.
I see. This is slicing the knife.
Yeah, exactly. So it's doing everything that, like, humans would. In this case, here was the first interesting part that we saw. Like, it gets stuck, and then it has memory as well, so you see it can get unstuck.
How long is the memory?
Four seconds. Yeah, four seconds for this one. Okay, so that was four months ago; this was maybe a few weeks after that. So you can see it's still doing the scoreboard thing, but it's still quite — And these are bots too, you can see that
It's very human, let's just say that.
Yeah. Then, right, so this was really the early days of research where you can see, right,
it does one thing and then goes for another.
And then we've been scaling on data and compute, and we've also just been making the models better. And this is where we are now. So what you're seeing is pure, like I said, pure imitation learning. This is just a base model. There's no RL, no fine-tuning. This model sees no game state.
It's purely predicting the actions from the frames. That's it. And this is playing against real humans, just like a human would play. And it's also running completely in real time. So absolutely everything here plays exactly like a human.
Do you give it a goal? It just figures out its own goal, because obviously it's trained on the same thing. Yes.
And I picked — right, I picked a sequence where it also doesn't do well initially, so you can see this is just a random sequence.
But this is the — I mean, it looks like it's doing well.
So Oh, okay. Yeah. Watch. Yeah.
That's pretty good.
Maybe too good.
This is my favorite part. So you can see it does something that, like, here — a human would never do this — then gets unstuck, then realizes it needs to switch, and then gets one in the distance.
So you're saying, one, it makes a mistake that a human would never make — mhmm — but it unsticks itself.
Mhmm.
And two, what we just saw is it is doing superhuman things. Yeah. Okay.
Yeah. I mean, there are things like you said, obviously, but because it is trained on the highlights — all the exceptional things — it's inherent in those things. Yeah. So it's not like Move 37 where we RL'd our way into something.
It is, yeah, replicating superhuman — yeah, or like peak human.
The baseline of our data set is peak human performance.
Yes. Yeah.
Okay, so that's the agent. So now what I'm going to show you is: we're then able to take those action predictions and label any video on the internet using those actions. And so this is just frames in, actions out. Yellow is — or sorry, yellow is ground truth, purple is the model prediction. And then bottom left is compound error over the entire sequence.
And then this is reset per prediction.
Reset meaning every now and then you reset?
Yeah, so this just means it resets the baseline. So basically, a single error in the entire sequence compounds here, but it doesn't compound here, if that makes sense. Yeah. Again, this is just seeing frames, right? It's not seeing any of the actions.
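To make the compounding-versus-reset distinction concrete, here is a minimal sketch; the demo's actual metric isn't specified, so the function names and toy numbers are illustrative:

```python
import numpy as np

def per_step_error(pred, true):
    """Error of each prediction against ground truth, reset every step."""
    return np.abs(pred - true)

def compounding_error(pred, true):
    """Accumulated error over the whole sequence: one early mistake
    keeps contributing to every later step."""
    return np.cumsum(np.abs(pred - true))

true = np.array([0.0, 0.0, 0.0, 0.0, 0.0])
pred = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # a single mistake at step 2

print(per_step_error(pred, true))     # [0. 0. 1. 0. 0.] -> error vanishes after the mistake
print(compounding_error(pred, true))  # [0. 0. 1. 1. 1.] -> the mistake persists in the total
```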
And so what we did, right, is we trained it on less realistic games, then we transferred it over to a more realistic game. And then — and this is where it gets really exciting — we transferred it over to real-world video, which means that you can use any video on the internet as free training.
What was it predicting?
It's predicting as if you were controlling it using keyboard and mouse. So as if you were basically playing this sequence as a human.
Is there some sense of error or?
So that's why you transfer to more realistic games first. Yeah. And then you transfer to real-world video, because you can't get a sense of ground truth from the real-world video yet. Let's see. And then — so I'll show you here.
This one is also — this is the same agent that I just showed you.
This is playing against other AIs?
This one's playing against bots, yeah. The previous one was against players. But with the sniper, it doesn't really matter that much, I should say. So one thing that's really interesting is you notice that it behaves differently as it has, like, different items. Right?
That makes sense. Yeah. Intuitively. Yeah.
I think there's also a question about egocentricity versus, sort of third person. Yeah. Does it matter?
Third person, I think, will be very, very helpful if you're, for instance, trying to control multiple objects in an environment later on. Right now, I think having full first-person perception is quite helpful. This one's also — this is the policy itself.
What do you mean this is the policy agent? The agent.
Yeah, same as the ones that I just told you about. Yeah. Like this — it hides. That to me was just incredible. Like, just from being able to predict.
The appearance is also high when you see it.
Exactly. Yeah. Yeah.
And it needs the spatial intuition to go, well, this is hiding and that's not hiding.
Exactly. And right while it was reloading, yeah. Okay, so those are — that's the policy, and this is a completely general recipe, meaning we can scale this to any environment.
Is this work closest to — okay, no, let's keep going on demos until —
I was going to go into the research next.
Yeah, yeah, sounds good. Okay, so what I'm about to show you are the world models. There are a few really, really interesting parts about our world models. So the first is we actually made the — sorry, we made the decision to pre-train world models from scratch, but we've also been able to fine-tune open-source video models to get a better sense of physical transfer. And so one of the things you'll notice here is that our world models have mouse sensitivity, which is something that gamers absolutely want, right?
So, you have these very rapid movements, which you couldn't do in any other world model. And so, this is a holdout set — this clip was never seen at training time. As you can see, it has spatial memory. This is about a twenty-second-ish generation.
And here's what's fascinating. This is an explosion that occurs, and you can see that in the physical world, right, the camera would shake, and in the game that would never happen. So you see the world model inherits the physical-world camera shake, but the actual game never does that — that to us was quite fascinating, right? Add to that the models I just showed you that we use to transfer over from video: the two of those combined will allow us to push way beyond games in terms of training. This is another interesting one — so this is a world model, this is rapid camera motion. So again, this is stuff where we're literally just taking one second from here, the context and the actions, and replaying it here, right?
And so what we're saying is: the skill that you see in the clips, that speed and the movement, also pays off at training time when you're doing world models. This is my favorite example. So this shows that the world model is capable of performing with partial observability. So what you're going to see is, again, you're replaying the actions from here, and here it's just using one second of video context. Everything after that is completely generated.
So what you're going to see is the model is going to encounter, in this case, smoke. Normally, models break down here. What you actually see is it comes out of the same place. And so it's capable, even with partial observability, of still maintaining its position in the world. And then here it is also interesting.
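A rough sketch of the rollout protocol being described — seed the model with about one second of real frames, then replay the recorded actions and let it generate everything else. The `StubWorldModel` and its `predict_next_frame` method are stand-ins, not the real (non-public) model interface:

```python
class StubWorldModel:
    """Placeholder for the (non-public) model: just repeats the last frame."""
    def predict_next_frame(self, frames, action):
        return frames[-1]

def rollout(world_model, context_frames, actions):
    """Seed with ~1 second of real frames, then replay the recorded actions
    and let the model generate every subsequent frame."""
    frames = list(context_frames)
    for action in actions[len(context_frames):]:
        frames.append(world_model.predict_next_frame(frames, action))
    return frames

generated = rollout(StubWorldModel(), context_frames=["f"] * 30, actions=["a"] * 600)
print(len(generated))  # 600 frames total: 30 real, 570 generated
```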
So, it's swiping. So, this gives you, like, a —
Reaction time?
Like, the fact that it can do depth and sequences in completely different views, right? So, this is a completely different view than if you were to be outside of that view, right? And so, it's able to maintain consistency. While zooming in. Yeah, exactly.
And so, yeah, so you can see. So even while this goes out of scope, right? Watch. And then it comes back and you'll see it's still there. Yeah.
And so, yeah, this is the work that that Anthony Hu has been working on.
I'm just wondering how much game footage you have to watch in order to find these things.
We can ask Anthony. No. It's it's I'm I'm sure he's not gonna be too excited to play these games afterwards.
You're not clicking, right? You're just watching.
Yeah, yeah, yeah. Great. Okay. So those were the models. Let's see.
These are interesting. So we were also able to distill into really, really tiny models. So this is, for instance, a long sequence on a very, very tiny one. You can see it makes a bit more stupid mistakes — like, it does things that are not as optimal.
I haven't seen anything yet. In the beginning it was running into a wall for a bit.
I mean, I do that too. Yeah. Yeah. It looks — I mean, it's doing pretty well.
Yeah. And and again all these models are running completely in real time. There's no
So I was thinking your main model does real time anyway. What's the goal of distilling? Is it cost or
Yeah. Parameters. Yeah. Yeah. Yeah.
This is the interesting one. It peeks the corner. That's what we mean by, like, the spatiotemporal reasoning aspect — humans actually sort of simulate the optical dynamics of their eyes and how to actually, spatially —
I think it's a little bit the data, right? It's seen all this.
Yep, exactly. And so — this is kind of interesting — even in the real world with, for instance, YouTube data, right, you have to first solve for pose estimation. Then once you have pose estimation, maybe you do something like inverse dynamics, right? Where you basically are able to somehow label some of the actions that you're seeing. And then you still have to account for the optical dynamics of where your eyes were actually looking before the decision — so there are three levels of information loss. Whereas when you're playing video games, you're actually simulating the optical dynamics with your hand, right?
And I think that's why games are a better representation of spatiotemporal reasoning, initially, than YouTube videos, for instance.
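For reference, a toy version of the labeling pipeline he describes for real-world video — pose estimation, then inverse dynamics, with gaze/optical dynamics as a third, unmodeled source of loss. Every function here is a placeholder, not a real library call:

```python
# Illustrative only: each stage below is a stub standing in for a real model.
def estimate_poses(frames):
    """1. Pose estimation from raw frames (first source of information loss)."""
    return [{"t": i} for i, _ in enumerate(frames)]

def inverse_dynamics(poses):
    """2. Infer which action moved the pose from step t to t+1 (second loss)."""
    return [("move", a, b) for a, b in zip(poses, poses[1:])]

def label_real_world_video(frames):
    poses = estimate_poses(frames)
    actions = inverse_dynamics(poses)
    # 3. Gaze / optical dynamics would be a third, unmodeled loss --
    # in games the player's hand already "simulates" it, so the label is exact.
    return actions

print(label_real_world_video(["frame0", "frame1", "frame2"]))
```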
Okay. We're in the GI offices with CEO Pim de Witte. Welcome. Thank you. Thanks for having us in your office.
Yeah. Excited to be here.
If I'm in New York and you're one of the hottest raises of the year, I have to come and visit, and thanks for taking some time on the weekend. Yeah. Yeah. So you've raised a $133 million seed for General Intuition. Most people haven't heard of you — I guess that's because GI is new, but more gamers would have heard of Medal.
Indeed. And before that, you ran a private RuneScape server.
Yes. Like, the largest private RuneScape server. What's your reflection on just that whole journey — like, now you're an AI founder?
Yeah. And you started off playing RuneScape.
Yeah, I think — so, I grew up with that. I spent most of my time as a teenager coding and playing video games, so in that sense, it doesn't feel that much different. So, yeah, I started the largest private server of RuneScape, worked at Doctors Without Borders for three years, first on Ebola, and then on, like, satellite-based map generation for disaster response, which was already very AI-adjacent. I built some models back then and then started Medal, which became one of the largest social networks in video games.
I've always been kind of, like, AI-adjacent. You know, I'm a self-taught engineer, so for me, the modeling itself always felt a little foreign. I actually had to take tons of classes over the summer and early this year to get better at it, because it still felt like I was really, really good at the infrastructure side — I had written our transcoders for Medal myself, so I was very familiar with CUDA and the GPU side and all the video infrastructure we were using — but the modeling side itself was still quite foreign. Luckily, obviously, I have really, really good co-founders, and they essentially put a bunch of coursework together for me to go complete to get really good at understanding the fundamentals better. I had seen, inside of the labs, the ones that had really, really good leadership with fundamentals on top and also the ones that didn't, and I think the ones that did were just much better.
And so for me, yeah, I wanted to be more like that. So in that sense, it was first very foreign, and now I feel pretty comfortable with everything. But yeah, I think there's a lot to be explored starting in video games. And also reverse engineering — I think the interesting thing about reverse engineering is it kind of teaches you to look at problems very differently. It's like the ultimate form of deductive reasoning in a way.
And so this is just how I think, how I operate, and so for me, it's been a really, really interesting journey. You know, I don't claim to have any of the credentials or skills that some of the other guests you have on do have, but hopefully it will make for a good time.
Yeah. Well, your co-founders definitely bring a lot of that ability, and you bring a lot of the, I guess, gaming expertise.
Well, it's mostly them, truthfully. We'll see what I bring to the table.
Yeah. Just a little bit of history of Medal. Like, let's establish Medal for those who don't know. The line is: bigger than Twitch. Yeah.
Yeah. That is — you have more concurrent active users than Twitch, something like that.
Yeah. On the creator side, I think. And the reason is because Medal is a lot more like Instagram than it is like Twitch. The way to think about Medal is it's a native video recorder, unlike something like Twitch where you actually have to use other software to record and stream. It's not streaming software. It's actually video recording software, and a lot of gamers love to put things like overlays on top of their videos.
And as a result of that, we have sort of the largest dataset of ground-truth, action-labeled video footage on the Internet, by maybe one or two orders of magnitude.
Yep. What's an example of an overlay? Like, the only overlay I usually think of is a face cam.
Yeah. Yeah. Also, controller overlays, for instance. Let's say you're playing on console — yeah, like a flight simulator — you get, you know, the joystick and all the things.
So you get the actual actions that people take inside the games as well as the frames of the games themselves. Yeah. Which is a loop, right? Because it's essentially: you perceive, then you act, then there's a state update, and then you perceive again, you act, state update — which is roughly, precisely what you use in order to train these agents.
Yeah. That's — it's almost perfect training data.
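In code, that perceive-act-state-update loop yields exactly the tuples you'd want for imitation learning. A minimal sketch with illustrative names, not GI's actual data format:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Transition:
    """One step of the perceive -> act -> state-update loop described above."""
    frame: Any        # what the player saw
    action: Any       # what the player did (the overlay / event label)
    next_frame: Any   # the resulting game state

def clip_to_transitions(frames: List[Any], actions: List[Any]) -> List[Transition]:
    """Turn an action-labeled clip into imitation-learning training tuples."""
    return [Transition(f, a, nf) for f, a, nf in zip(frames, actions, frames[1:])]

# e.g. a 3-frame clip yields 2 (frame, action, next_frame) tuples
print(len(clip_to_transitions(["f0", "f1", "f2"], ["a0", "a1", "a2"])))
```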
You were showing me in the demo — and we show some B-roll here — how you don't log keys. It's very important for you to log actions. Yeah. When did you figure this out?
Oh, starting a year and a half ago. Yeah. And we realized that figuring out this side of the research — we very much never wanted to be in a position where we eroded privacy or something like that, so we never wanted to actually log, like, a W or an A or an S and a D. For researchers, the fact that we don't do that often sounds strange — like, why wouldn't you do that? But I think for us, privacy —
Well, then you'd get the data. Yeah.
I think, you know, a lot of the researchers hadn't quite understood yet that you can actually get away with just doing the actions. And the reason is, at training time, having the actual keys is noise anyway. Like, if there is text on the screen and you would want to, in theory, make that part of the training, then reading text from a frame is really easy. And so for us — we convert, basically, the input to the actual action. We had thousands of humans label every single action you can take in every single video game over the past year and a half, which is an enormous amount of action labels. Yeah, so when you act, we get the actual action itself. That being said, at training time you can, for the general set of that game, convert back into computer inputs if you want to, but you can never do it for any individual person.
And so that, for us, from a design perspective was important. So, we figured all that stuff out, and then we actually started pushing — like, we already had features with this as well. So, for instance, gamers already love to be able to navigate their clips by things that happened. So, we have an events capture system, and then we also have the overlays, where you actually just want to overlay and render the actions on top of your clip. We developed it kind of in tandem with the feature set itself, and then, obviously, when world models became a thing, it was very, very clear that all the data for this was precisely that sequence.
Yeah. We were able to sort of be first to market, recruit the best researchers, and start a lab.
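A small sketch of the privacy design being described: the raw keystroke is mapped to a semantic action on the device and only the action is stored; at training time you can only map back through a game's default bindings, never an individual's. All names and bindings here are illustrative:

```python
# Illustration of the idea only: store the semantic action, never the keystroke.
DEFAULT_BINDINGS = {"w": "move_forward", "space": "jump", "mouse1": "fire"}  # game-level defaults

def to_semantic_action(raw_input: str, user_bindings: dict) -> str:
    """Map a raw key/button to the action it performs for this user.
    Only the returned action is logged; user_bindings never leaves the device."""
    return user_bindings.get(raw_input, "unknown")

def to_default_input(action: str) -> str:
    """At training time you can map an action back through the game's *default*
    bindings, but not through any individual player's custom ones."""
    inverse = {v: k for k, v in DEFAULT_BINDINGS.items()}
    return inverse.get(action, "unbound")

print(to_semantic_action("w", {"w": "move_forward"}))  # 'move_forward' is what gets stored
print(to_default_input("move_forward"))                # 'w' under the default layout
```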
Yeah. And that's incredible. One more question on Medal before I move fully onto GI. It's been ten years.
Yeah.
What is the — I don't even know how you'd frame something like this. You know what I mean? I'm just kinda curious. I like the opportunity to ask you: what really worked? That you became so huge — because you're not the only one.
Yeah.
But as a user, I have a choice — performance and everything.
A few things really worked. I think the first was that a lot of our competitors were focused on solving the social network and the recorder at the same time, and that never — like, our bet was really that we could get so many people to record with us that we could bootstrap the network on top of that, and that worked. So, while everyone was sort of distracted trying to bootstrap a social network, we were just focused on building a really, really good capture tool, and then we got tens of millions of people to use that, and then we bootstrapped a network on top of the share behaviors. We already had the profile behaviors and the share behaviors, obviously, but the actual content consumption piece and the sharing piece really only came after we hit critical mass. It was actually early days during COVID when the network really accelerated.
Fortnite happened, which was really important, and I think also the fact that Discord existed made it quite a different time than when other networks of these types had launched, because Discord essentially was the connective tissue between gamers that never really existed before. And so I think that combination of things really, really made it. I think we also built a product that — for instance, with most video recorders, you have to remember to start and stop the recorder. So you have to go into the application, then hit start, then start your game, and then, you know, maybe you'll play games for three hours, and you'll close the game, then you have to close your video application. Then you have to process, like, a multi-gigabyte file, then you have to upload it somewhere.
And so that was a pain for people. So what we did is we just run this kind of recorder, and when you hit that button, it does a retroactive video record. So all the recording initially is in memory, and then when you hit that button, it exports only that sequence to disk and syncs it to your phone, and so that became super popular. What was interesting about it also is that it means you're not behaving or acting differently, because it's always there and you can just export whatever happens — which is also very, very helpful for training, obviously.
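A toy version of the retroactive recorder described above — frames go into a fixed-size in-memory ring buffer, and only the last N seconds are exported when the clip button is pressed (frame rate and buffer length are assumptions):

```python
from collections import deque

class RetroactiveRecorder:
    """Keep only the last N seconds of frames in memory; export on demand."""

    def __init__(self, fps: int = 30, seconds: int = 30):
        self.buffer = deque(maxlen=fps * seconds)  # old frames fall off automatically

    def on_frame(self, frame):
        self.buffer.append(frame)                  # runs continuously in the background

    def clip(self):
        return list(self.buffer)                   # export only this sequence to disk / phone

recorder = RetroactiveRecorder(fps=30, seconds=30)
for i in range(10_000):                            # hours of play, nothing saved yet
    recorder.on_frame(i)
highlight = recorder.clip()                        # button press: the last 900 frames
print(len(highlight))
```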
And you were the first to do that?
Yeah.
The thing you were explaining just before this is similar to how Tesla does the bug reports. Right? You're driving, something happens, you disengage Autopilot. Yep. They're like, well, tell us what happened.
Exactly. Exactly. See, you're driving. Tesla doesn't wanna train on the, like, ten hours of you driving through a desert where nothing interesting happens. You have the clip button on the steering wheel, something interesting happens — either while FSD is engaged, and I'm not sure if you can use it without FSD as well — but you hit the clip button and it basically marks that precise sequence, which is then more helpful for training because it's more unique at training time.
Yeah. Yeah. I mean — one thing, when we get to the agent side, one thing that does pop up is that a lot of life is boring. A lot of life is going from A to B. A lot of playing games is doing the boring stuff that is not clippable. Uh-huh. Somehow it has to generalize from the highlights.
Yeah. Yeah. It makes you think. Right?
It makes you think.
Yeah. Yeah. It's also quite interesting. Like, I showed you the models, like, what happens when you increase the size of the context window and how behaviors actually are largely shaped by the size of the context window. Yeah.
That that to me was, like, one of the most interesting parts about the research. Made me think about our own behaviors in a way.
Yeah. Let's also talk about, like, forming the team. On your website, you have 12, right? I don't know if that's changed now. Besides the three co-founders.
Yep.
And let's talk about how this team came together, because you may not fashion yourself as a researcher — you don't have that academic network — but you managed to assemble these people.
Yeah. I started reading all the research papers. By that time, I was already pretty deep into having a decent understanding of — not world models, but in particular LLMs and transformer-based models. And so there was Genie, there was SIMA. Those two were really, really interesting.
And SIMA in particular was interesting because what they do is they basically take 10 games, and then they have a graphic in SIMA where you can see kind of the precise actions inside of those games that they mapped — I believe they found something like 100, which are actions that also exist in the real world. And what they did then, I believe specifically for navigation, was a nine-one holdout: they trained an agent on nine games and then had it play the tenth game, the holdout game, but they also trained a specialized agent just on the tenth game and compared how well each did. And if I recall correctly, the nine-game agent did roughly as well playing the tenth game, on navigation specifically, as the specialized one-game agent did. And that to me was really interesting, because that's precisely the type of data that we had, right? And so for us, the thinking was: okay, what if we did exactly what LLMs did?
What if we — right, so LLMs were trained on predicting text tokens, on words on the Internet. What if we predict action tokens on essentially what is the equivalent of the Common Crawl dataset, but for interactivity?
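The LLM analogy in one snippet: swap next-word tokens for next-action tokens. A minimal, hedged PyTorch sketch — the vocabulary size, feature dimension, and frame encoder are all made up here:

```python
import torch
import torch.nn as nn

NUM_ACTION_TOKENS = 512                      # hypothetical discretised action vocabulary

frame_features = torch.randn(8, 1024)        # stand-in for encoded frames (batch of 8)
target_actions = torch.randint(0, NUM_ACTION_TOKENS, (8,))

policy_head = nn.Linear(1024, NUM_ACTION_TOKENS)
logits = policy_head(frame_features)         # frames in, action-token logits out
loss = nn.functional.cross_entropy(logits, target_actions)
loss.backward()                              # pure imitation learning: no RL, no reward signal
```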
Vision input? Yeah. Action output.
Correct. That's it.
But what I think — actually, I'm gonna double back a little bit to a question I had, which is: one of the reasons why I thought you would prefer keyboard and mouse over actions is that the action space is potentially unbounded. Right? You can jump, walk left, walk right, but then also look up, look left, and so on. It's unbounded. So it's huge, isn't it? Isn't that a problem?
Yeah. There are benefits to the action space being small to start with. So I think we're gonna start with anything that you can control using a game controller. But, yeah, long term, we want to actually predict maybe, like, action embeddings and have models sit inside a general action space to be able to transfer out to other inputs as well. Got it.
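A sketch of what a bounded, controller-shaped action space might look like, as opposed to the open-ended action-embedding space mentioned as the long-term goal — the fields here are my framing, not GI's actual schema:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ControllerAction:
    """A bounded, controller-shaped action: a small button set plus two sticks."""
    buttons: Tuple[str, ...]          # e.g. ("jump",) -- small, enumerable set
    left_stick: Tuple[float, float]   # (x, y) in [-1, 1] -- movement
    right_stick: Tuple[float, float]  # (x, y) in [-1, 1] -- camera / aim

example = ControllerAction(buttons=("jump",), left_stick=(0.0, 1.0), right_stick=(0.2, -0.1))
print(example)
```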
Yeah. Okay. And then let's keep going on the research side. So Genie was involved, yeah. And then the co-founders.
Yeah. So, there was the Diamond paper, there was Genie, and then there was SIMA. The Diamond paper for me was really interesting because they had actually managed to get this world model called Diamond running on a consumer GPU — I believe it was a 4090 — at 10 FPS, and you could play it. And they did that on, like, ninety hours of data — ninety-five hours, I think it was eighty-seven hours or so in the whole dataset, something like that.
That was just incredible, right, that they had something playable on that little data. So, I actually cold-emailed the entire group of students and told them, hey, I think we have this thing, and then it was pretty interesting. Right when that happened, a lot of the labs also started understanding what we had, and so it started very aggressively — multiple labs tried to bring us in in various ways, and they were part of that. They basically were seeing that happen, and I think for them that also kind of solidified how real it was.
And then when we chose to do our own thing — you know, initially we thought that we were going to have to just work on world models, right? So we thought, okay, the main meta for this dataset, like Genie, is world models. What we didn't realize at the time, because we have so much of this data, is that we can essentially do these world models in parallel, take the equivalent of the LLM bet mostly on imitation learning, and then use the world models after that to get into, like, an RL stage. Right? And so for us —
And eventually getting rid of the world models? This is something that you can —
I mean, ideally, you get rid of the imitation learning, but yeah. We essentially realized that we could get so far on just imitation learning. The way to look at it is — let's take the LLM analogy. We essentially have sort of the internet, or, like, Common Crawl, if you will, and every single lab is trying to simulate that, right, in order to get similar data, in order to train their agents. And so, for us, the reason why we stayed independent and just did our own thing was we think we could essentially leap every single company that's forced to either be consumers of world models or build world models, take this foundation-model bet for spatiotemporal agents, and be in a place where, you know, we have a lot of customers years before any of the labs even get there.
And maybe the most similar comparison is what Anthropic did with code, right? Anthropic just focused really, really hard on nailing the code use case. Their models are incredible for it. A lot of their customers use it for it. So we just want to become incredible at this spatiotemporal agent use case, and likely that starts in, like, game simulation, and then, using world models, we can start expanding out to other areas.
But you did show me a little bit of how it generalizes — although games are kind of the common denominator?
Yeah. Games and simulation. I would specify it as game engines in particular. So even if you're, for instance, simulating human behavior in Omniverse because you're trying to create better training data for factory floors, you can use it.
Yeah. Meta has a similar dataset because of the Quest.
I never really asked them. I never really looked into the Meta Quest specifically. So you need a few things. You can't just — like, there are lots of companies that have, maybe, recorders, but you also need the public graph. Otherwise, you can't train on the data. Right? You can't train on people's private videos that they have saved somewhere. Right? And so I think you need the social network graph component, because these videos need to be on the internet.
Just to rank?
No, to train on them. Yeah, I mean, I think generally people don't want to train on — because these things, they live on your device usually, right? And you can't train on anything that lives on your device. You actually need to go and upload it. For Meta specifically, I think also VR — the scale of VR is still pretty small. The amount of environments in VR that have consumption at scale is probably in the hundreds, whereas on PC it's probably in the tens of thousands.
And so you get a lot less diversity. The three-dimensional input space of VR is pretty interesting — we see some of this too, obviously. And so, yeah, I do suspect, you know, Meta starts using these types of things, but it's unclear to me whether they can get to a similar scale of data or diversity of environments as we can.
Yeah. There's lot of challenges there.
Yeah. Okay. I wanna take this in a few different ways. But I guess let's finish up the papers. Maybe one more I should mention is GAIA —
Yeah.
— which, I actually interviewed the GAIA authors, but that too seems like a precursor insight that carried over.
Yeah. So Anthony, who led the research on GAIA-2, is also one of the engineers that joined our team. So it's the core contributors for Diamond and then Anthony. And we just had three more researchers start this week. It's been a good week.
And yes, I think a lot of the approaches in GAIA-2 were heavily inspired by Diamond, and Vincent, who was one of the authors of Diamond, was already at Wayve by the time that I emailed them. Anthony also realized what this was, realized that you could scale world models to a much larger scale, and decided to make the leap as well. So I think everybody that sees the dataset makes the leap, but it takes a while to wrap your head around it, because it's like, oh, it's video games. Right? Like, intuitively, it doesn't make sense.
And then when you actually understand and you see, right, how we've been able to transfer it to physical-world video and things like that, then it makes sense, and then everybody tends to jump at that.
I wouldn't call it video games — probably 'RL for LLMs' or something. So then —
Yeah. If I lived in San Francisco, maybe I would. Yeah.
Just a quick note, because we actually cover all these papers in the Latent Space Paper Club. SIMA 2 did not seem to have as much impact as SIMA 1, and I don't really know why — they did a lot more work. Genie 3 had a ton of impact. But I also felt like, because you can't play with the model, for most people it just seems like an extension of all those things.
But I guess, like, any quick takes on SIMA 2 and Genie 3, which are both from this year?
Yeah. I'll talk about SIMA 2. The steerability of SIMA 2 was to me the most impressive part, because lining up the action sequences and the text conditioning is quite hard to do. Right? And the fact that they did — it's also quite interesting, because that means they can sort of use Gemini as part of the flywheel, right?
You can sort of scale this orchestrator as, like, an independent — almost like a puppet master, if you will. And then, in theory, Gemini could orchestrate many instances of SIMA, right? That to me is the most interesting part, and I tend to agree with this view: I think our models will initially be used as — you'll have, like, an orchestrator VLM of sorts that's kind of managing instances and instructing them. And I think for SIMA, showing that you can do this was fascinating. Also, they didn't just have text conditioning; they were also able to do, like, drawings and markings of where to go.
They really took an interesting end to end approach to me that I I look forward to seeing a lot more of.
Are you talking to them? Like, is there any collaboration there?
Yeah. I I think the yeah. We're very friendly with DeepMind. We like them a lot. I just saw the team not too long ago, and I think, you know, big fans of their work.
The headline that I kinda take from all this coverage of you, yeah, is you are the biggest bet that Vinod Khosla has made since OpenAI. Yeah. How did that conversation start?
Okay. So Vinod's style — and maybe I'll get slapped on the fingers for revealing this or whatever, so forgive me if it's bad — is he asks you to, like, draw a 2030 picture of your company, and I think he just picks n-plus-five years, but whatever. I don't know.
I did the same for you.
Yeah. He asks you to, like, walk that back from first principles all the way to today, and he expects you to do that flawlessly, where he can challenge any assumption, any part of the vision that he asks questions about. Right? He has a very technical background. He also has a bunch of technical people on his team, and he truly backs people that have these very large visions — on that vision and the ability to defend it alone.
And that's what he did for us. And I think that's why he made that bet. So, I think also through this question, he gets to know a lot about how technical you are; he gets to know how well you think from first principles, because if that vision is not connected to something real, it's very easy to suss it out by asking good questions. And then he just backs you fully, I think. Like, he really gets in your corner if it's the right fit.
And, yeah, they've they've been incredible partners. They they they've opened so many doors for us.
I had to ask the question. I think it's just, like, a very notable story. Obviously, a lot of work went into it, but it's also worth putting out there. Yeah. For sure. One of the things I also wanted to — I think I kind of asked this question out of sequence, but one of the things that is exciting about talking to you is there are a lot of people like you who are founders of businesses that along the way have a ton of data. And yours happens to be highly valuable.
You pursued it before deciding to do the independent journey. You also talked to other companies about potential licensing or acquisition and stuff like that. What are your learnings from that period? Like, one version of this is very simply: how do you value data?
Yeah. I don't think you can value it unless you actually model it yourself and see what the capabilities are. That's my real answer.
You say model — meaning train a model?
Yeah. But that's obviously not doable for everyone. And also, I think my general advice would be: as model capabilities increase — and models, like these VLMs out there, are also very, very good at labeling, generally, right? What I was afraid of when I was having some of these conversations was, okay, as the capabilities increase, you're just going to need less ground-truth data, and you can do more model-based data generation or synthetic data generation. I would recommend, if you're gonna do large data deals, just try to get a large chunk of equity in the company that you're doing it with, if you can.
Now, a lot of them won't do this, but that to me would — or just go do the research, figure out what's actually possible. In our case, we were quite lucky in the sense that this is actually the foundational data. Right? And that's not true for every dataset.
I think, know, we just happen to to hit a particular gold mine.
But you also did — you were ready, clearly. Right? You did the action thing, like, one and a half years ago.
Yeah. So you do the work. Yeah. That's the thing. Like, you have to be grounded.
Right? And I think that's the hard part, and I think a lot of what's interesting is you can also kind of look for whether scaling laws already exist for your data type — which, for video, there were some, but for these input-action-labeled sets there really weren't any. The other question is: does it go into LLMs? Does it go into world models? What type of model is it going to be used for?
And I think that's an important thing to know. So I just want to — you know, if you're having these conversations with labs about data, just make sure that you actually understand what it's going to be used for, because that's a very, very good way for you to make the decision yourself about whether you want to pursue it. Now, a lot of them won't tell you that, and in that case, I think you generally just don't want to do it. For our case, we really cared that, for instance, there weren't going to be competing products with game developers built, right, because we didn't wanna bite the hand that feeds us, and I think we are part of the games industry. So those questions, I think, are normal, and then we eventually decided, you know, we just have the data, we're just gonna go do it ourselves — and that's when the rest happened.
Yeah. And you assembled the team, and then — yeah. I feel like you've aligned a lot of stars in order to make GI happen. Yeah.
Whereas other data founders, they're at the beginning of this journey. Yes. Like, "Oh, I'm a data founder" — founders who happen to have data. But they have a main business, right?
I don't know if you were ever there.
There's two sides to this, right? It's really easy to be super naive about it, and I had a lot of people tell me initially, oh, it's not that valuable, you're just making this up. And so for me, doing the work and actually understanding it myself was a really, really big part of building the confidence to go start a company.
But a lot of the time it is true that model capabilities increase so quickly that certain data you just don't need anymore. Yeah. And so I think it's really important to get people to do the work such that you can make these types of distinctions. Yeah. And so my recommendation would be: go build models with your data, see if you can create any sort of capabilities that aren't clearly already there.
Yeah. Or on path to being there and then figure out where you go.
Yeah. I did wanna ask this earlier, but you gave me the opportunity — you say you did the learning, you did coursework and all that, and your co-founders gave you some homework. Yeah. Is this, like, some books? I mean, Coursera?
No. This was François Fleuret. So he has "The Little Book of Deep Learning," and then he also has a full course that he's published on his website. I went through the entire course over the summer. I believe it's something like 30 or 40 lectures, with take-home projects and things like that. And I would recommend anybody do this.
It goes through, right, the history of deep learning, like the topology. It takes you through the linear algebra, the calculus, and eventually you end up with, like, the chain rule — by this time you've done all the more important concepts — and it takes you through how you create neural networks using the concepts that you've learned.
Wow, this is super first principles.
This guy — and I've had the opportunity to spend some time with him as well — is one of the most first-principles people I've met in my entire life. I actually asked him, why did you do this course? He's like, oh, because I thought all the other courses weren't right. And because he is so first-principles, he can only explain things from — like, everything you see in how he explains this thing, everything is from first principles, including the history of deep learning itself as part of the course.
And yes, he goes through everything, and by the end of it, I think I now have a pretty good intuitive understanding of how everything works. But obviously still, right — I like to describe it as: I'm like the guy who just got his driver's license. I can drive the car, and my co-founders are like the F1 drivers that have done this for years. They know where all the gaps are, and so I enjoy getting to learn from them. The cool thing is also that world models is just a very, very new space.
Yeah. And so, you know, I I got to bring ideas to the table that, like, no one thought of and not because I'm great at this, just because it's such a new space that, like, people just haven't tried it yet. Mhmm. So
Let's get a hit on definitions. Yeah. What are world models to you?
You know, in a video model, you might predict the next likely sequence, or the next most entertaining frame. What world models do is they actually have to understand the full range of possibilities and outcomes from the current state and, based on the action that you take, generate the next state — so, the next frame. And so it is a much more complex problem than traditional video models. So to me, it is a world that is accurately generated based on the actions that you take, as a result of what's already been generated.
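The definitional difference, written as toy function signatures (purely illustrative, not anyone's actual interface): a video model continues footage unconditionally, while a world model is conditioned on the action you take.

```python
def video_model(frames):
    """Continues the footage with the most likely next frame, unconditionally."""
    return frames[-1]

def world_model(frames, action):
    """Generates the next state conditioned on the action you take --
    the same footage can continue many different ways depending on the input."""
    return (frames[-1], action)

print(video_model(["f0", "f1"]))
print(world_model(["f0", "f1"], "turn_left"))
```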
And just to fact-check: that means it needs to understand physics. It needs to understand, if I'm bringing one type of material into it, how it interacts with some other type of material.
Yeah. I think the interactions are the most important part. I think that's the reason why world models are so fascinating. One of the things that I did when I was studying over the summer was I tried to actually build a super rudimentary PyTorch-based physics engine — which I would not recommend, writing a physics engine in PyTorch, for obvious reasons — but I wanted to, because it's differentiable, so you can generate the —
It's a bit.
Yeah, exactly. And then you can train. And so I wanted to — you know, I got so many people asking me, why aren't you just simulating or generating this data? And I really wanted to understand from first principles why. And I think the most important thing I figured out was that the compute complexity of simulation goes up really, really rapidly with three variables.
First, the number of agents in an environment. Second, their DOF, so their individual degrees of freedom.
Yeah. And then third, the information that each action reveals. So, for instance, with a text action or a speech action, the environment can change so much based on whether you say, right, "water" or "fire," that the outcomes are going to be completely different in terms of how a human would behave in that type of situation. And it goes up so quickly with those three variables that at some point you hit a point where you just want to maximally bet on either video transfer or generation of these environments using world models, because that type of stochasticity is incredibly difficult to simulate, but it's already very, very present in a lot of the video pre-training that goes into these world models, right? And so I think for us, it is more about making a maximal bet on video transfer and interacting with things that are difficult to simulate — and the steerability with text is also really interesting — than it is about betting against simulation or something like that.
And so I think there's still a large market for traditional simulation engines, specifically in areas where video is really hard to get.
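A back-of-the-envelope version of those three variables — my formula, not his — just to show how quickly the branching a simulator must cover blows up:

```python
# Outcomes a simulator must cover per step grow multiplicatively with
# agents, degrees of freedom per agent, and choices each degree reveals.
def one_step_outcomes(num_agents: int, dof_per_agent: int, choices_per_dof: int) -> int:
    return choices_per_dof ** (num_agents * dof_per_agent)

print(one_step_outcomes(num_agents=1, dof_per_agent=6, choices_per_dof=3))    # 729
print(one_step_outcomes(num_agents=4, dof_per_agent=20, choices_per_dof=3))   # ~1.5e38
```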
Is this exactly what the big labs are also saying when they're talking to you?
I honestly haven't talked to the big labs about this. Like, since we started working on these ourselves, I think people are more reserved with what they share with us.
Yeah. Of course. Which leads into, let's say, another question. How would you contrast your version of world models with Fei-Fei's and Yann LeCun's?
Yeah. So I don't know exactly what Yann LeCun is doing today. My understanding is it's similar to the Fei-Fei Li approach — so I'll start with Fei-Fei Li. I think what's really interesting about Fei-Fei Li's approach is that you are in some way able to reuse the splats, right, in game engines and in things that let you stay in a verifiable domain, which I think is a really interesting approach. However, my understanding is they're currently not interactive, which in my opinion is, like, the whole point of world models.
Right? They're environments. They're great environments. And from a business perspective, I think they picked a really important part of the tool chain. But to me, that's not really a world model — though my guess is they'll get there.
Right? They'll start generating —
Yeah. They just have a reusable state.
Yeah. Exactly. Exactly. And I think, right, Fei Fei is one of the, like, founders of the entire space. So I think it's gonna be really interesting to me on on on what maybe that interactive piece looks like for me to really judge their approach.
I think — we interviewed, just before we move to Yann, we interviewed her with Justin Johnson, her co-founder. She was more focused on the physics side of things, yeah, and the interactivity. And they just haven't released it yet. But I do think that basically the splats, if you just add more dimensions — I guess the forces acting on them — then you get interactivity out of the box.
Yeah. Because basically these are virtual items that then have all the normal physics applied to them.
Yeah. I'm excited to see what that looks like when they actually release it. It's really hard for me to comment on anything there. I really like the frame-based approach, because all of our video — all of our training data — is in this format.
Yes. Oh, yeah. We actually asked them about this, and they were like, yeah, it's possible, but they're choosing the splats.
Yeah. And you can also go from splats to frames. Right? I'm sure you can — like, it wouldn't be easy. You'd have to actually render out the environment, do the — sure.
It's not going to be a simple problem, but, in theory, it has to be something that you can do if you really wanted to, because it's almost like having a more ground-truth, three-dimensional representation of the underlying world, right? So I think it's an interesting approach. It might be overkill, right? You're also dealing with a much larger number of degrees of freedom on the output space, right?
So who knows how well it scales? I like the fact that, like, I think these video models also use things like autoencoders, right? You can actually have the world models predict like much smaller, maybe like a
Resolution or size?
Yeah, exactly. And then you can use diffusion upscaling or methods like this to actually enrich it. And so I think that world models — world models in my sense — allow for a much more controlled space that we know really well. Yeah. I'm not suggesting their approach is wrong.
I'm just — you know, this is, I think, what we really like about it. Honestly, Yann's podcast — I don't remember which one it was, but a long time ago — where he basically proclaimed LLMs to be a dead end was one of the things that inspired me to do this.
I think this is very consensus among world models people. Basically, everyone who heard this, like, stops looking at LLMs and just goes over to world models. I would say the main pushback — I asked this exact question to Noam Brown from OpenAI, and he was like, well, they are learning physical models. Right? So this is basically the difference between the two camps. Or what do you wanna put down here?
Yes. So —
Yeah. I'm not one to proclaim LLMs are a dead end, personally. I think they're actually quite useful, particularly as orchestrators. Like, the way I think about it is: as humans, right, we had sort of a three-dimensional world, and then we invented text as, in a way, a compression method, right? We invented text in order to communicate with each other in a common way, in a way that actually compresses all of this information that we are perceiving in three-dimensional space into just a single sequence.
And I think that allowed science, allowed literature — so many parts of the world that we cherish. So, I think it's a critical part of the whole picture. I also agree that it's very, very clear that they do build sort of internal, implicit world models inside LLMs. And so, I think they'll be very helpful as things like orchestrators. The problem is when it comes to generalization — I think text as a generalization backbone —
When most of the pre-training is text, right, or largely text sequences, then I think you want that backbone to be more spatiotemporal in nature and then also just have text as part of that. And I think the actual argument against LLMs is also, for instance, the autoregressive nature of the prediction itself — the fact that it's running the entire output through the transformer in order to predict the next token, whereas the environment in the real world is continuous, right? It's changing. And LLMs kind of just forget about that, right?
I think a lot of the argument is in there, right? So, the fact that text doesn't necessarily generalize well to spatiotemporal context, and then the autoregressive nature of the prediction and using text for that, right? I think those are the two main arguments. I think text prediction is just one of the actions that is going to come out of these policies and world models. I think speech and text generation will just be one of the actions that can be a part of that.
I think that there will just be labs coming at this problem from both sides, And everyone ends up in roughly the same place, and the same place will be whatever people think is cool. Right? Like, whatever the consumer uses
Whatever is closest to AGI. Yeah.
And so I don't think there's a clear answer. I think it's really interesting to come at it from the world modeling side, but it's also because we have to. Right? Because like text is largely commoditized. We can import all the text.
I think it's interesting and tempting — it makes sense that you can probably recover it. It's sort of like you're taking a step back. You're starting your own branch of the ML research tree, but you might actually just end up recovering all the other text stuff emergently.
Yeah. Yeah. We can import a lot of that research. Right? A lot of that is —
That's really cool on the research side. Let's talk about the stuff that GI is — what do you say — more, like, I guess, the sort of research and product outputs. You mentioned the word customers. Who are your target customers?
Yeah. So we're we're already working with some of the largest game developers in the world. Yeah. We're also working with game engines directly. And so really what we're doing at the moment is replacing essentially the player controller inside of a game engine.
So anything that you're currently doing with maybe, like, behavior trees or things that you're deterministically coding, we hope to replace with a single API, which is just: you stream us frames and we predict actions. And that can be inside an engine, or it can eventually even be inside the real world. Hopefully, those are then also steerable. So the models that you saw weren't text-steerable yet, but I think we want to get to a point where they're fully text-steerable.
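A hypothetical client loop for the "stream us frames, we predict actions" API being described; the engine and agent interfaces, method names, and parameters are all invented for illustration:

```python
import time

def drive_npc(engine, agent_client, fps: int = 30):
    """Replace a hand-written behaviour tree with a frames-in / actions-out loop."""
    while engine.running():
        frame = engine.capture_frame()        # what the NPC currently sees
        action = agent_client.predict(frame)  # remote model: stream frames, get actions back
        engine.apply_action(action)           # actuate inside the engine (or, later, the real world)
        time.sleep(1 / fps)                   # keep the loop real time
```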
But text-steerable means, like, "well, I want you to go there," and it figures everything else out on its own?
Yeah. It's text conditioning on the generation. So, yeah, the ability to — you're right. We want to get to a point where you can generally — and that's why it's called General Intuition — where we can sort of mimic the intuition of all these gamers into human-like behaviors in any situation. As I mentioned also, the lab's name is a nod to Demis Hassabis and AlphaFold — wouldn't it be amazing if we could mimic the intuition of these gamers, who are, by the way, only amateur biologists? On his path, he tried to get gamers playing Foldit to generate a lot of data for AlphaFold.
And so for us, really, the north star — what we hope to get to one day — is being able to represent scientific problems in three-dimensional space and then have a spatiotemporal agent capable of perceiving that space and using, hopefully, also the text reasoning capabilities that LLMs have today, in addition to the spatiotemporal capabilities, to be able to work on the other side of that problem. So that for us is sort of the North Star. That's why, you know, we're trying to be hyper-focused on spatiotemporal workloads the same way that Anthropic was hyper-focused on code, and use that to get into organizations and expand from there. Yeah.
Just as a side note, since you mentioned Anthropic: any idea what they did to solve coding? Yeah.
No. Out of any lab, I probably know Anthropic the least. Yeah. I admire them, though.
Yeah. Well, the current working theory is that they had a super lucky roll of the dice. Alright. And then it compounds from there.
That sounds like a nice story. I'm sure there's more to it than that.
Yeah. Okay. So why did the game developers want this?
So if you're a game developer, how well you're actually retaining players, if you have a game that's already at scale, is decently dependent on how good your bots are. So if you're logging in at an obscure time, let's say 3AM in America, your player liquidity is low, and then you need really, really good bots to keep those players engaged.
Is this known? Is this a thing? Yeah.
For sure. For, like, Fortnite and whatever.
A lot of people work with this. Yeah. And so if you're, like,
as a human, do I wanna play against bots?
Usually, it's not just bots. It's players mixed in with bots, because you don't wanna play just against bots, but it's better to have a full game than to have, like, an empty game. Yeah. And so I think as long as it's part of the environment, it's okay.
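To make the player-liquidity point concrete, here is a minimal sketch, under assumed names, of backfilling an under-filled lobby with bots roughly matched to the humans present; the rating thresholds, policy identifiers, and `backfill_lobby` helper are hypothetical, and real matchmakers are considerably more involved.

```python
# Hypothetical sketch: topping up a low-liquidity lobby with skill-matched bots.
# The rating thresholds and policy names are illustrative assumptions.
from statistics import median

BOT_POLICIES = {  # hypothetical skill-conditioned policy variants
    "low": "bot-policy-low-skill",
    "mid": "bot-policy-mid-skill",
    "high": "bot-policy-high-skill",
}


def skill_bucket(rating: float) -> str:
    """Map a matchmaking rating onto a coarse bot-skill tier."""
    if rating < 1200:
        return "low"
    if rating < 2000:
        return "mid"
    return "high"


def backfill_lobby(human_ratings: list[float], lobby_size: int) -> list[str]:
    """Fill empty slots with bots whose skill tier matches the humans present."""
    missing = max(0, lobby_size - len(human_ratings))
    if missing == 0:
        return []
    target = median(human_ratings) if human_ratings else 1500.0
    return [BOT_POLICIES[skill_bucket(target)]] * missing


# At 3AM with two humans rated ~1900, an 8-slot lobby gets six mid-tier bots.
print(backfill_lobby([1850.0, 1950.0], lobby_size=8))
```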
That means you also have to sort of grade that skill level.
Yeah. Yeah. Which we can do, because we know exactly how good people are at these games. Yeah, I think for us bots are kind of like step one, right? So what I was showing you is we're building a general agent that can sort of play any game in real time, but really that extends into all of simulation, right?
Like in GTA V, for instance, people are genuinely role playing real life, and so they're actually behaving in quite aligned ways with the goals they set for themselves. So you have all these examples represented in video games, where you have truck simulator, power wash simulator.
Power wash simulator?
There's power wash simulator where, like, actually, the behaviors that you'd want an agent to be able to perceive, they're all there. Okay.
Minus the stakes.
Yeah. It's really, like, how seriously some gamers take truck simulator. If you haven't seen these clips, you should watch them.
Yeah. They buy the whole, like, truck driving set,
and they're doing the job of a truck driver.
Yeah. Like I mentioned to you, we have more people at any given time on Metal playing with steering wheels in, like, truck simulator and these types of games than Waymo has cars on the road. Yeah. It's a ridiculous stat, but it's true.
Yeah. I mean, so, you know, I used to think that for quality self driving, you kinda just need to play a lot of GTA V. Yeah. I mean, maybe it's actually bad for this.
Yeah. Our bet is not that we can zero shot any of these things. It's just that, like, the next self driving company can maybe collect 1% of the data, because, right, also, for instance, clips already self-select into negative events and adversity. Right? So a lot of our dataset, because it's already highlights, is really precisely what a lot of these companies spend their last 20% collecting.
And I think that's the main argument if you're another company that's looking at what we're doing. The thing that we want people to understand is that anything that you're currently doing in pre training, as long as your robot can be controlled using a game controller, we hope that we can move to post training for you. So our bet is not that we can create the next self driving car company. It's just that the next self driving car company hopefully only needs 1% of the data, or maybe 10% of the data, I don't know, to be able to deliver a really good product.
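As a toy illustration of the data argument, here is a minimal sketch of mining a highlight corpus for negative driving events before post-training; the `Clip` fields, game list, and keyword filter are assumptions for illustration, not the platform's actual metadata or labeling pipeline.

```python
# Hypothetical sketch: selecting negative-event driving clips from a highlight
# corpus for post-training. Metadata fields and keywords are assumptions.
from dataclasses import dataclass

NEGATIVE_KEYWORDS = {"crash", "collision", "rollover", "near miss", "spin out"}


@dataclass
class Clip:
    clip_id: str
    game: str
    title: str          # user-written title, e.g. "huge crash on the highway"
    duration_s: float


def looks_like_negative_event(clip: Clip) -> bool:
    """Crude title-based filter; a real pipeline would label with a video model."""
    title = clip.title.lower()
    return any(keyword in title for keyword in NEGATIVE_KEYWORDS)


def build_post_training_set(clips: list[Clip], driving_games: set[str]) -> list[Clip]:
    """Keep only driving-game clips that look like negative events."""
    return [c for c in clips if c.game in driving_games and looks_like_negative_event(c)]
```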
Yeah. Yeah. The term that comes to mind a lot is active learning. I don't know if you sort of identify with that.
It got less cool for a bit, and now it seems to be on the uptrend a bit. Which obviously you have the best dataset for, the sort of high intensity, or you said negative, but I feel like
it's not all negative. It could be on either end of it.
Yeah. For sure. I think negative events is just the most common term that people use, because, like, if you're Tesla, you want the crashes. You want, like
Right. Yeah. Right. Right.
Right. But it's, so, in gaming it's all of the above. Yeah. So, you know, the model that you saw obviously had some really, really incredible moments, and that was largely driven by the fact that
yeah. Yeah.
that it had a large representation of people at their best.
Yes. Yeah.
And worst.
Yeah. Yeah. Yeah. Amazing.
Okay. Cool. Anyhow, anything else on the
customer development side that you wanna sort of touch on?
Yeah. We're also already working with robotics companies and manufacturing, but the key is that the robot has to have gaming inputs. So our bet is not that we can transfer over to, like, robots with higher DOF than a keyboard and mouse or controller can drive. It's really just that we can move the hard work of pre training, hopefully, to post training.
Yeah. It's kind of like the foundation model that is a very good basis to start from.
Yeah. So they're gonna stream you frames and likely some text. Or you'll license the model to them, because they'll be the ones doing the post training.
Yeah. Our business model is initially going to be an API, again, like the Anthropic API, but you also saw, for instance, some of the video labeling models that we've been able to develop. So the goal is for any company to be able to bring in their video data as well, and we can first create custom versions of the policy, the agent, for you. If that doesn't work, then, for instance, we're already working with a customer where we distill a model and they turn that into a product for themselves.
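For the distillation route he mentions, the textbook version is policy distillation: run the big teacher policy on the customer's frames and train a smaller student to match its action distribution. The PyTorch-style sketch below is only an illustration of that idea under assumed model interfaces, not GI's actual pipeline.

```python
# Hypothetical sketch of distilling a large frames-to-actions teacher policy into
# a smaller student a customer could ship. Models and shapes are illustrative.
import torch
import torch.nn.functional as F


def distill_step(teacher, student, frames, optimizer, temperature: float = 2.0) -> float:
    """One distillation step: the student matches the teacher's action logits."""
    with torch.no_grad():
        teacher_logits = teacher(frames)            # (batch, num_actions)
    student_logits = student(frames)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                            # standard temperature scaling
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```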
People can engage with you
on the agent level, API level. Mhmm. People can engage with you on the sort of model level. Can they also buy data? No.
Alright. Yeah.
We don't sell data. Okay.
Cool. So that's the business.
And is there a world in which
I mean, I think this is on your landing page. If you are, you know, a frontier lab for world models, is there a world in which a more sort of application layer thing comes out of it, like a ChatGPT for whatever?
Yeah. You're gonna see us launch a few things on Metal itself that are gonna blow your mind, as a result of this agent. I'll leave it to the imagination for now.
People will try to figure it out, you know. Yeah.
On the world modeling side, like, I think one thing people underestimate is that Metal is already one of the largest, you know, video consumption platforms as well. People watch millions and millions of videos a day.
Whoa.
So, world model based entertainment and things like that, while it's not, like, a focus for us right now, I think on the consumer side we have the ability to move very, very quickly here and get it integrated in a way that I don't think anyone else can.
Yeah. You could theoretically do a video gen, like a Sora,
like, what is that Instagram-y one? What's the Meta one? Not Reels.
Vibes? Yeah.
You could theoretically generate clips that nobody actually played, even though they'd go viral.
Yeah. I think for us, the games being so human centric is, like, a really big part of what makes it special. Like, I actually just don't think that will work. One thing that we are really excited about, though, and I'll give you one sneak peek of what we're thinking about: what if you could literally replay any of the clips that you have inside a world model, or your friends could play them? Like, I showed you a model that already took part of your clip as context.
So from the replay, you re-enter that world.
But it's also how we go from imitation learning to RL, right? Because, like, it's part of our research roadmap anyway to make every single clip on Metal playable. So who's to say that doesn't apply to the actual clips that you take?
Yeah. Can you say more about the RL potential?
We describe Metal as the episodic memory of humanity in simulation. So when you take a clip, really the way to think about it is you get the highlight of what is maybe three hours of playtime, right? You maybe get two to three minutes of the things that were the most out of distribution. It is genuinely your episodic memory of that playtime in simulation, the things that you most want to remember and share. We want to be able to load, and this is the work that Anthony Hu is doing, the reason why we built world models, every crash that you run into in Euro Truck Simulator or American Truck Simulator or a driving game.
We want to be able to, and again, these are ground truth labels, we know precisely the actions that lead up to the negative events. They're also title labeled: when people upload to the platform, they say, oh, it's a crash. And so we can select all these events, and if we can put them inside a world model, we can go in, right? We can train reward models that then reward based on how you perform in clips that actually contain negative events, for example.
And so for us it's very much about that, right? We can create this LLM moment for video in imitation learning, but actually making every single clip on the platform playable, at billions-of-clips scale, is how we go from imitation learning to RL.
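Read as a training loop, the imitation-to-RL step might look roughly like the sketch below: encode a clip's context into the world model, let the policy act in imagination, and score the rollout with a reward model trained on the negative-event labels. The `world_model`, `policy`, and `reward_model` interfaces are assumptions, and the update shown is plain REINFORCE rather than whatever GI actually runs.

```python
# Hypothetical sketch of RL inside a learned world model, seeded from a clip.
# The world_model / policy / reward_model interfaces are illustrative assumptions.
import torch


def rollout_from_clip(world_model, policy, reward_model, clip_frames, horizon: int = 64):
    """Re-enter a clip's situation in imagination and score the policy's behavior."""
    state = world_model.encode(clip_frames)            # condition on the clip's context
    log_probs, total_reward = [], 0.0
    for _ in range(horizon):
        action, log_prob = policy.act(state)           # pixels in, controller action out
        state = world_model.step(state, action)        # imagined next latent state
        frame = world_model.decode(state)
        total_reward += reward_model.score(frame)      # e.g. penalize imagined crashes
        log_probs.append(log_prob)
    return log_probs, total_reward


def reinforce_update(policy_optimizer, log_probs, total_reward, baseline: float = 0.0):
    """Simplest possible policy-gradient update on the imagined rollout."""
    loss = -(total_reward - baseline) * torch.stack(log_probs).sum()
    policy_optimizer.zero_grad()
    loss.backward()
    policy_optimizer.step()
```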
Cool. We covered a lot of it. Is there anything else that you wanna cover before we sort of grapple with the long term vision stuff?
Yeah. Yeah. I think for us, this is a very, very ambitious long term bet. We need the best researchers in the world that want to work on this stuff. It's really exciting not being extremely data constrained.
Like, we get so many learnings every week that we didn't think were possible, and it makes it a joy working here. Also, the other thing is, because we have such a large data moat, we don't have to be as concerned as the LLM companies about publishing, because we don't
need to. You're the only ones that are able to.
Exactly. No one can replicate the models, right? And so for us, we really want to bring back the original culture of open research, which is why we did the partnership with Kyutai in France.
I actually didn't know that.
Yeah, we just announced our partnership with Kyutai in France, which is an open science lab in Paris, one of the best research labs in the world. Eric Schmidt, I believe, funded it, in addition to some French backers. They are essentially acting as the partner that's currently doing a lot of open research on the data. We also want to partner with universities, because, like, we do believe this is the frontier, but it's so data constrained that really everyone has their hands tied behind their back right now, and so we want to help fix that. So for instance, we want to work with universities to build, like, negative event prediction models for maybe trucks in India, on all the truck data where all these crashes occur.
We have all these things that we know we can do that we just haven't had the time to do. And so if you're listening to this and you're maybe at an academic institution or something and you want access to some of this data in an educational or research fashion, I think we're quite open to doing that, because we want to educate people. And, yeah, other than that, we just want to work with the best infrastructure and research engineers on the planet as we're going into scaling runs that have thousands, tens of thousands, eventually hundreds of thousands of GPUs.
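For the negative event prediction example, a minimal research-style baseline would be a clip-level classifier: given a short window of frames, predict whether a crash happens within the next few seconds. The architecture, shapes, and label scheme below are illustrative assumptions, not a model GI has released.

```python
# Hypothetical baseline for negative-event prediction from driving video:
# given a short window of frames, predict whether a crash occurs soon after.
# Architecture, shapes, and labels are illustrative assumptions.
import torch
import torch.nn as nn


class NegativeEventPredictor(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.frame_encoder = nn.Sequential(                  # per-frame CNN features
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.temporal = nn.GRU(64, hidden, batch_first=True)  # aggregate over time
        self.head = nn.Linear(hidden, 1)                      # P(negative event ahead)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, height, width)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        _, last_hidden = self.temporal(feats)
        return torch.sigmoid(self.head(last_hidden[-1]))      # (batch, 1) probability


model = NegativeEventPredictor()
dummy = torch.randn(2, 16, 3, 96, 96)   # two 16-frame windows of 96x96 RGB frames
print(model(dummy).shape)               # torch.Size([2, 1])
```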
Yeah. Yeah. Amazing. I primed you on this as the closing question. Yeah.
It's a little bit of a crystal ball question, I know. Yeah. So what does GI become by 2030?
Yeah. In 2030, we want to be the gold standard of intelligence. Any sequence long enough is fundamentally spatial and temporal, right? So by nailing spatial and temporal reasoning, you go after the root problem of intelligence itself. What the world looks like is, we want to have 80%... so, I sort of group the stages of AI in three, and I credit Andrej Karpathy for teaching this: bits to bits, atoms to bits and bits to atoms, and then atoms to atoms.
In the atoms to atoms stage, I want GI models to be responsible for 80% of all the atoms-to-atoms interactions driven by AI models. The reason is that we were able to unblock intelligence so quickly in robotics, where intelligence is the bottleneck, that supply chains actually converged on gaming inputs as the primary input method, and they converged on essentially simpler systems that let us do a lot more a lot quicker. So we are essentially the 80% market approach, and you have lots of companies with kind of, like, specialized, maybe humanoid robot OS stacks, that are the other 20. So I want to be responsible for 80% of all the atoms-to-atoms interactions driven by these models, be the gold standard for intelligence, and maybe 100x more in simulation, because I think simulation will actually be the larger market initially. In simulation you have very few constraints, and also, from a safety perspective, simulation is much easier. So I think a lot of the takeoff initially sits in simulation, and a lot of the simulation use cases, like the scientific use cases I mentioned, I'm really, really excited about.
And so, 80% of atoms-to-atoms interactions coming downstream from these types of spatial and temporal foundation models, and then 100x more in simulation.
Yeah. There you go. It reminds me a lot of what Mark and Priscilla at the Chan Zuckerberg Initiative are doing with virtual biology. Because you can do a lot of it in simulation first, and then, yeah.
Or you can do it a lot faster afterwards. Amazing. Thank you for inviting us to your office. Yeah. And thank you for sharing a bit of what you're building.
Thank you. Yeah.
World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI