Episode	Podcast	Published	Duration	Status

Unsupervised Learning with Jacob Effron

Ep 80: CEO of Surge AI Edwin Chen on Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste

December 15, 2025•48m•9,447 words•Edwin Chen, Jacob Effron

Description

Edwin Chen is the founder and CEO of Surge AI, the data infrastructure company behind nearly every major frontier model. Surge works with OpenAI, Anthropic, Meta, and Google, providing the high-qualit...

As the CEO of Surge, Edwin Chen gets a front row seat to everything happening in the foundation model ecosystem. Surge, which is worth a reported $24,000,000,000, works super closely with all the top labs on improving their models, and Edwin had a really interesting perspective on a bunch of different things. I'm Jacob Efron, and today on unsupervised learning, we talked about Edwin's learnings from working with the top labs and the divergent approaches he's seen them take so far. We talked about where models are today, his view on RL, RL environments, RL as a service, and that whole startup ecosystem. We talked about what it means to have model taste and what really good researchers do.

And we also hit on how Edwin's changes views from there being one model to rule them all to there being a constellation of different models. Just a fascinating conversation with someone who gets to see the future being created every day. Without further ado, here's Edwin. Well, Edwin, really excited to have you on the podcast. Thanks so much for coming on.

There's a lot of things I'm excited to explore today with you, but, one place I figured I'd start, I listened in prep for this to a to a few podcasts you've been on and a lot of things you said that were pretty interesting. Know, one thing you've talked about pretty consistently is just the consequences of choosing bad benchmarks as model builders. And I think you've specifically called out Ellen Marina and some of the other things that folks are optimizing for. And so I'd love if you could just elaborate a bit on the kind of consequences of some of these optimizations.

I think the way you should think about Ella Marina in particular is that when you optimize for Ella Marina, you're basically optimizing for clickbait. The mental model you should have in your head of Ellen Marina is that users go on to Ellen Marina, they issue a prompt, and then they get two responses, and then their goal is to basically vote on which one is better, but they actually aren't carefully reading her responses. What they're doing instead is, okay, they just scroll through the two responses. After two seconds, they just pick which one is better. And so they're not carefully reading them.

They're not seeing, oh, yeah, this one followed all of my instructions. This one was completely accurate. This one was well researched. It was high quality and so on. I did they're not doing any of that.

What they're doing inside is, again, they're they're reading through very, very quickly, like maybe one, two seconds, and they're like, okay, which one impressed me the most? What they're gonna optimize for is, okay, which response had more emojis? Which one caught my attention? And they're like, yeah, the thing that caught your attention will be a lot of emojis. It'll be a lot of formatting.

So responses that just contain a lot of markdown, a lot of headers, and so on and so on. They will just naturally prefer responses that seem that are that are longer. Like, yeah, if it's longer, then sure. It has a sheen of expertise to it. They're not they're not they're not reading these things through.

Like, for example, I was actually looking through one of the AlbumSys datasets recently. So they they published a bunch of these online, And I just saw all incredible errors. So one of the errors was, like, I think the prompt was, tell me the tell me all the divisors of fourteen fifty two. And one of the responses said, oh, there there's only one there's only one divisor of fourteen fifty two, and that's the number one. And the other one got it correctly correct.

It was or was like, I I think the problem was actually, what what are the what are all the divisors, under under six of fourteen fifty two? And, obviously, one, two, three, four, and six are all divisors, and so that was the only response. If But you look at the data, again, guess which one the reader prefer like, which which user preferred? The user preferred the one that was completely wrong. And, again, like, if you even think about it, like, this is a fairly simple mathematical question, but if you ask a more advanced prompt, what?

Are you going to like, research it? Are you gonna go fact check it yourself? No. You know, that's just not what you're you're gonna do as a user. And so what ends up happening is that users, again, they're they're basically optimizing for whichever one caught their attention.

It's almost like a tabloid.

It's interesting. It seems like something that pops up in in a bunch of different consumer use cases. I remember there was, like, maybe a study of ChatGPT for medical responses, I forgot, maybe a year ago or so. And they you know, the responses rated way higher than physician responses, but I think when they actually unpacked why, it was literally just because the responses were longer.

Yeah, exactly. That's one of the big things that we've seen. It's that even when you haven't intentionally trained on analysis data, maybe you put a bunch of your models on the site. Then, okay, you have 10 different models that are on the site, and you're just gonna, you know, a b test which one and pick pick the one that performs the best. Again, users just naturally prefer responses that are longer.

It's, like, the easiest way to think that response is high quality, and so we've seen a lot of models. I just think when you go through that process, you just naturally end up responses that are two, three, four x as, verbose as ones that that that aren't that aren't.

And, obviously, like, you know, LMS is, you know, isn't the only thing that people are optimizing for probably and and, you know, leaving things on the table. What are some other examples you've seen of where not optimizing for the right thing really leaves a model on the wrong track?

I think the problems especially arise when you are both optimizing on the wrong kind of like, you're training on the wrong kind of data, you're optimizing for the or you're optimizing for the wrong objective function. And then in conjunction with that, you don't have the right measurements in place. So for example, earlier this year, we were actually starting to work with a new team. A lot of the researchers were telling us, yeah, they were suspecting that their models were getting worse, but they didn't have any quantitative evidence of it because they didn't have the right measurements in place. So we basically dug into the models for them.

And what we found is that actually over the past six to twelve months, their models had actually regressed. And so what happened was, the data that we're gathering from supposedly expert coders, the data gathering, the raters weren't executing the code. They weren't checking the code carefully to see that it was, it it was actually correct. And so what they were doing instead was they were, again, like, very similar to the LMSS use case, they were giving training data that was essentially full of flowery language and grandiose claims, like, oh, yeah. Like, here, I produced this amazing program for you that does ABC things, and they had taken the time to execute it.

You know, executing can can actually often be at times be pretty hard. You have to solve these libraries. You have to have all this infrastructure in place. You have to, make sure that you actually understand the language and so on and so on. And so the code was actually just completely either completely wrong or just full of all these subtle bugs that people wouldn't notice until later.

And so at the end of six to twelve months, because they didn't have any actual they didn't have any actual measurements in place to see whether or not the the models were actually improving or not. They were they were optimizing for this. And it's kinda crazy because, really, the the whole industry is moving forward. Coding is such a such a, such a hot area where everybody else is progressing. And then whatever teams is basically making negative progress because you don't have the right data, don't have right measurements.

I think it's a real concern where enough people in industry just aren't paying attention to the quality of data that they're receiving and whether or not they're measuring the right things.

I mean, it's interesting how how things can actually go south, given all the model improvements that are happening if you're taking shortcuts on the, on the data side. And then also, you know, I was struck by in that in that example, a big part of the problem is just not even knowing for six months, if your if your models are getting better. And I think this is something interesting that we've seen across the industry now where it's not always clear, like, you know, I think there was a generation of model improvements where it was, like, blindingly obvious, that models were getting better in some ways. And now it feels like it's a little more challenging to tell at times. What are the best companies doing to figure out, hey, on a month to month basis, are our models getting better?

So historically, all of the companies that paid really hard attention to benchmarks, like these very high academic benchmarks that have been created by the research community, And not all the labs realized the problems with, basically benchmarks. It's almost, like, very easy to optimize for benchmarks. Like, models, at the end of the day, they're very good at hill climbing, very concrete, very specific, very narrowly defined objective functions. And so what end what would end up happening is that the models would make a ton of progress on on on the benchmarks. And sometimes it would be fake because the, you know, the benchmark data would actually be in training data and people wouldn't realize.

Or even if it wasn't fake per se, like they were actually proving on the benchmarks, what they wouldn't realize is that it's because you've narrowly focused on benchmarks but don't have measurements in place outside of the benchmarks, people wouldn't realize that, yeah, sure, they got really, really, really good at some narrowly defined problem, their models were actually getting worse, kind of like on more real world problems. Like, like, maybe an easy example would be imagine optimizing for an SAT. Yeah. Like, you as a high school student, sure, you can spend thousands of, like, whatever it is, hundreds of hours, optimizing for the SAT. SAT is a very, very narrowly defined set of problems.

Like, sure, it's like, what is it? It's like reading comprehension and analogies and, like, vocabulary. But it's not really measuring your ability to, I don't know, like, derive well or your ability to perform complex problem solving in your real world or ability to do all these other things that are outside of the SE2's domain. And so what ended up happening is it like, again, like, we would actually see companies, like a lot of Frontier Labs, they would suddenly get all these impressive scores on the benchmarks and, we would just play around with the models ourselves, be like, no, your model suddenly got a lot worse. Like, even before they told us that that they were, optimized for the benchmarks, it became prevalent enough that every now and then every now and then, we would start playing around with, like, a model that one of the frontier labs would give us.

It was like, did like, why did this suddenly drop in quality? And I'm like, oh, you guys probably gather a lot of synthetic data to improve on your benchmarks. And, blast them, like, yeah, is that what you guys have seen? They're like, yeah. Like, we suddenly doubled our performance on x y z benchmark, and we didn't realize all all all the other consequences.

So I think that's a real concern, and then, like, basically what the best of our labs have realized is that the only way to measure their, the performance of their models is to run, proper human evals, and the way it works is you're essentially asking people to again, like high quality people, people who are very, really diligent, people who are paying attention to the content of the, of the responses, but then are also really sophisticated and have a lot of taste make sure, yeah, like, this is a fun personality. This is the kind of style that, maybe, like, the Frontier Labs want their when their models emulate. And it's basically mimicking the kinda like the the real user, real world experience, like, the diversity, all the complexity, all, like, the messiness of the real world without the artificial constraints of benchmarks. And then then it's also laying on to quality from, again, like, really, really, experienced and really, really trusted raters as opposed to just anonymous people who aren't paying attention to all on sys, like, the this this, this process of going through these rigorous human evals, that's that's been a gold standard for all of Frontier lab.

I mean, beyond, like, us paying a little more attention maybe, like, what have you noticed about what makes a really good evaluator have taste?

So at high level, it might be the three these, the following three things. One would be certainly expertise in of itself. Like, if you're judging a, like, yeah, like a like a lot of our, like a lot of our, evaluation sets, they're very, very advanced. So they might be measuring a model's ability to perform algebraic topology research, or they might be measuring a model's ability to, like, use PyTorch. And so first of all, you just need, like, really, really intentional people with a lot of expertise to the domain that they're evaluating.

So that's one one piece. Another would be kind of this notion of sophistication and taste. It's like, okay. At the end of the day, you want models to be, like, when you write when you're asking to write code, you don't just care about correctness. You care about, is it cold we're out what we're in?

Is it well designed? Or, like, in the area of, like, you know, essays or creative writing? Yeah. Is was this a really, really well written essay that introduced new ideas, had great prose, and just doesn't feel like AI slob. So there's this notion of sophistication.

Another one would be this notion of kinda creativity. I think people often underestimate how important prompts are. Like, people think of evaluating, like, evaluating models. It's like, okay. Yeah.

You're evaluating the response. But in order to measure the models properly, you need to have, kinda like prompts that span the entire distribution of what you want your models to be good at. And so, like like, within creative writing, for example, if you just ask the model a thousand like, if your prompt set is literally just a thousand thousand stories or a thousand essays, and you're all phrased in the same way as opposed to, know, like, the the the long, long tail of how people actually interact with models in the real world, that's just not gonna cut it. I And think people often underestimate often underestimate how difficult it is to create good prompts and to be creative. It's almost like that phenomenon where what what what is it?

If you just ask me to, like, name 50 foods, it's like, it's actually really, really hard for me to do that until, then once I think fully hard or I force myself to, maybe constrain constrain the list of foods in various ways. In the same way, it's, like, very, very hard, surprisingly, to be creative in this fashion. So, anyways, like, I'll I'll say creativity is the middle piece. Yeah. Finally, the fourth piece is just, like, the ability to follow instructions.

So, like, when we are asking you to or when French dealers are asking you to evaluate the models, they often have very specific criteria that they have in mind. Like, okay. They have a certain, I don't know, style guide, or they have a certain personality they want the model to follow, Would they care about x y z criteria more than, you know, a b c criteria? And so, oftentimes, these are very, complex instructions, and so people just need to be, very, very, very, very good at following them.

You know, I guess moving to, one of the main themes I feel like of of model improvement these days, it feels like RL environments, reward models around them are are all the rage these days. Can you talk a bit about that kind of transition and, you know, how you've thought about, supporting your customers with it?

Yeah. Definitely. So the way I think about R environments is that they're kind of a continuation. I mean, there's almost just, a net the the the next step in, in training paradigms. So in the same way that historically, a lot of work has gone into, like, SFT and then RHF and then verifiers, RO environments are kind of just like the next step in that progression and, you know, I do believe there'll be other steps in the future.

And I think RO environments are also really, really interesting because, like, for example, we've actually been working on our environments for, quite a while now, like maybe one or two years. And so it's actually interesting that the Rust industry has only just started picking up with them. Like a lot of the teams that we work with, especially, like, there's a really, really amazing, agents team at Meta. We've actually been working with them on our own environments for over year, so this is a team that created the GAIA benchmark. This is basically their agents team that also just, basically just open sourced their agent research environment, agent R environment platform, and so they, I think, really, really saw the the the wave of of the future.

So, like, our environments are kind of, like, interesting and different. If I think about what we've had to build in order to support our environments, like, what you need when you're creating our environments is the following. So, like, one, you need to have these really, really rich worlds that are basically simulations of, you know, the real world as best as you can. So just in the same way that, so as I mentioned earlier that creating prompts can be surprisingly hard because you want them to be creative and diverse and rich in real world as opposed to, you know, these very, very synthetic prompts that you find in benchmarks. Or you, or, like, just maybe what will naturally happen is if you don't have any constraints on diversity and creativity, just let people create prompts willy nilly, you just don't get much diversity.

So in the same way, when you're creating these worlds, you also really need to have a bunch of really, like, kind of like underlying entities supporting the world. When we create these worlds, we have, we basically populate them with, people and businesses and tools and interactions between these people and, you know, messages, Slack, Sec messages, emails, calendar events, etcetera, etcetera, and we want all of these to basically mimic the real world in very interesting and complex ways, and so a big part of our efforts are, okay, so how do we build the tooling? Do we build the infrastructure? How do we build the quality control measurements? How do we build the data measurements to make sure that this is happening.

And then once we create these worlds, we need to, basically create all the tools that exist in these worlds. So these could be the MCP servers that models are accessing. These could be the browsers or the code that the models are executing. So we just need make sure that we have the, like, the underlying infrastructure for the models to basically run within this world and to be able to execute the prompts. And then we need to, you know, create the prompts themselves.

We need to actually test how models are performing on these. We wanna make sure that, we're basically coming up with tasks that test the the limits of all these frontier models. There's also, like, a very, very big measurement and almost like an introspection, aspect where once we discover that a model has failed, we wanna dig in and understand why. So, yeah, there's lot of as well. Like, really, really interesting infrastructure and tooling that, that we had to go to support this.

I mean, you've been doing this work for a while now. I guess, what have you learned along the way or what surprised you or maybe something that you initially got wrong in the way you set these things up

and have gotten better at? One interesting maybe viewpoint that we've tried to kinda propose is that it's actually very important to, pay attention to model trajectories to understand why they are succeeding and why they are failing. People often underestimate the amount to which models can reward hack themselves to the correct answer. And I feel like there's a lot of

very funny examples of that.

Yeah. Like, the I think we've like, one of my favorite things is, like, looking at looking at some of these examples because they, yeah, they often just perform in all these crazy ways. And then similarly, I think people have actually underestimated the kind of myriad different ways that models can fail and what that says at the the model's underlying capabilities. Like, think people have this, like, maybe notion in their head that, okay, you know, I'm just gonna give a final reward and, also, once I train an FRON reward, everything's gonna work out. I think what happens in practice is that, again, models can just deviate in these very, very odd ways or they can show different types of intelligence depending on what the underlying capabilities are, and if you don't shore up those underlying capabilities enough, models may seem to perform well in the short term, but, like, you're you're basically gonna face a lot of problems down your down your own.

This feels like a theme in a lot of the, the work you do with the labs. Right? Like, there's there's easy things to hill climb on, but if you're not, like, doing it in a deeply thoughtful way, with the right evaluations, it doesn't actually serve the end purpose you're going for.

All these things tie together because it's often the thing that we've seen with benchmarks where, like, people will kind of end up hacking these benchmarks, and so they think that they're making progress because, you know, these these numbers that everybody's paying attention to are going up when, just the underlying model isn't becoming materially more intelligent. So it yeah. Like like, think all of these things tie together.

And then obviously, you know, I think, alongside RL environments becoming a more consensus way to improve models, there's been, like, a flurry of of net new startup activities. Right? I feel like there's probably 30 y c companies trying to do, build RL environments. There's lot of these like RL as a service companies. You know, I wonder what you what you make of that activity one, and I and I guess as a follow-up to that, it feels like every time there's a transition in maybe the main type of ways labs are improving their models, Some people see that as an opportunity to to step in, but obviously, there's persistence from people that were there in the previous wave.

And so wondering, you know, how you kinda reflect on that.

I think one of the things that always drives me crazy about Silicon Valley is that there's this pivot culture that people have in their minds where people are, just constantly pivoting to whatever seems the latest hot topic or whatever seems to drive the highest valuations instead of building things that they already materially believe in. And it's almost like this, like this funny thing where I think a lot of people in Silicon Valley, Dale, maybe, you know, don't talk don't talk shit about Wall Street because, oh, Wall Street, what are you doing? You're just, you're just chasing money at the end of the day. But what, you know, what was Silicon Valley doing? Yeah, you're just chasing the same thing.

You're chasing valuations, you're chasing, VCs, you're pivoting not because you have some amazing great idea, but just because that's what, like, YC told you to do, in order to achieve private market fit and get that you know, get some revenue that you can show show show VCs when you're when you're fundraising. So I yeah. I I I think I really really decide the culture.

Well, and what I'm struck by is obviously, it it seems, you know, it's it's not clear to anyone how long, you know, RL environments are are the flavor of the day to improve models and maybe that persists for a while, maybe it it doesn't. But, you know, I think, what you've, you know, clearly shown the ability to do over time is like whatever the way that models are being improved, like you have an offering that then helps support that. And obviously, I'm sure a lot of that comes from being just deeply embedded in your customers. I I could imagine some folks saying, look, building an RL environment is so different than the first few acts of surge, like, you know, to to what extent do they have the right to go do that versus the the 30 new entrants? I'm curious, like, how on the ground that's, that's felt.

Like, if you think about what our environments involve, they involve the the kind of the following, following three things. So one is, again, like, our environments are just, the next iteration or the current iteration of the data that is needed to train, to to to enable AGI. And so that, I mean, that just fits obviously with our fundamental thesis. Like, at the end day, we just wanna create whatever data is needed to to enable that. So so that's a big, big piece of it.

Second, our environments require a lot of tooling, like I mentioned earlier. So it requires the tooling to create all these tools, to run the models, to measure them, to analyze them, and so on and so on. And, I mean, that's not any different from the fact that, yeah, for RHF, you needed tooling as well. So sure, it's a different kind of tooling but, all of those pieces are the same. Even when we've done RHF, yeah, you need a lot of tooling to ensure that, you're able to analyze the models as they, progress throughout the conversation.

You need to understand the wins and the fails. You need to make sure that, or surges, the people who are creating these prompts and evaluating responses, you need to make sure that they're able to do all these things in a really high quality and diverse way. Yeah, so you obviously need a lot of tooling to support all this. Again, this is very different from, I think, a lot of the other companies in our space where they are essentially just staffing agencies and so they historically haven't built any technology. But we've always been a technology company first and foremost and, yeah, so it's just another type of tooling the same way that you're sure any technology companies build tooling.

And the third piece of it is that, I think people, maybe some of these other startups, they haven't quite realized that creating our environments is all about, getting really, really rich, complex, creative data and there's just no other way to do it beyond using humans. Like, if you think about a lot of the, like, even think, like, SuiteBench. What is SuiteBench? Well, it's a collection of, PRs that were created by humans and then, like, SuiteBench itself wasn't quite clean enough, so people need to build SuiteBenchVerify and that's, you know, people, basically taking the Subedge problems and evaluating them and cleaning it up in various ways. And so, just in the same way, I think we fundamentally believe that creating our environments is a human data problem that just requires lot of technology.

I think some people probably just assume quality is synonymous with, like, credentials. Right? And they're like, well, if I have a PhD in something labeling, you know, or or spending time on on evaluating something, like, of course, that's high quality. But, like, maybe what help us understand, like, an example of something where you actually have someone that seems credentialed on paper to improve models, but it's not happening in practice.

An example I always love to give is that take you take Hemingway. Hemingway didn't have a PhD. I don't even know if he completed college. And, I mean, what we're looking for is, yeah, like, we want the greatest people in the world at every skill regardless of their credential. So even just think about, like, who works at Google, Right?

Google doesn't just hire people based off of what school they have in their resume or degree they have in their resume, and that's not how you progress at a company like Google or any other company. The way you progress is based off the actual work that you do. And one of the interesting things about us as a platform is that, yeah, we have a technology platform that looks at all the data they are creating and then measure it measures it. And so we gather, like, you know, basically millions of signals on our workers every day. We see the types of data they're producing.

And so it's almost like the most meritocratic thing possible that you can imagine. So as opposed to, okay, you progress because you happen to have a degree from Harvard. Like, no. Like, we are going to actually measure what you do and, advance you based off of that. And sure.

Like, we like, we have a ton of Harvard students on our platform. We have a ton of PhDs on our platform. I think we're probably the biggest, like, biggest, like, source of PhDs in the world, but that just isn't sufficient, and and it's not sufficient for two reasons. So one is, like, even if you think about like, take take holders. If you're like an MIT grad who has a computer science degree and you're actually really, really good, well, you're probably not you're you're probably not actually gonna try to create really good data to train these models.

Instead, you're probably just gonna try to cheat the system. Right? Like, okay, you're a really good coder. You're you're, like, fascinated by, write teaming systems. You're fine you're fascinated by, like, adversarial attacks.

Like, whatever you're trying to what you're gonna try to do is you're gonna try to find a way to cheat. And another part of it is, like, even if I think about all of the people who are in my class at MIT or if I think about the number of people that I've interviewed from, from MIT for Fersert itself, Like, honestly, half of them can't even code, right? Like, there's a very, very big difference between reading about something in a textbook and then having the street smarts to execute it. Like, again, this almost, like, ties back to, the the other things I was saying about performance on benchmarks versus performance in a real world. Like, a big problem that the Frontier Labs have had is that their models are almost, too textbook intelligent instead of, instead of having, like, the the street smarts to to do things in real world.

You know, you you alluded to earlier, I think you mentioned one of the, you know, labs is having you, you know, be at their at their internal big conference today. And obviously, it speaks to just how closely you work with those teams. You know, obviously, you had, I guess, a few months ago, you know, Meta decided to buy one of the vendors they they work with really closely in in scale. And and given the proximity of those relationships, I know part of that was also a a, you know, bringing some talent into the organization. But I'm curious, like, your reaction to that.

And and do you think over time, like, some of those things are are natural to happen given just the proximity of relationships?

I mean, I think the the scale acquisition was actually amazing for us because up until then, we'd been pretty under the radar. And so I think all of the top researchers of all of the labs already knew about us, so they knew that we had the highest quality by far. And they knew that we were the biggest and fastest. But the AI field has just been grossing so large that, you know, more and more people are entering the field every single day. And, so that basically the scale acquisition just put a spotlight on us that was really, really helpful for for expanding.

Like, I think we, we just we got so much, like, new demand from all these new teams overnight, and so that was really, really beneficial for us. You kinda

mentioned this this difference in, you know, maybe the culture of of folks are attracted to these types of companies and the things that that people are optimizing for. I'm wondering how that manifests itself and maybe the different decisions you've made for search today and also maybe the different trajectories that these businesses go in over time.

Good question. So I think it's changed or shaped us in a bunch of ways and maybe one of the most profound ones is in hiring. If I think about the culture that we're trying to instill and the type of people that we're trying to hire, it's people who are researchers at heart and people who just fundamentally care about data and AI. When I think about us as a company and what we are trying to build, I think of ourselves as a lot more like a research lab than, like, you know, just another Silicon Valley startup that's trying to chase, chase chase money and hype and and valuations. And one of the concrete ways that manifests is, again, if you think about this idea of hiring people who are fundamentally interested by research and data and enabling AGI as opposed to this type of Silicon Valley person who's kind of embodied by growth hackers and embodied by people who are just doing whatever it takes to increase your revenue.

Like, if think about the incentives of those types of people, what they will often do is they'll basically try to sell you things that you may not need, that they don't think will actually improve your models. I mean, essentially, they'll just try to act like a salesman. As opposed to digging deep into what you need, digging deep into the problems of your models, trying to make sure that you understand all the different ways that you should be measuring your models to make sure that they're actually improving, as opposed to kind just like selling you, like, essentially snake oil. So I think that, that belief in caring more about model progress over revenue, I think that that that would definitely just shape the company.

I I guess speaking of model progress, like, how how do you articulate, like how what's the path to models getting better from here? And, like, does it feel like there's consensus among, you know, most of the top labs around this, or do you actually see some pretty divergent approaches?

Yeah. I think there's been a lot more divergence than we expect. For all of the training paradigms out there, it almost feels like every front row lab has their own take on it. Sometimes those takes are wildly different. Sometimes they're just, maybe slight variations on each other, but there's, there's a lot more divergence now than

what I expected. What are, like, some of the key vectors where that divergence, exists?

I think at a high level, there are two ways in which the companies diverge. One is once they choose their objective function, is it, know, what what type of training algorithm, what type of training data are they going to gather? I can't speak too much about that, but I think an underestimated difference between all the frontier labs is their choice in what they optimize for and what they pay attention to. And so I can give a couple examples. It's like one example is this.

So I've alum, alumarina earlier. And I think one of the fascinating things to me is that some frontier labs, they've just chosen not to pay attention to it at all. And I think those frontier labs have done better because the frontier labs who have had to pay attention, what researchers at those labs have often told me, it's like, okay, like, researchers tell me, I hate LM Arena. They they, like, these researchers understand all of the ways in which optimizing for LM Arena will lead to, this negative progress. It will lead to models that hallucinate because LM Arena users don't care about hallucinations.

If anything, they love them because when your model hallucinates, that just makes it sound wild and crazy and and kinda like a very fun and dicing way, again, just like a tabloid. And so the fact that French hair labs, like some French hair labs, have had the something like the fortitude and the, just like this underlying thesis, underlying belief in what they see as the path to AGI, as opposed to, feeling like they need to, chase publicity and chase, chase, like, this, like, very hyped up and publicly visible leaderboard. Like, I think just the fact that certain Frontier Labs have, felt a freedom at 42 not to pay attention to that because they had such a core belief and what they're going for instead. I I think that that's actually really, really shaped a lot of model progress. And then I think another interesting divergence is, again, in the choice of objective that the frontier labs are trying to optimize for.

And I think you can see this clearest difference between OpenAI and Anthropic. Like, if think about OpenAI, what are they optimizing for now? It's almost like they are leaning more towards optimizing for user engagement. So really long sessions or amount of daily users as opposed to a company like Anthropic who might be optimizing for something more akin to productivity and, like, how much value you can attract, like how much, almost like GDP or how much productivity or time savings you can extract by interacting with the model. And I think that shapes the types of products that they build, it shapes the types of people that they attract, it shapes the capabilities of their models.

So I think it just really shapes them in their in their model ways. Yeah. We're starting to see that.

It's interesting point. So I think it it it feeds into this larger question of, you know, obviously, I think a lot of model progress to date has all kind of, you know, come back to one core large model that can do lots of different things. And, you know, to your point now, it feels like you could imagine models optimizing around consumer engagement and then a productivity model. I mean, I think you've you've seen this with voice. Right?

There's like the enterprise voice people want and then maybe more of like an engaging consumer voice. Over time, do you think that, like, there'll be one model that just is, you know, is is able to context switch across, like, whatever the the thing being optimized for? Or do we actually end up in a world where it's like, no, you probably want, like, this enterprise model or this even per industry models for finance, legal, or or other things?

Yeah. So I think this is one place where my thinking has diverged a lot. So I used to think that there would essentially be one model to build a model, because, yeah, okay. Sure. You have some super intelligent Like,

the ultimate ASI vision.

Yeah. Like, you have some super intelligent model, and it should be able to contact switch and adopt whatever whatever you want it to do. But, actually, I think, over the past year, I've started to realize that it's almost like every company should have a thesis on like, the world is just so rich. There's never going to be a one size fits all solution. Instead, every company or every lab or every, like, AI, it needs to have a thesis underlying it that, of, like, what will be useful in the real world and, what kinds of AI will best serve people and that thesis will shape how the model behaves.

Like, sure, like, two models can be it's almost like two models can be just as intelligent as each other, but they'll have their own personalities. They'll have different biases for how they answer particular questions. They'll have different ways in which they converse with you, and so on and so on. And so just in the same way that, like you think about, I don't know, Google and Facebook, if Google were to build a social media platform, it's going be very, very different from the way Facebook built social media platforms. If Facebook were to build a search engine, it would have a very different take from how Google built a search engine.

And so there's no, like, right or wrong answer, per se. It's just that different companies, different people, they have different fundamental beliefs and and, like, the things that are, like, useful and good for the world, and so I think I think the same thing will happen with AI. So what does

that mean for for your take on how many people should be building models? Or, like, do you think we'll see more model players over time or or convergence from the folks we do have today?

I definitely think that there should be more people building AI because, to that point, like, I think that many, many, many different types of theses are needed on what kinds of AI will be useful for the world, and I I just don't think anybody's figured that out yet.

Should companies be training their own models? Well, as in, like, you know, I'm a really large finance firm or a really large, you know, health care organization.

So I definitely think that eventually every every company should be trained at their own levels. And that's because these models will be so important to the world, and you want to you'll want to eventually deploy it into 99.9999% of use cases. And if you simply rely on models from the frontier labs, what you're optimizing for may not be what you're optimizing for. And just again, just because AI will be so important and you want to deploy it everywhere, I think you wanna get the the best value, best performance possible, yeah, you you should be training them.

Can you achieve that optimization through, like, just some good prompting or some light fine tuning, or do you think it actually requires, like, building somewhat from scratch?

So I think this, again, goes back to what I said earlier about having a fundamental thesis on how AI should serve your your customers and what types of AI you want to build. So if you have a strong thesis, it's almost like having a, like, a product thesis, but if you have a strong product as opposed to just building some commodity product, if you believe that you have some unique take on how the AI should behave, then yes, I think it makes absolute sense.

Obviously, we'll be we'll we'll be curious to see how the cost of doing that, you know, change over time because certainly the the trade off of of making that investment or having that unique take, you know, today, it's it's kind of relatively prohibitively expensive to get to the state of the art. But I imagine a lot of people could build these opinionated models with a thesis and still be six, twelve months behind state of the art and and and be okay. Yeah. Yeah. Exactly.

Like, I I don't think companies are quite ready right now given the state of I think both the companies themselves and the state of AI. But as AI gets better and better, I think it will be increasingly, increasingly important.

Obviously, it's very clear models are getting way better at coding, and these easily verifiable domains. You know, I think there was a time where, you know, you you'd use ChatGPT and a new model would come out and it'd be blindingly obvious, that the models have gotten better. I don't know if I I would necessarily say that's been the case over the last three, six months. You're you're obviously on the inside of this stuff. Do you feel like models are still getting better outside of coding right now?

Yeah. I definitely think they do, and in part, that's through all the evaluations that are running where we see this constant progress. But then it's also it's also true that, I think just the other day, I started using Claude a lot more for for writing in particular, and I was just shocked by how, how much better it was, today compared to compared to a couple months ago.

I do wanna make sure I hit on on on some of the multimodal, you know, models that are being built, you know, whether it's video, robotics, stuff being done in in bio. I'm sure you thought about some of this stuff. You know, to what extent is that interesting to you? Is it kind of like a similar set of problems, or or how would you characterize, what's similar and different there?

Yeah. So I think all these modalities are are fascinating to me. I think one the things that people don't realize is that we actually work very, very heavily across all of these spaces already. Like, I think, maybe 50% or more of of our work today, it's actually in domains outside of pure text. Oh, awesome.

So I actually do think it's fascinating. And, if you think about our thesis and what we're trying to enable, it's like this idea of we just want to enable AGI no matter what it takes. And, yeah, if we want AI that is useful out there in the real world and needs to understand all these capabilities, needs to operate across all these domains, and so we just wanna do whatever it takes to to make it happen.

What's, like, quality mean in the in the video context? I could certainly imagine, you know, I get I get, like, these text use cases, but, for video, you know, what's what's that mean to you guys?

Yeah. I think people often underestimate how how important quality is across even, across these modalities that people think are surprisingly simple. So one example is, again, even in the types of prompts that you're creating, you need a lot of creativity and, kinda like technology to make sure that you're exploring the full distribution of the space. I think people often underestimate that, because they just think that people can, kind of create prompts out of thin air that target the full distribution of a model's capabilities when it's actually spiraling difficult. One of our big goals is that we often try to teach people or try to, try to make sure that our customers understand.

It's that when you think about quality, you need to go beyond robotic instruction following and robotic correctness think about all of these, kind of like other implications of the prop itself. So, yeah, I I think people often underestimate what what quality means.

Yeah. I mean, what makes, like, either a video evaluator or, like, video itself, like, higher or lower quality if, you know, for as you think about maybe the problems that text models have run into with Ella Marina, I imagine there's similar traps on the video side. Like, what have you learned around that?

Yeah. So, again, I think it boils down to this notion of taste and sophistication that I've mentioned before. It's like, okay, sure. Like, you ask, Scorsese to film a video, create a video about, I don't know, a fish. Then you ask your high school arts graduate to do the same thing, like, you know, just pull over Grandpa's on a street.

Sure. Both people can, create a film about a fish, but Scorsese is probably gonna have a much better film about the fish. And, like, that's where that notion of taste and sophistication and creativity and just, going above and beyond. Because when you think about what you want from, like, from models, it's not just the ability to to literally follow your instructions and do whatever you say. It's to, is to kinda craft something that will blow your mind, something that feels imaginative and creative and, kinda raises raises the bar.

I think that that that's what we're we're struggling to do.

And then, like, robotics and bio in these spaces that require, like, a hardware component for data collection, do you think those are natural extensions for companies like you guys or maybe you're already working in them? Or is that, like, a totally separate, you know, set of companies that might pop up and do that?

Way I think about it is we wanna do whatever it takes to enable the data that that, that is gonna help, accelerate AGI. And sometimes it involves, sometimes it involves building new tools. Sometimes it involves buying new hardware equipment, sometimes it involves, you know, expansion to whatever space. We are yeah. We're a technology company, we're gonna do whatever it takes to to make that happen as opposed to being some, know, like, narrowly constrained company, because we just pivoted in into the area, and, we're not really thinking longer term about like like, we are thinking like, we, in contrast to some of the other companies, we are thinking longer term about everything that's needed to to achieve all these things.

Amazing. Well, we always like to enter interviews with just a quick fire round where we get your take on on on a standard set of topics. And so maybe to we we've hit on this actually a few times, but I'm curious. One thing that you've changed your mind on in AI in the last year.

Probably the biggest thing is, this idea that I used to think that there would be one model to rule them all, and now I actually do see how these different, kind product opinions, AI opinions will will shape every eye going forward.

Obviously, seems like, Search has just been a series of incredible wins and you've built, like, an amazing company. But I'm curious in reflecting back, what's the biggest mistake you've made in building the

So my background has always been in research and data science, and so I used to love publishing and I used to love blogging. I used to love sharing all of our insights. And we did that early on and then I somehow just got too busy to do it and I can imagine how. And I really miss that, like, this idea of, like, teaching the world and and, like, kinda sharing sharing our viewpoints on the industry and what needs to change or what needs to happen in order to make sure that we're on a good path. And so, I think the biggest mistake is that we kind of stopped, stopped publishing as much in the past, like, two or three years and so, I'm hoping to fix that now and, Yeah.

We're basically If you were if you were to, like, you know, get a week of vacation where you could just sit back and write, like, a really long piece, like, what what would you write about or what what's kinda, like, most top of mind for you?

So the most top of thing top of mind thing to me really is this concept of objective functions and what every frontier lab is optimizing for. Like, think it's surprisingly subtle and has surprisingly far reaching consequences. Like, you know, are you optimizing for engagement? Are you also optimizing for usefulness? Are are are you optimizing for number of users?

Are you optimizing for GDV? Like, what like, whatever is that in that concept, has very, very, very far reaching consequences for the industry and for AI at large.

What would you optimize for if you were running one of the labs?

So I would optimize for this I haven't quite crystallized it yet, but it's this notion of a month later, would you be happy that you had this interaction with this model? Would it have almost like changed your life in some way? Like, the more moments that we can get like that, and could it change your life because it, maybe you were asking about a vacation and it introduced you to a serendipitous new location that you've never thought about before or maybe you had a medical question with it and you didn't quite maybe you didn't quite know how to phrase a medical question but then the AI serendipitously, like, noticed something, and, you know, like, taught you something that you wouldn't have figured out otherwise. Like that, I think, is, one of the things I'm thinking about from right now.

Struck by, like, how much of this, of of, like, these questions we're asking in AI just bring up, like, challenges that we've always had as a society. Right? I mean, you were alluding to the SAT challenge earlier and, like, that's certainly you know, it's an imperfect way of of measuring intelligence, and but we've never really found, much better ways. And similarly, even here, it's you know, there's been lots of talk about what technology should be optimized on and what we should be trying to improve in people's lives. And, you know, it's a hard a hard question to answer, but, obviously, an ever important one as these, as these models do get better and are going to hill climb on on whatever it is we are we are optimizing for.

Yeah. Yeah. Exactly. I think, like, a lot of the way I think about AI is and maybe, like, the the worries and consequences of AI is analogous to the parallels with, social media.

Totally. Well, I've been this has a fascinating fascinating conversation. Conversation. I I wanna wanna make sure I leave the last word to you. Where can our listeners go to learn more about you, about Surge, and anything else?

The the mic is yours wherever you wanna point folks.

Yeah. So I would definitely suggest or blog. So we're starting to to blog a lot more, starting to share a lot more, insights and analysis. So I would definitely check check that out.

Amazing. Well, thanks so much. This was a ton of fun. Thanks so much.

Unsupervised Learning with Jacob Effron

Ep 80: CEO of Surge AI Edwin Chen on Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste

0:00 / 0:00

View original episode →

It's almost like a tabloid.

I think it's a real concern where enough people in industry just aren't paying attention to the quality of data that they're receiving and whether or not they're measuring the right things.

I mean, beyond, like, us paying a little more attention maybe, like, what have you noticed about what makes a really good evaluator have taste?

Another one would be this notion of kinda creativity. I think people often underestimate how important prompts are. Like, people think of evaluating, like, evaluating models. It's like, okay. Yeah.

very funny examples of that.

And so wondering, you know, how you kinda reflect on that.

And and do you think over time, like, some of those things are are natural to happen given just the proximity of relationships?

Like, I think we, we just we got so much, like, new demand from all these new teams overnight, and so that was really, really beneficial for us. You kinda

what I expected. What are, like, some of the key vectors where that divergence, exists?

So I think it just really shapes them in their in their model ways. Yeah. We're starting to see that.

the ultimate ASI vision.

that mean for for your take on how many people should be building models? Or, like, do you think we'll see more model players over time or or convergence from the folks we do have today?

Should companies be training their own models? Well, as in, like, you know, I'm a really large finance firm or a really large, you know, health care organization.

Can you achieve that optimization through, like, just some good prompting or some light fine tuning, or do you think it actually requires, like, building somewhat from scratch?

What's, like, quality mean in the in the video context? I could certainly imagine, you know, I get I get, like, these text use cases, but, for video, you know, what's what's that mean to you guys?

I think that that that's what we're we're struggling to do.

Are you optimizing for GDV? Like, what like, whatever is that in that concept, has very, very, very far reaching consequences for the industry and for AI at large.

What would you optimize for if you were running one of the labs?

Yeah. Yeah. Exactly. I think, like, a lot of the way I think about AI is and maybe, like, the the worries and consequences of AI is analogous to the parallels with, social media.

The the mic is yours wherever you wanna point folks.

Yeah. So I would definitely suggest or blog. So we're starting to to blog a lot more, starting to share a lot more, insights and analysis. So I would definitely check check that out.

Amazing. Well, thanks so much. This was a ton of fun. Thanks so much.

Unsupervised Learning with Jacob Effron

Ep 80: CEO of Surge AI Edwin Chen on Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste

0:00 / 0:00

Ep 80: CEO of Surge AI Edwin Chen on Why Frontier Labs Are Diverging, RL Environments & Developing Model Taste

Description

Navigate

Chat with Episode

Navigate

Chat with Episode