This episode features Olivier Godement, Head of Product for Business Products at OpenAI, discussing the current state and future of AI adoption in enterprises, with a particular focus on the recent re...
Olivier Godement is the head of product for enterprise at OpenAI. I'm Jacob Effron, and today on Unsupervised Learning, I got to ask Olivier all my top questions. We talked about the progress he's seen in AI for science and how OpenAI models are helping there. We talked about the scaffolding and harnesses that are emerging around models, the patterns he's seen in different spaces, and the extent to which he thinks they will converge, as well as whether foundation model providers will provide this for startups. We talked about his reaction to Andrej Karpathy and his comments about agents still being a decade away.
We also talked about the next frontiers for models and what Olivier thinks will be most important to unlock further value in enterprises. Olivier also gave a lot of really great tips on what the best enterprises using OpenAI do to adopt new models and fully get the most out of these products. Just an awesome conversation with a really brilliant mind. I think folks will really enjoy it. Without further ado, here's Olivier.
Well, thanks so much for coming on the podcast. Really appreciate it.
Thank you for having
me. And it's extra fun to get to do it in, you know, OpenAI HQ. I know. I'm probably the least qualified person to ever sit in like the demo day launch seat, and so it's fun.
You're in the mothership.
You know, lots of things to discuss today, but I figured maybe one place to start would be with 5.1. Obviously, you've got a new model, and I think some really interesting improvements on the personality side and on reasoning as a whole. Maybe just talk about how these sets of model improvements actually come about, what you've seen so far that you're excited about, and how people are using it?
Yeah, totally. So we shipped a bunch of new models last week: GPT-5.1 and GPT-5.1 Codex
Yeah.
For coding. Those models were basically trained based on the feedback on GPT-5. When GPT-5 came around, people loved the intelligence, people loved the ability to follow instructions, the stability. People didn't like the speed. The model was really good when it was thinking for a long time.
But for more basic queries, GPT-5 was too slow. And so that was one of the main design goals of GPT-5.1: keep that intelligence, but try to compress the thinking tokens as much as we can in order to respond way faster. In that regard, I think the goal has been achieved. We are seeing people switch pretty seamlessly between low thinking effort for fairly basic queries and larger thinking effort for more complex queries.
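For developers who want to mirror that switching behavior, here is a minimal sketch of setting reasoning effort per query with the OpenAI Python SDK. The model identifier and the exact shape of the reasoning parameter are assumptions for illustration, not quoted from the conversation.

```python
# Hypothetical sketch: routing simple vs. complex queries to different
# reasoning efforts. Model name and parameter shape are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(question: str, complex_query: bool) -> str:
    response = client.responses.create(
        model="gpt-5.1",                                   # assumed model identifier
        reasoning={"effort": "high" if complex_query else "low"},
        input=question,
    )
    return response.output_text

print(ask("What's the capital of France?", complex_query=False))
print(ask("Prove that the square root of 2 is irrational.", complex_query=True))
```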
The Codex model has been gaining quite a bit of adoption as well. I think Codex is having a moment at the moment.
It definitely seems like it's having a moment.
Developers are loving it. We're starting to see more and more frequent use. We're seeing companies adopt Codex as well.
Did that surprise you? Or had you played around with it enough internally to know this was coming?
It's probably the model that we dogfood the most internally, just by virtue of the fact that every single software engineer and researcher at OpenAI uses or has used Codex. I think the latest stat is that engineers on the team are able to push something like 70% more PRs because of Codex. So, yeah. It's become clear, and it's become part of the tool stack at OpenAI. Yeah.
Yeah.
No, I mean, it's been impressive to see. And with that kind of improvement in latency, and the improvement in the overall Codex models, are there any new use cases you've seen unlocked, or fun ways you've seen folks start to use these models?
I think most of the use cases were pretty much what we expected, but better. Coding, productivity, any knowledge query, retrieving information, customer support, customer experience. Same but more, essentially. The one domain which has surprised me quite a bit over the past few months is the usage of GPT-5 and 5.1 among the scientific community. We're seeing more and more reports of scientists and researchers using LLMs to perform their job.
Their job meaning condensing and aggregating scientific literature and knowledge to test hypotheses faster. It's just been really exciting. Because it's funny: when I joined OpenAI a couple years ago, that was always one of the goals of the company. Like, hey, how cool would it be if you could accelerate scientific research? I would say for the first time, in the past few months, I'm actually feeling that it's happening.
Of course, it's early and we have a ways to go. But for the first time, we're seeing pretty stellar scientists tell us: yes, I have done my job faster. I've achieved the discovery, the proof, faster because of LLMs.
How much faster do you think this is anecdotally as you talk to these folks? What kind of speed up do you think we are at now? Is that an eval that you think about hill climbing on as these models get better?
Mark Chen, our chief research officer, was working with a physicist working on black holes. And the physicist gave it a really hard task, which was: hey, here's a paper I just released, so it's not in the training sets or anything. Just released. Try to essentially reproduce the math.
Yeah. And it took GPT-5 Pro, I think, something like thirty minutes to achieve it. And, you know, I'm not a physicist, I cannot judge. But the physicist, the author of the paper, was telling us it was weeks of work for a professional physicist, essentially, to achieve that.
So I think we're starting to see some clear glimmers of acceleration in that work.
On the one hand, with coding, support, science, it feels like there's been this really exciting overall progress in models and their ability to do all these things. And then there's been this public conversation about where we are with these models. There was this big moment when Andrej Karpathy went on Dwarkesh's podcast and said basically, look, the industry is making too big of a jump, trying to pretend some of this stuff is amazing when some of it is slop. It'll take, I think he said, at least a decade until AI can meaningfully automate entire jobs. I'm curious, from your everyday work with enterprises, seeing what these things can and can't do, what was your reaction to that overall take from Andrej?
I mean, it's no secret that building a really good agent is hard. We haven't reached the stage where you can just take a model today and automate any white-collar job. I think we're starting to see some quite strong automation use cases in a few specific fields. Coding is the one that comes to mind.
I think at this point, we've reached the point where, if I were to take away AI coding tools from software engineers, there would probably be a riot or something. People would probably quit, essentially. So that stuff is happening. I would say the automation is probably not yet at the level of completely automating the job of a software engineer. But I think we have a line of sight, essentially, to get there.
We talked about customer experience, sales, customer support. We're starting to see fairly strong cases of adoption. I've been working a bunch with the folks at T-Mobile, the telecom company in the US, to essentially provide a better experience to their customers.
And we're starting to achieve fairly good results in terms of quality at a meaningful scale. But that stuff is hard. On top of the model, on top of having a really good model that you train, you have to build a really good harness: how do you connect the model to tools, how do you prompt the model? You have to build a really good evaluation framework. And then you have to build some sort of flywheel, human in the loop, to constantly improve that model harness.
It's a lot of work, but my sense is we'll probably be surprised in the next year or two by the amount of tasks that can be automated, really.
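To make the pieces he lists concrete, here is a deliberately stripped-down harness skeleton: a model call in a loop, a tool registry, an eval over golden examples, and a human-feedback log. Every name here is a hypothetical placeholder for illustration, not OpenAI's internal design.

```python
# Illustrative agent-harness skeleton: model + tools + evals + human feedback.
# All names are hypothetical placeholders, not OpenAI internals.
from typing import Callable

def run_agent(task: str,
              call_model: Callable[[str], dict],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> str:
    transcript = task
    for _ in range(max_steps):
        step = call_model(transcript)                # {"tool": ..., "arg": ...} or {"answer": ...}
        if "answer" in step:
            return step["answer"]
        observation = tools[step["tool"]](step["arg"])   # connect the model to tools
        transcript += f"\n[{step['tool']}] {observation}"
    return "gave up"

def eval_pass_rate(agent: Callable[[str], str],
                   golden: list[tuple[str, str]]) -> float:
    # Evaluation framework: fraction of golden examples answered correctly.
    return sum(agent(q) == expected for q, expected in golden) / len(golden)

feedback_log: list[dict] = []   # human-in-the-loop flywheel: corrections feed the next iteration
```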
Are there any industries or end use cases that you feel are on the cusp, where you'd be surprised if there wasn't a lot more activity happening in a year or two?
Oh, my bet is often on life sciences. Yeah. Pharma companies. I've been working a bunch with Amgen and a few others.
It's really interesting. Essentially, when you ask Amgen, why do you exist as a company, their goal is to design new drugs, essentially. Design and develop new drugs. There are roughly two big chunks of work that go into the drugs.
There is the actual R&D, experiments, validation, scientists, and then there is more admin work. The admin work is huge. The time it takes from once you lock the recipe of a drug, essentially, to having that drug on the market is months, sometimes years. And that's about generating really complex regulated documents, having committees review them, having regulators review them.
So a lot of information sharing, transformation, and validation, which it turns out the models are pretty good at. They are pretty good at aggregating and consolidating tons of structured and unstructured data, spotting diffs and changes across documents. So we've been working quite a bit with them. And my hunch is, of course, regulated industries take things very carefully, but once we figure out a way to properly version, audit, and release the model with the right permissions internally, we're going to see a big wave of adoption in life sciences.
And the outcome will probably be more medication and new drugs essentially for people, which is pretty cool.
Yeah. I mean, obviously an industry that's heavily regulated and requires lots of paperwork and submissions built from other sets of documents is a pretty perfect LLM use case.
Yeah. And there are examples all over the place. I was meeting with a pretty big investment firm the other day. Similarly, their job is to aggregate and analyze mountains of data in real time and try to make sense of it, essentially, to make judgment calls. You know?
And again, the models' superpower is to do that sort of massive analysis in a split second. And so we're starting to see quite a bit of adoption among hedge funds, investment funds, banks, on that.
It's interesting, because one thing I'm struck by is that so much of the application layer today is companies that go deep in some industry and forward-deploy. They're just like, okay, what are your problems? Oh, you have a bunch of documents and you're trying to do this workflow, and they build those workflows. Even for all those examples you provided, T-Mobile, Amgen, this investment firm, there are also startups: Sierra might wanna serve that T-Mobile use case, a company like Colate might wanna serve the Amgen use case. How do you think about when it makes sense for you guys to be the ones working super close with the customers, and when it makes sense for the broader ecosystem?
Yeah, it's a good question. What I've learned over the past couple of years working on AI adoption among businesses and the enterprise is, frankly, the size and the depth of the problem. Once you pick any of them, Amgen, T-Mobile, BNY, any company, the amount of complexity and use cases they have internally in order to operate at that scale with that level of quality is absolutely huge. And so I think we at OpenAI have no illusions that we're going to be the only ones to build really good agents, great products.
Part of the mission is to enable that ecosystem, third parties, to work better with us. We're thinking about ways to achieve that in the product, distribution-wise. We announced Apps in ChatGPT at DevDay this year, which is one of the first, I think, big vectors of partnership with the ecosystem. The feedback we've heard, especially from enterprises, is: hey, employees love ChatGPT as a sort of universal interface.
You know? They wanna do more in ChatGPT. And at the moment, ChatGPT is pretty capped on just OpenAI features, essentially. And there are tons of startups that wanna build that specific feature for that specific market and just benefit from the adoption, the memory, the connectors of ChatGPT on day one.
So that's one vector, and I think we'll do more of those in the future.
Does pretty much every enterprise worker in the future open ChatGPT right when they start their day? Is that kind of how you envision it?
I think so. It's hard to predict for sure. I don't think ChatGPT is going to replace every tool. For sure, if I'm a financial analyst, I'm going to spend my day in Excel or some tool which was purpose-designed for my specific use case. But I do expect ChatGPT to become more and more the first place that you check in the morning.
Yeah. I don't know how much you've been using Pulse. Pulse has been a pretty monumental feature for me. It's very on point now, preparing my day: hey, yesterday night that email came in, that meeting is coming up, you should prepare. By the way, that paper came out.
It's becoming a really, really accurate and insightful source of productive information. And so I think we'll see more and more of that. I think we'll see more and more people taking actions in ChatGPT, simple actions as well. So, yeah, my hunch is that ChatGPT could become the sort of first website in the morning. And then, of course, there are some deep workflows where you would double-click and do it somewhere else, like coding in an IDE or doing data manipulation in a spreadsheet.
One thing that's fascinating about the space is that just as we're discovering the capabilities of these models, they also get way better. We've only been playing with these models, GPT-4 and above, for a few years. If we froze model capabilities today, does it feel like there's an endless amount of stuff to go build with this current set of capabilities, and we're just discovering more and more? Or are you still kind of waiting for the next leap from the research team to unlock even further things?
Oh man, I feel like I always operate on two or three different time horizons. Time horizon number one is just, with the current model capabilities, unlocking use cases which haven't been enabled yet. That could be as simple as having the right harness, the right use case, the right data, but with current model capabilities. Then I'm trying to think a couple of months in advance, like I mentioned for GPT-5.1.
For some of the use cases, it takes a little bit of smart post-training on the right data to really make the experience just better: faster, more aligned, more steerable. So that would be the second time horizon. And then the third one is a fundamental breakthrough, in the shape of o1, essentially.
Like, hey, what could you achieve if the model could reliably think for thirty minutes? That one is harder to predict, because you can't predict when a research breakthrough is going to land. But for sure, we think pretty deeply, for the products that we release and the use cases that we enable with customers, about whether they are going to meaningfully benefit from a breakthrough like that. And usually when that happens, that's where the magic shows up.
Yeah. What are the next model frontiers that you think about? Where you're like, oh, it'd be amazing when x or y happens?
Oh, man. There are many of them. From a customer, business use case perspective, I think once we crack continuous learning, that will probably be a very, very meaningful change. I think of it very simply. I think of an agent as: hey, I'm hiring an intern.
On day one, they have a bunch of academic knowledge. They don't have a ton of practical knowledge on the job, because I have never documented everything I do; you have to learn on the job, essentially. And so I think the relationship with agents is very much going to be based on feedback, annotations: hey, you did that, or you responded that way, and you should do it slightly differently this time, and the model incorporates that over time.
At the moment, we are achieving that through prompting and that sort of stuff. But once the model is able to actually update its weights based on human feedback, inference-time human feedback, I think that's going to unlock quite a few use cases. Coding, customer experience, finance, yeah, it's gonna be all over the place.
Basically, the agent shows up as an intern each morning but with better and better instructions, and so...
Exactly.
It would be nice if it learned.
Based upon that, and, you know, it's smarter as a result.
Yeah. I'm curious, from my seat it feels like, if you'd asked me a year ago which categories really had product-market fit in AI, I would have said coding, customer support, maybe healthcare and legal too as applications. And if you ask me today, I still feel like those four are kind of the dominant ones. There have been some interesting voice applications as well across domains. But you see this all the time on the ground with enterprises.
I guess you'd add life sciences. Anything else that, as you categorize things into insane product-market fit, kind-of product-market fit, and still early, you feel goes in the insane product-market fit bucket?
Yeah. Sometimes I remind my team, when we talk about coding, customer experience, finance, that those are gigantic markets. A hundred percent. The market for coding, the market for software, I mean, I was chatting with the team about it.
We have no idea how big it is. We've never had such cheap...
Like, how big is the TAM of software, you know? What we know for sure is that there is a shortage of software engineers, so it's probably more than the current pay of software engineers. Probably way more than that. And so, frankly, the way we think about it: I would say the first few years post GPT-3, we were very much spraying and trying to see what sticks. Now I think we have a much better picture of the industries, use cases, and domains where we think, one, there is massive customer demand,
and number two, the models are going to keep improving and make us more efficient, more effective, in that market. And so my current philosophy, frankly, is to try to double down more on those markets. Of course, we expand all the time. We talked about science; I don't know what we'll do, but we should probably do something on science given the signal we're seeing. But, yeah, on coding, we can push it much further.
Customer support: okay, you automated tier-one tickets. What does it mean to go much further? What does it mean to turn customer support into an actual revenue maker for the company? To have it be way more personalized, that sort of stuff. So I would say similar domains, but going deeper and deeper.
Yeah. So you think a year from now we'll have kind of a similar list of stuff?
I'm sure there'll be new domains. If you had asked me a year ago, I'm not sure life sciences, pharma, healthcare was in there. Yeah. Now I can see it. And some of it is not just model capabilities.
It's pure software and change management. Totally. Those enterprises have massive systems, lots of employees. And so having AI be adopted and taught, essentially, to work really well for those use cases is a lot of work. And so to your question, let's say we froze research for a few years.
Yeah. There would be many years of enterprise adoption that would still be quite valuable.
And it feels like in some of these industries, you just reach a tipping point where enough people are using it that it starts to become irresponsible not to. I feel like even in customer support, people were dipping their toes in, and then some people had huge success with it, and now it's become one of those things where, if you're not trying one of these solutions, what are you even doing?
Of course. Of course. I mean, that's what we see with every market. And frankly, there are a few pioneers, enterprises, startups, companies, essentially, who are willing to take the risks.
It feels shaky at first, it requires a lot of scaffolding, the thing barely stands. But at some point you can see the thing working, you have it, and then everyone follows, essentially. So, yeah, we see the same motion across basically every market.
Does it feel like the scaffolding and the harnesses that people are building are pretty similar across use cases? Like a common set of patterns for scaffolding in life sciences and support and coding? Or how bespoke to the end problem is the scaffolding?
I would say it's fairly bespoke at the moment, which is a challenge, frankly. The way I would put it is, people are trying to make it work, whatever it takes. One agent, multiple agents, some deterministic gates in between; people are trying many different things. I mean, we're starting to see it, but there hasn't been a really standard agent architecture or runtime which has been adopted across industries.
That's something that we are actually working on.
Why do you think there hasn't been one? I mean, it's only been three years.
Yeah. You see what I mean? It took us... I mean, I've been at this for two and a half years. The first year I was just trying to keep up and be like, okay, what the heck is going on? Year two was like, okay, let's sit down: what can we achieve? What is working, what's not working?
And now I think we're starting to converge. Again, we have a good idea of customer problems and of what is being done. Okay, what are the pieces in the stack that, if they were standardized, would meaningfully accelerate adoption? So my sort of boring answer is, frankly, just time.
But how are you thinking about that? What might that standard set of scaffolding look like?
I think something that we're seeing is that code and coding are a much more general-purpose capability than just software engineering. The models are really, really good at generating and writing code, and those things...
Well, it's on an insane progress path too, so it seems like a great thing to bet on.
Exactly. Writing a script, executing that thing in a tool, in a shell, getting the output back. So I think the whole effort towards basically giving agents access to a computer is probably going to become a standard in the industry. That's one. On data and API connections, I think MCP has been a real standard.
I expect the industry to keep normalizing around it. On agent-to-agent communication, there hasn't been, frankly, a true breakthrough or a true standard that we see being actually used yet. Evaluation: I think we're getting much better at generating traces, evaluating them, and trying to infer improvements from those traces. So it's a bit of a game of inches.
Yeah. It sounds like, behind what you're saying, moving more of this stuff to code feels very helpful because it's the vector on which these models are improving at the most rapid pace. So any part of your scaffolding that you can move to that is probably quite helpful.
Exactly. It's a bit like a human, frankly. If you ask me to do some of my work without a laptop, I'm not useless, but I'm not far from useless. Give me a laptop with Internet, a shell, an IDE, the good stuff, and your capabilities are way, way larger. And so, yeah, to the extent we can replicate that setup with an agent, with a model, I think we are going to see a bunch of progress. And to your point, every bit of progress we make on the core coding capabilities of the models is going to have orders of magnitude more impact as a result.
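As a toy illustration of the "give the model a computer" pattern he describes, the snippet below is one way to expose a script-execution tool: the model proposes a script, the harness runs it and returns the output so the model can iterate. The sandboxing and the agent loop around it are omitted; nothing here is OpenAI's actual implementation.

```python
# Toy "computer access" tool: execute a model-written Python script and
# return its output so the agent can iterate. Sandboxing is omitted here.
import subprocess, sys, tempfile

def run_python(script: str, timeout_s: int = 10) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True,
                            text=True, timeout=timeout_s)
    return result.stdout + result.stderr

# An agent loop would register run_python as a tool and feed the output
# back into the next model call until the task is solved.
print(run_python("print(sum(range(10)))"))  # -> 45
```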
I guess it seems like when these models come out, everyone's in use case exploration mode, just finding all the things that the models can do. I'm wondering, does it feel like cost is a limiting factor today on some of these? I've heard you say that before, and I was curious: where does it feel like there are interesting use cases at one price point but maybe not at another?
If I step back, the reduction of prices, of cost, that we've seen over the past two or three years is, I think, basically unheard of in technology. We reduced the cost of GPT-4-level queries by one to two orders of magnitude in two or three years. And without sacrificing margin or anything; it's purely compressing the model size, having better hardware, being better at networking GPUs together. Every layer of the stack.
What I see, talking with the teams here who are working on each layer of the stack, is that there is still a ton of room for optimization and improvement. So we are not going to stop there. And then what I see on the business side is that for some very high-stakes use cases, say coding, the economics work.
Yeah. You know? It's so high-leverage to multiply the productivity of your engineers by two. That's for sure.
Paying dozens, hundreds of dollars every month could be worth it. But then there are many other use cases which are currently being blocked. If I were to take a very trivial one: models are pretty good at personalization; they can understand intent and adjust the content. So why isn't every content website in the world, every homepage, infused with LLMs?
Yeah. I think I know the answer: cost and, probably, latency. And so I very much see it as part of the OpenAI mission, frankly. Yeah.
To keep driving the cost down. What I've seen is that, over the past two years, I think we've done dozens and dozens of cost cuts. Every time, the increase in the base, in the volume, is larger than the price effect on revenue, essentially. And so that tells me that there is still wild untapped demand, which is basically limited by cost.
Yeah. I'm really excited to see it play out with these newer models too. With the API, it's clear there's going to be a massive cost reduction over time in those models, and I think people will find all sorts of cool things to build.
Exactly. Exactly. Yeah. If you look at some of our most successful agents at the moment, they run for hours, or tens and tens of minutes. That can get pretty expensive quickly.
And so if you truly wanna move to a world where each of us has 1,000 agents running in parallel, async, in the background pretty much all the time, we need, as an industry, to basically bring down the cost at every layer of the stack. And I have good conviction we'll get there.
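To make the "long-running agents get expensive" point concrete, here is a back-of-the-envelope estimate. All prices and token counts below are made-up illustrative assumptions, not OpenAI pricing.

```python
# Back-of-the-envelope agent cost estimate. Prices and token counts are
# illustrative assumptions only.
input_price_per_1m = 2.00       # USD per 1M input tokens (assumed)
output_price_per_1m = 8.00      # USD per 1M output tokens (assumed)

steps = 40                      # tool-call iterations in a ~30 minute run (assumed)
input_tokens_per_step = 20_000  # growing context re-sent each step (assumed)
output_tokens_per_step = 1_500  # reasoning + tool calls per step (assumed)

cost = (steps * input_tokens_per_step / 1e6) * input_price_per_1m \
     + (steps * output_tokens_per_step / 1e6) * output_price_per_1m
print(f"~${cost:.2f} per run")  # ~$2.08; at 1,000 parallel agents, ~$2,080 per pass
```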
Yeah. What about RFT? I feel like it's very in the zeitgeist these days. A lot of people seem to think there's been a real step change in the efficacy of tweaking these models versus maybe the SFT paradigm. You work closely with enterprises.
Are you seeing folks start to use this, and where are we in the journey of more and more people actually tweaking these models for their use cases?
We're starting to see it. It's not widely adopted yet. I mean, the way I think about the enterprise opportunity at the moment is that most of the market is catching up to the frontier. To our discussion earlier, they haven't yet fully leveraged the capabilities of GPT-5.1 base. So that's the gist of the market.
And then you have a few innovators who are clearly blocked at the frontier and know exactly what they need to get there. A couple of examples with RFT: I was working with an accounting software firm, essentially, to do extremely accurate tax accounting analysis. And the model was not quite good enough, and too slow, essentially, out of the box. And it took, I don't know, something like a few dozen samples in really high-quality environments with graders to improve the model by 20 or 30% on their own gold-standard eval.
Which was essentially the diff that allowed them to get from not really valuable to actually valuable. And so we're starting to see some innovators who are identifying those things, trying everything they can short of RFT to make it work and not succeeding, and therefore getting a real unlock with RFT. With that said, it's still a lot of work. You have to build really high-quality environments.
You have to really make sure that your grader is good. Once you kick off a reinforcement learning job, it can take hours, sometimes days. It's more heavy-handed. We released the RFT API, which I believe was probably the first one on the market for that. So we're seeing quite a bit of excitement.
But it's clearly not yet a mass market. Maybe it never will be, frankly, but that's fine. As long as we're pushing the frontier with some customers, I'm happy about it. Yeah.
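The grader he keeps coming back to is the piece most teams underestimate. Purely as an illustration of the idea (the actual RFT grader format is not described in the conversation), a minimal grader scores a model answer against a reference on a 0-to-1 scale, and that score is what the reinforcement step optimizes. The tax-accounting fields below are hypothetical.

```python
# Illustrative grader for a reinforcement fine-tuning style setup: score a
# model's answer against a reference on a 0-1 scale. Format is hypothetical.
def grade_tax_answer(model_answer: str, reference: dict) -> float:
    score = 0.0
    if reference["expected_code"] in model_answer:
        score += 0.7                      # correct tax code citation
    if reference["expected_treatment"] in model_answer:
        score += 0.3                      # correct treatment of the expense
    return score

sample = {"expected_code": "Section 174",
          "expected_treatment": "5-year amortization"}
print(grade_tax_answer(
    "R&D costs fall under Section 174 with 5-year amortization.", sample))  # 1.0
```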
Do you think over time most people end up using it? Or can you kind of just tie yourself to the general model improvements, and who needs to push the frontier six or twelve months ahead of that for most enterprises?
I suspect that most enterprises will have some RFT use cases, but I suspect those RFT use cases are not going to be the gist of their business. They will wanna innovate on some aspect of the model, but for the massive stuff, automating your operations, maintaining and updating your knowledge base, I expect the base model will be pretty good out of the box.
What about on the startup side? Because for the longest time, it seemed really silly to train your own model or spend too much time even on fine-tuning, right? It didn't help that much. Now it does seem like RFT can push performance. For a while I would have said, hey, most AI startups need people who know how to use these models well, but maybe not people who are super deep in training them.
Yeah. Do you think that's changed in this paradigm?
It's a good question. I like to remind startups and people of what we do at OpenAI. We post-train a lot of models. We fine-tune a lot of the models. So, of course, I like to think of us as doing a really good job, and the models are post-trained well for most use cases.
But of course, there are some use cases where the model will be less well trained. It's going to behave not exactly the right way. The style, the formatting, the tone, the conciseness will not be quite it. And so I expect startups who are achieving a certain scale, or wanna get to the next level of capability, will continue to fine-tune. It is a lot of work.
Maybe at some point we'll be able to achieve that continuous learning, or some sort of automated fine-tuning. We're not quite there yet. But, yeah, I do expect that a fraction of startups will continue to use fine-tuning to push the envelope.
Yeah. As you think about the choices developers make in the models they use, obviously there's the overall quality of these models. What else do you think will ultimately drive where startups and developers choose to build?
Yeah. I like to think of it as three big buckets. One is clearly model capabilities and behaviors; we'll decompose that. Number two is cost and latency.
Number three is vibes, trends, Twitter, essentially. And, you know, that's becoming more and more important.
That is like the eval these days, right?
Exactly.
Even Gemini 3 comes out today and I'm like, the benchmark numbers all seem good, but let's see what Twitter says.
It's really hard to keep up, frankly, with everything. And so, yeah, I'm starting to see more and more startups and enterprises rely way more on specific influencers or media. But anyway, capabilities and behavior, for sure, are the biggest bucket.
The trend I'm seeing pretty much all the time is that people are not prematurely optimizing for cost and latency. The first goal is to make it work. And then, hopefully, to make it work cheaply and fast enough. The other trend I'm seeing here is that at some point, there is only so much information you can cram into academic benchmarks.
If I'm building a customer support agent, sure, SWE-bench or MMLU probably gives me some indication, but frankly, probably not that much. We're starting to see some interesting industry-level benchmarks. Tau-bench is a good one in the services industry. We are trying to build more and more of those; we released GDPval a few months ago, which was essentially meant to analyze model capabilities on real-world economic tasks.
But I would say here the industry is probably in catch-up mode, and startups do not have the luxury of waiting for evals to come up. And so I'm seeing a lot of, frankly, qualitative testing, which is pretty interesting. I'm starting to see, among these startups, some people who have such a good taste for the nuances of models. Just like people who are really good at writing or painting: they're not necessarily able to elaborate why, or the framework with which they think.
But they have that sense. I'm starting to see the same happen with models, which is pretty cool. Cost and latency: pretty important. Cost, my expectation is, is going to continue to be reduced by multiples over the next year or so. And then you have, yeah, Twitter vibes, I would say.
It's funny. I feel in a way we are reinventing why Gartner and others exist, which is that at some point you cannot compare all the accounting software in the world; you have to trust someone. And so, yeah, I think we're getting to that stage, and I think we'll stay there.
I like that: Twitter's like a decentralized Gartner.
In a way, yeah. That's a good way to explain it.
One thing I thought was really interesting, and you'll have to forgive me for referring to it, I think it was in the context of Anthropic models, but Cognition talked about, when Sonnet 4.5 came out, I think it was, how they had to move everything over, and it actually required a ton of net new work from them. And I'm wondering, you work with all these enterprises, and then you have a new model like 5 or 5.1 come out. What does that process actually look like? And how do you imagine it changing or evolving over time?
That's part of the model fatigue, I think, which is that the days where you could just hot-swap one API parameter from one model to the next are basically gone for non-trivial use cases. The idiosyncrasies of the models, especially among different providers, are getting more and more distinct, in a way. Some models respond better to certain types of instructions. Some models have been pre-trained, or post-trained, sorry, for specific tool signatures, specific tool names. Some models handle very long context recall more or less well.
And so you have to essentially understand all the quirks of the models and then adjust your prompt, your harness, accordingly. And it's a lot of work. And even among the most sophisticated startups, what we see is that doing it every time, and doing it accurately, is hard. It takes a lot of work, and so they would rather not do it unless there's a meaningful change. And you can imagine, in the enterprise, the primary job is not to do implementation.
The primary job is to, I don't know, design drugs or sell cell phones. They are very much eager, I think, to move to a much more regular cadence with clear changelogs. In a way, I feel we are sort of reinventing, rediscovering how we deploy software. Which is, if I drop a new binary file on you every day, good luck. You're like, cool.
Okay, I have questions about it. Versus: okay, it's v1.1. Okay.
It's a major version. Here is the changelog, and you can expect one in three months. People just want predictability and transparency.
Do you think that over time the scaffolding evolves to be more generalizable, such that it can better absorb changes in these models? Or is it always going to be: hey, you're going to have to take that week-long sprint to figure out what changed and what you need to adapt?
I think as we standardize that agent architecture, it has to become more generalized, in a way. I think one of the reasons why there are so many discrepancies across different models at the moment is that each lab is training its models for a different harness, different purposes, different use cases, essentially. And there is not yet a single common architecture or framework to follow. So my hope is that, yeah, at some point the industry will converge on a much more universal agent framework, tool definitions, something like MCP essentially, but across every single dimension of the agent. And that will make it easier for customers to compare different models, to adopt new ones, to adopt multiple models for different use cases.
Do you have any advice for builders? What do the really good top teams do when they're experimenting with a new model and basically getting ready to shift everything to that new model? Any lessons for the rest of the world?
Frankly, they take the time. What I've seen is teams that are way too impatient and just want to do, call it, the hot swap of the model parameter name. And then: oh shoot, that model sucks, it doesn't work as well.
So the best teams, I would say, have strong taste but come with an open mind. They take the time to battle-test the model. They take the time to work with us as well. Thankfully, we control the post-training, and so if some team is really, really hung up on having a specific tool behave in a specific way, we can influence that. And so the more specific feedback we receive, with specific examples, the more we can actually tweak the next snapshots of the models for that.
So, usually, that's what happens. Yeah.
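In practice, "take the time to battle-test the model" usually means a regression gate: run the same golden set against the current snapshot and the candidate, and only migrate when the candidate is at least as good. A hedged sketch; the model names, the trivial substring grader, and the golden-set content are placeholders, not a prescribed process.

```python
# Regression gate for a model upgrade: compare pass rates on a fixed golden
# set before switching snapshots. Model names and the grader are placeholders.
from openai import OpenAI

client = OpenAI()

def pass_rate(model: str, golden_set: list[dict]) -> float:
    hits = 0
    for case in golden_set:
        out = client.responses.create(model=model, input=case["prompt"])
        hits += case["expected"] in out.output_text   # naive substring grader
    return hits / len(golden_set)

golden_set = [
    {"prompt": "Our refund window is 14 days. Summarize the policy for a customer.",
     "expected": "14 days"},   # hypothetical golden case
]

current, candidate = "gpt-5", "gpt-5.1"   # assumed snapshot names
if pass_rate(candidate, golden_set) >= pass_rate(current, golden_set):
    print("candidate passes the gate; safe to plan the migration")
```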
We haven't hit on voice yet. It feels like that's one thing that's just changed so much; there are now so many people building with the Realtime API. How do you think about the next frontiers there? Where are you seeing a bunch of fit, and where are you like, we still need to do x, y, or z to unlock the next set?
So it's interesting. GPT-4o, which came out I think in May or June, so a little more than a year ago, was to me probably the second breakthrough in terms of feeling the AGI, after ChatGPT. I was like, man, the world will not be the same. Having a model able to express that range of tone and emotion, and to understand such tone and emotion from the human as well, felt like a real milestone.
With that said, I think we clearly haven't crossed the Turing test yet for voice. For text, at this point, I'm pretty sure I wouldn't be able, frankly, to distinguish between a human and a bot. On voice, it still feels like the interruptions, the cadence, are still not quite exactly it.
And so I think that's probably the next frontier on voice. Which is: hey, make the models more intelligent? Cool. Okay. We'll do that.
But second, get to a point of naturalness and expressiveness where, basically, provided the same level of intelligence, you're as okay to be served by AI as by a human. I think we have a line of sight to get there. And so I expect that's going to be the breakthrough that really crosses the chasm.
We are starting to see deployments. I come back to customer support because that's one of the main voice use cases in the world. We're starting to see meaningful deployments among tier-one customer support calls, which is good for a couple of reasons. Number one is that the model is infinitely patient. And so, for people who need more time to get to their answer, the model can take five minutes, ten minutes, as much as you need.
The second thing, which I did not quite realize, is how critical multilingual capabilities are for customer support. Because, say you're a big retailer in the US. Most of your customers speak English. Others will speak Spanish, Chinese.
There's a long tail of languages. And if you have to staff customer support agents for every language, where do you find them? You're not gonna make it, essentially. And so you have to make a really hard trade-off and literally lose customers as a result. And so we've been seeing pretty strong results and reception from customers on those multilingual capabilities.
Yeah. We were talking before about how Codex has kind of been the main character of the ecosystem these past months, and obviously there's been just tremendous improvement there. How do you see that space playing out? And besides just the underlying capabilities of the models, what do you think will ultimately determine whether folks reach for Codex or Claude Code or one of these other tools going forward?
It's a good question. The Codex team is incredible. To me, it's the sort of epitome of a really small, talented team singularly focused on the use case and cranking on whatever it takes: model, harness, integration, data. So they're really, really good.
My read on how the software engineering space is going to evolve: at the moment, the models are really good at generating code and understanding code. They could be better, but they've made meaningful progress. But when I think about software engineers, that's only part of the job of a software engineer, right?
Another part of the job is to be on call. Another part of the job is to communicate with your teammates, to scope some changes, to make some tough architectural decisions, to deprecate some APIs. There is much more to it. And so I think, on top of improving the model capabilities on writing and understanding code, that second axis of collaboration
is, essentially, probably a major unlock to spread the benefits of AI more broadly. So that's probably one. A second one, frankly, is maybe trivial, but it's just for this to gain adoption in the enterprise. I talk to so many enterprises who are still stuck on GitHub Copilot v1, essentially, because they've never gone through all the security and process work to make sure that these agents can be used properly in large code bases.
So yeah, my thinking...
Do you think we're close to enterprises being like, alright, it's time for these agentic coding tools? Or does it feel like, god, there's three years of hurdles on the security or compliance side?
I think I'm starting to see a critical mass of enterprises who are truly leaning in and provisioning hundreds, thousands of licenses for software engineers, letting them experiment and iterate on some specific use cases. Yeah. My gut is 2025. In a way, like you mentioned, 2024 was the year of coding. I think 2025 is the year of coding in the enterprise.
Like, I'm starting to see, like, meaningful adoption. Yeah.
Yeah. What's 2026?
I don't know, man. PMs, you know, maybe an AI PM, a two-minute PM. I don't know. Yeah. My bet would be, one, the models are more reliable, essentially, at writing and understanding code better.
And second, the models are getting really good at collaboration. And so you start to have more of a multidimensional AI software engineer that you can work with.
How much of the improvement in collaboration is just models getting better versus the harnesses you put around them?
The two. My assessment is that it's getting harder and harder to detangle what is the model versus the harness. What I see is that some of the best agents out there are trained... the models are being trained for a specific harness.
I think that's why Codex is so good, frankly. So I think of them more and more as a symbiosis or something.
I guess it goes back to this question I was asking earlier about whether startups will train their own models. I wonder whether, if there's a standard set of harnesses, they can leverage your models, or whether, if they have their own harness, they'll ultimately need to train very specifically on that harness.
So that's what we try to do as much as we can, which is we open-source the Codex harness, essentially. We open-source the actual code on GitHub. We open-source the tool definitions for people to be able to fully utilize Codex's abilities. And today, if you wanted to use Codex in Cursor or any other IDE, you could. So I think that's how the industry is going to evolve.
More providers are going to move from being just model inference APIs to providing both the model and the harness, maybe some UI.
Like the reference design for the harness, basically, and then everyone else can...
Exactly. A sort of standard architecture, a sort of standard blueprint, essentially, to use the model to the best of its capabilities. If I step back, that's probably to me the biggest learning of the past three years of building for developers and businesses: hey, you can't just drop new models in an API. It's gonna be really hard for people to maximally utilize the model capabilities
unless you give them more of a blueprint, more documentation, or a specific harness. It's hard to discover. Models are so beautiful and weird, in some ways. Unless you have a really good recipe or a ton of experience interacting with models, it's hard to massively leverage them.
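The "blueprint" he's describing shows up concretely as published tool definitions the model was trained against. As a generic illustration (not the actual Codex schema), a shell tool in the JSON-schema style that most function-calling APIs have converged on might look like this:

```python
# Generic illustration of a published tool definition in the JSON-schema
# style most providers converged on. Not the actual Codex tool schema.
shell_tool = {
    "type": "function",
    "name": "run_shell",
    "description": "Execute a shell command in the workspace and return stdout/stderr.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {"type": "string", "description": "Command to execute"},
            "timeout_s": {"type": "integer", "description": "Kill after this many seconds"},
        },
        "required": ["command"],
    },
}
# Shipping definitions like this alongside the model is the "reference harness"
# idea: everyone exposes the tool the same way the model was trained on.
```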
And do you think that in the future your average enterprise will be able to directly interact with the models and these harnesses themselves? It seems like there's this whole set of applications that are basically translation layers: here's the capability of the models, here's your end industry, and we sit in between.
My gut is enterprises are mostly going to buy harnesses. They are mostly going to buy solutions. I think there will be some exceptions. If you're talking about the use case which is the core business of the enterprise, there are some reasons, frankly, to build your own harness. But if you're a retailer and you have to operate sales, finance, IT, then frankly, just like you buy software, why wouldn't you? I think people tend to underestimate the amount of effort it takes to do anything, frankly, at a really great level of quality.
The same will be true, or even more true, for agents. And so, yeah, my bet would be buy versus build for most use cases.
Yeah. It'll be fascinating to see whether the harnesses from the labs converge, or whether they actually end up looking quite different and, as a result, you kind of have to figure out which harness ecosystem you wanna play in.
I don't know. It's a fun game of divergence and convergence. I don't know. Maybe I'm being naive, but I do expect convergence will win in the long term.
But, you know, in science and research you have to let flowers bloom, essentially, to then figure out which one is the most promising and, yeah, get behind it.
Yeah. I mean, you obviously work with so many different kinds of enterprises. When you go into a net-new enterprise, maybe one that's a little newer to the game, do you have a cheat sheet of the few things they really should do or know, things you take from your most complex customers? If you could distill or share some lessons from the most sophisticated ones, what would they be?
Totally. At this point I think I've worked with, I don't know, 200-plus enterprises. We talked about T-Mobile, Amgen, Salesforce, BNY, and many others. My cheat sheet: there are many, many tips and tricks. The first one is a classic of enterprise software, but if your data is a mess, you'll be able to achieve nothing.
I could give you the most powerful coding agent, but if you're not able to plug that coding agent into the right code bases, the right identity and permissions, the right databases, it's gonna be really hard for the agent to be useful. And so, frankly, a lot of the work is just explaining: hey, how do you structure your data if there are no APIs, no services; how do you stand up the right services; how do you use MCP or not use MCP; how do you authenticate those requests; how do you log those requests. So there is a lot of data engineering work to do, if you will. That would be step one.
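A lot of that "step one" is plumbing: put the data behind an authenticated service and log every request so the agent's access is auditable. A minimal sketch of such a wrapper; the endpoint layout and environment-variable names are hypothetical.

```python
# Minimal sketch of "step one" plumbing: an authenticated, logged data-access
# layer the agent calls instead of touching raw systems. Endpoint and env var
# names are hypothetical.
import logging, os
import requests

logging.basicConfig(filename="agent_data_access.log", level=logging.INFO)

def fetch_record(service: str, record_id: str) -> dict:
    base_url = os.environ["INTERNAL_API_BASE"]     # e.g. https://api.internal.example
    token = os.environ["AGENT_SERVICE_TOKEN"]      # scoped, least-privilege credential
    resp = requests.get(
        f"{base_url}/{service}/records/{record_id}",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    logging.info("service=%s record=%s status=%s", service, record_id, resp.status_code)
    resp.raise_for_status()
    return resp.json()
```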
Step two would be to really explain to teams how to evaluate models. I was talking earlier about people vibe-evaluating, which is necessary to some degree. But at some point, if you have a production-grade use case, you have to evaluate it. And rigorous evaluation doesn't come naturally to most teams unless they've done it before. And I always tell them: at OpenAI, the teams that train models are extremely important, and we also have teams that evaluate models full time, and they are probably equally important, frankly. So: spend a lot of time documenting golden sets, documenting the procedures, the SOPs, the standard operating procedures, and then run rigorous evaluations in order to make sure that you are hill-climbing in the right direction. That would be number two.
What's the most common mistake you see on the eval side?
Too few or too many evals. Too few: you've done only five eval sets, and the first customer comes in and it's completely out of distribution. You're like, okay, what did I learn? I don't know.
I mean, one of my biggest findings working with enterprises is that most of the knowledge is in people's brains. You usually come in thinking, for customer support, I'm sure there is a Jira or a Confluence somewhere with every procedure written down, et cetera. That never happens. You're lucky if you have 20 to 30% of the procedures written down.
The rest is: oh, you know, Sarah and Mark know it really well, you should talk to them. And so, as a result, building really good evals is not so much about converting text into evals.
Yeah. It's finding...
It's finding the right people, finding the Sarah and the John, essentially, the ones who know about it. And so it's a fairly iterative process. You don't do it in one day and then you're done. You start to do it, you ship the thing, you're like, okay, that thing is not quite working; let me talk to John, why is that not working?
And then you build it really well.
I know I cut you off. Was there a third one on the things that are most common?
Yeah, the third one is on the management side. This technology is new to all of us. Even I am sometimes weirded out by how powerful it is, and sometimes by how uneven it is. So taking the time to explain to teams, to customers, how it works: I cannot emphasize enough how critical that is.
Have you seen enterprises do anything cool with the Sora API yet? I know it's only been out for a little bit.
Yes, actually. We're seeing quite a bit of energy in two industries in particular. One is ads and content generation, where people are creating some crazy personalized content, which is really fun. The second one is more the studios, the production companies. That one's really interesting, actually.
I learned a lot about what it takes to build really good movies. Being able to just show someone in thirty seconds what you have in mind, the imagery in your head, is apparently quite useful; it lets teams brainstorm way more. So yeah, that's a fun use case. But video generation, I think, is still in the early innings.
Frankly, video generation today is expensive and slow, but you can start to see how it's going to truly transform a lot of use cases and how jobs are performed. So I'm quite excited to see what's next.
That's awesome. Well, I always like to end my interviews with a standard set of quick-fire questions, where we get your thoughts on some overly broad questions that we stuff in at the end.
Yeah.
And so maybe to start, what's one thing that you think is overhyped and underhyped in the AI world today?
Oh shoot, I did not prepare for that one. Overhyped, underhyped. Underhyped, I keep coming back to science, drug design, drug discovery. I think we tend to talk less about it because, for those of us who've been working in software for a while, it doesn't come super naturally as a use case. But at the end of the day, if you look at the arc of history, that's the substrate of progress.
Totally. And so if we're able to accelerate the rate of discovery even by 5%, the implications for the rest of the economy and technology are just humongous. So yes: making models and harnesses extremely good for scientists.
Are you going to have to build a lab to do that? I mean, it feels like to really build a bio model, you need some sort of feedback loop of experiments, right?
Lab, data, and people who are at the intersection of LLMs and their field. So it's hard, but it probably has the highest compounding benefits.
What's one thing you've changed your mind on in AI in the last year?
That the model is everything. A couple of reasons. Number one, just like we discussed, the harnesses are getting extremely important as well, and it's getting harder and harder to disentangle the two.
But I do expect those extremely powerful agents will have extremely good harnesses, which are evolving even faster than the models are. That's important. The other thing is that in the enterprise, the whole goal of the model is to take in more and more data in order to produce outputs. And if you don't feed it high-quality data, the model's output will not be that good.
So I think equally important for AI to be adopted widely in the enterprise is a set of standard infrastructure and frameworks to present the right data to the model at the right time. Once we crack that as an industry, we're probably going to see enterprise adoption really scale.
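As a sketch of "the right data at the right time", the snippet below assembles context from a hypothetical internal knowledge store before each model call, again assuming the `openai` Python SDK and a placeholder model name; a real deployment would replace the naive keyword lookup with the enterprise's own search and permissioning layer.

```python
# Illustrative context-assembly step that runs before every model call.
from openai import OpenAI

client = OpenAI()

# Hypothetical internal documents the model should not have to guess about.
KNOWLEDGE = {
    "refund policy": "Refunds are allowed within 30 days of purchase.",
    "shipping policy": "Standard shipping takes 3-5 business days.",
}

def assemble_context(query: str) -> str:
    """Pick the documents relevant to this query (here: naive keyword overlap)."""
    hits = [text for key, text in KNOWLEDGE.items()
            if any(word in query.lower() for word in key.split())]
    return "\n".join(hits) or "No internal documents matched."

def answer(query: str, model: str = "gpt-5.1") -> str:  # model name assumed
    context = assemble_context(query)
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer using only this internal context:\n" + context},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(answer("Can I still get a refund after six weeks?"))
```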
Do you think most of the advances in harnesses will happen within the model companies, because they're close to the models that are post-trained on these things? Or do you think it will happen in the startup world?
I think we'll see a bunch of adoption in the startup world as well. As a result of open-sourcing the harnesses and essentially giving people the recipe to best utilize the models, people are going to innovate. They're going to find quirks of the model, ways to tweak the harness definition to get better results out of it. So yeah, I do expect a fair bit of innovation.
From the outside, it feels like everything just always goes right at OpenAI. Looking back on your last two and a half years, is there anything where you're like, oh, we got that thing pretty wrong?
Oh, man. Where do I start? I think the world has been pretty forgiving of OpenAI because we are first on many things, so people have a higher tolerance, a certain flexibility. What have we gotten wrong?
I mean, there are plenty of product features and tools that we shipped that did not find product-market fit, or did not utilize the models as well as we could have. We had some definitions of what an agent was back in 2023 that were probably too early and didn't catch on, and so we had to sunset some APIs. We also invested a bunch in different kinds of audio technologies that did not really pan out, frankly. At the end of the day, history remembers the successes.
But there's a fair amount, a healthy amount, of failures and experiments that didn't pan out along the way.
Yeah. What do you think of Gemini 3?
I haven't played with it yet. The benchmarks are clearly good. I saw a couple of examples, visual examples, that look really strong. So it seems like the Google team really cooked up a great model. But yeah, I cannot wait to actually test it myself.
Yeah. What do you do to test? Like, are you gonna...
I have two or three different kinds of tests I do myself. Number one is usually on style, tone, personality essentially. I have some personal questions I like to ask by dumping a lot of context into the model. That's one. Number two is the classic: generating an app from scratch, then looking at the front end and testing it a bit.
Number three is that I love to test very long-horizon capabilities: dumping literally hundreds of thousands of tokens in there and asking a very hard question about, you know, the second-to-last token in there.
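The long-horizon test described here is roughly a needle-in-a-haystack probe. A sketch, assuming the `openai` Python SDK, a placeholder model name, and an arbitrary amount of filler text, might look like this:

```python
# Bury one distinctive fact near the very end of a huge prompt, then ask about it.
from openai import OpenAI

client = OpenAI()

FILLER_LINE = "The warehouse logs show nothing unusual for this hour.\n"
NEEDLE = "Note: the backup generator serial number is QX-4417.\n"

def build_haystack(filler_lines: int = 5000) -> str:
    """Repetitive context with the needle as the second-to-last line."""
    return FILLER_LINE * filler_lines + NEEDLE + FILLER_LINE

def long_context_probe(model: str = "gpt-5.1") -> str:  # model name assumed
    prompt = build_haystack() + "\nWhat is the backup generator serial number?"
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(long_context_probe())
```

Scaling `filler_lines` up pushes the prompt toward the hundreds-of-thousands-of-tokens range mentioned here, within whatever context window the model actually supports.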
That's awesome. Well, this has been a fascinating conversation. I want to make sure to leave the last word to you. The mic is yours: where can folks go to learn more about you, or about anything OpenAI has shipped that you want to point them to? I'll leave it to you.
We tweet a lot. Well, maybe not enough, we should probably tweet more, but yeah, on Twitter, at OpenAI.
I'm also on Twitter, and I respond to everyone. Any feedback, any feature requests, just tweet at me and I'll be happy to connect.
Amazing. Well, thanks so much.
This was a ton of fun. Thanks so much.
Ep 79: OpenAI's Head of Product on How the Best Teams Build, Ship and Scale AI Products