This is a two-part episode. The first ~30m covers the most important 2025 breakthroughs in polygenic embryo screening, while the second 30m focuses specifically on AI capabilities at the frontier of h...
Welcome to the year-end 2025 episode of Manifold. Today, I'm going to discuss two scientific topics. One is polygenic prediction and embryo selection, and the other is AI, or generative AI. I'm gonna cover some interesting and dramatic changes that have happened in these two areas in 2025. I'm gonna break this into two parts and try to aim for about thirty minutes for each one.
I will start with genomics. I'm going to have some slides on the screen, but as usual, I will try to make the episode as understandable as possible for people who are only getting the audio and can't see what I have on my screen. So let's dive right in and talk genomics. I haven't actually covered this very much on the podcast in a while, largely because the situation has not really been changing very rapidly.
The field has sort of reached a fairly mature level of sophistication and is largely data limited. That is, our ability to predict phenotype from genotype is now limited primarily by data. We have good algorithms, we have plenty of compute, and it's just a matter of accumulating enough data to push things forward. But some interesting things happened in 2025 that I wanna go over.
Now on the screen, what I have is an illustration from the Journal of the American Medical Association, JAMA. This is a paper about a very big study, almost 30,000 women. And from the perspective of polygenic prediction or polygenic risk scores, one could view this as a validation of the predictive power of polygenic scores specifically for breast cancer. So in this study, an RCT-type study, they assigned women into a group for whom screening was allocated based on the estimated risk that the individual woman had for breast cancer. And the biggest driver of that estimate is the polygenic score for that woman.
So they have access to the DNA of an individual, and then they compute the polygenic risk score, which depends on about 100 different loci, or individual SNPs, in the genome. They then put that woman into a high-risk, medium-risk, or low-risk category. Based on that, some of the women were allocated more resources, things like mammograms and biopsies, and the people who are low risk were allocated fewer resources in terms of additional screening. And then there was a control group for whom, rather than risk-allocated screening, they just did annual screening. So the second group is treated more or less the way women are treated currently under the standard of care, and the other population is a group in which the way those women are treated is DNA informed.
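For readers who want the arithmetic spelled out, here is a minimal Python sketch of how a polygenic risk score of this general type is formed: a weighted sum of risk-allele counts over roughly 100 SNPs, then standardization and binning into risk categories. The effect sizes, genotypes, and thresholds below are random placeholders, not the actual predictor or cutoffs used in the JAMA study.

```python
import numpy as np

# Illustrative sketch only: real effect sizes would come from a published
# breast-cancer predictor; here both weights and genotypes are random.
rng = np.random.default_rng(0)
n_snps = 100
effect_sizes = rng.normal(0, 0.05, n_snps)     # placeholder per-allele weights
genotype = rng.integers(0, 3, n_snps)          # 0/1/2 copies of each risk allele

# The polygenic risk score is a weighted sum of risk-allele counts.
prs = float(effect_sizes @ genotype)

# Standardize against a simulated reference population, then bin into the kind
# of high / average / low categories used to allocate screening resources.
ref = rng.integers(0, 3, size=(10_000, n_snps)) @ effect_sizes
z = (prs - ref.mean()) / ref.std()

if z > 1.0:                # thresholds are arbitrary illustrations
    category = "high risk: enhanced screening"
elif z < -1.0:
    category = "low risk: reduced screening"
else:
    category = "average risk: standard screening"

print(f"PRS z-score = {z:.2f} -> {category}")
```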
So that's the study that was conducted. We're not that interested in the study itself. It's just an example of a situation where they are starting to incorporate polygenic scores into adult care, in this case using the polygenic risk predictor for breast cancer. Most women who are at high risk for breast cancer, and typically those women would have a family history of breast cancer, are high risk because of the aggregate effect of many SNPs in the polygenic predictor.
So they have many different genetic loci in their DNA that increase their predisposition for breast cancer. That is distinct from the fraction of a percent of women who are carriers of a rare mutation like BRCA. Those women are also at high risk for breast cancer, but typically because of a single gene mutation, and they are the minority of women who are at high risk. There are about ten times more women who are high risk for breast cancer than there are actual BRCA carriers in the population.
So in this study, what got a woman placed into the high-risk category was typically a high polygenic score and not her BRCA status. And, again, I'm going over this particular paper because it gives you an example of the different types of validation of polygenic scores that are going on in preparation for the rollout of polygenic scores in adult medicine and adult health care. As a side product of the study, the graph that I have on the screen shows the cumulative incidence of breast cancer for each risk category of women. This bluish curve, in which there is a much higher incidence, several times higher than for an average-risk woman, is the set of women that were classified as high risk, primarily because of their polygenic score. And you can see there's meaningful differentiation between women who are high risk, medium risk, and low risk, based entirely on a computation that depends on the state of about 100 SNPs, about 100 loci, in their genome.
And I just point to this as an example of a very well conducted study, large n, large study population, which showed the efficacy or validity of polygenic risk prediction. I belabor this point because when we talk about polygenic risk scores for adults, generally there isn't a visceral or emotional reaction. But when we get into using this kind of technology for reproduction, for IVF, for embryo selection, then there is typically, or quite often, an ideological, emotional response. Some subset of people who really should know better, but are just ideologically opposed to the use of this technology in reproduction for the selection of embryos, will often claim that PGS doesn't work, that the scientific status of our ability to predict disease risk or to predict a phenotype like height or intelligence is not well established. In fact, it's quite the opposite.
It's so well established that it's about to be integrated into health care for millions of Americans. I should add that in this study, the polygenic predictor for breast cancer that was used is one produced by an academic consortium. But the leading company that provides breast cancer genetic screens for millions of women worldwide, a company called Myriad, also uses polygenic risk prediction in its latest screen.
I think it's called MyRisk. So in that case, you already have millions of women who have a MyRisk score, which depends on polygenic risk prediction for breast cancer. So to the people who are critical of embryo selection on the basis of science, who want to claim that the science is not mature or doesn't work: those people generally don't let on that the efficacy of polygenic prediction of phenotype or disease risk has now been validated in hundreds of research papers and very large studies, like the one that I've just described to you. Okay? So if you're at all unsure about the scientific status of polygenic risk prediction, the best way to find out is to actually look through the scientific literature and look at the studies that have been done.
But as I said, there have been a huge number of studies establishing the validity and predictive power of polygenic scores. So I no longer consider this an open question of science. Obviously, there are still people resisting that conclusion, but mostly they're resisting it for ideological reasons, and mostly they're resisting it in the context of embryo selection or reproduction. Typically, they're not focusing at all on applications in the adult context. Okay?
So let me go on to the next screen. Now, one of the big steps forward that happened in 2025 was a large study in which my lab, my research collaborators, were participants. The main source of data was the Taiwan Biobank and the Taiwan Precision Medicine Initiative. The paper, which was published in Nature in October 2025, is called "Population-specific polygenic risk scores for people of Han Chinese ancestry."
And this is the first study that builds polygenic predictors of quality equal to the ones that already existed for people of European ancestry, but this time for a non-European ancestry group, in this case East Asians. So it's now the case that we have very strong polygenic predictors for most important disease conditions and phenotypes that actually work for people of East Asian heritage. That was a big step forward in 2025, and it will open the door for more aggressive use of embryo selection in East Asian and other Asian populations. So I just wanted to bring that to your attention.
Here is a slide from my talk at the Berkeley Genomics Project Reproductive Frontiers meeting. I think that was during the summer of 2025. These are my slides, as I said, and there I discussed the preprint version of this paper.
The preprint appeared in 2024, and the paper finally appeared in Nature in October 2025. Here on the slide, you can see that we analyzed over half a million genomes. That's the largest non-European aggregation of genomes that's been studied to build polygenic risk scores. And on this slide, it says under review at Nature Genetics.
That was still the case when I gave this talk, but now the paper has been published. This shows a related paper, which is specifically about height prediction. You can see on the slide, it says few centimeter height prediction accuracy. So now for East Asians, we can easily identify outliers for short stature and outliers who are gonna be much taller than average. And if you can see the slide, you can see there's a pretty high degree of accuracy here in these predictors.
The JAMA study and this Nature paper are examples of continuing progress in the field of polygenic prediction. Most of the progress tends to be, as I mentioned at the beginning, due to improvements in access to data, so larger or better datasets. In terms of algorithms, we're already at a point where we have pretty mature methods to build predictors once we have access to data, data in which the genome of the individual and that individual's disease status or quantitative trait value are known, so how tall the person is or what their cognitive score is. Okay?
So let me skip ahead another couple of slides from this presentation, which was recorded as part of this Reproductive Frontiers meeting, held at Lighthaven in Berkeley over the summer. I hope I have the date correct; maybe I'm off and it was longer ago than last summer. But, anyway, it wasn't that long ago.
In any case, my talk, which is about an hour long, is available online. I'll try to remember to put a link in the show notes so that if you wanna listen to the whole talk, you can. So the slide that I have up now is about cognitive ability. On the right is a visualization of actual data from this old Cold War era study called Project Talent, in which they took psychometric measurements of a hundred thousand ninth graders. A hundred thousand ninth graders: I think at the time there were maybe two million ninth graders in the entire United States.
So this is something like measuring 5% of all ninth graders in the United States. This is the kind of serious thing one could get done in the Cold War which, you know, today wouldn't necessarily be done. If you look at the plot, it's a scatter plot, and the three coordinates are in standard deviation units: the spatial ability score, the mathematical ability score, and the verbal ability score. Nowadays, it's very hard to find measurements of spatial ability. It's almost some kind of thought crime to even talk about something that sounds as strange to the woke mind as spatial ability.
Now, what's interesting about this plot is that the data forms a kind of ellipsoid, which exhibits a positive correlation between each pair of these cognitive scores. Conditional on being, say, high on the math score, or high on the spatial score, or high on the verbal score, the probability that you're also above average on the other two scores is increased. That's what this ellipsoid structure represents, and you could think of the major axis of this ellipsoid as something like the general factor of intelligence. So when you see representations of data like this, it makes it appear that, just due to the correlation structure of cognitive ability in the population, there is some kind of g score, some kind of single value that you could report as someone's overall cognitive ability. You can think of that as a single-number way of compressing these three different scores: spatial, mathematical, and verbal.
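As a rough illustration of the ellipsoid picture and its major axis, here is a small Python sketch that extracts a single "g-like" composite from three correlated scores using PCA. The data and the 0.6 correlations are simulated for illustration; they are not the Project Talent values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate correlated math / verbal / spatial scores in SD units.
# The 0.6 off-diagonal correlations are illustrative, not Project Talent values.
corr = np.array([[1.0, 0.6, 0.6],
                 [0.6, 1.0, 0.6],
                 [0.6, 0.6, 1.0]])
scores = rng.multivariate_normal(mean=[0, 0, 0], cov=corr, size=100_000)

# The first principal component (major axis of the ellipsoid) plays the role
# of a general factor: a single number summarizing the three scores.
cov = np.cov(scores, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
g_direction = eigvecs[:, -1]            # eigenvector with the largest eigenvalue
g_scores = scores @ g_direction         # each person's "g-like" composite

print("loading of each score on the general factor:", np.round(g_direction, 2))
print("variance explained by the first component:",
      round(eigvals[-1] / eigvals.sum(), 2))
```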
So that's just a figure that I like, and I put it on the right side of the slide. On the left, the slide says: best SNP predictor correlates about 0.5 with actual IQ. Okay? So that's a new advance that happened in 2025. This is a claim from the company Herasight, which is doing embryo selection.
We collaborate with Herasight. So in some cases, they're using the data that comes from Genomic Prediction's genotyping of the embryo in order to compute their embryo scores. They're the ones who claim to have constructed a new, better IQ predictor, or g factor predictor, which has a correlation of about 0.5 with the actual underlying cognitive ability. This result has not been replicated by other groups; it appears only in a white paper from Herasight.
I believe what Herasight claims to have done is to use the UK Biobank data and construct a kind of synthetic estimator of the fluid intelligence score, built out of other variables like the income, socioeconomic status, or education level of an individual, and use that to crudely predict the fluid intelligence score. I think there are about 100,000 or maybe 150,000 individuals in the UK Biobank dataset who have had at least a crude fluid intelligence test, which is, I think, only about 12 or 13 questions. So, again, pretty noisy data, but Herasight claims that in out-of-sample validation, the predictor correlates about 0.5 with actual IQ. I think people who are skeptical about IQ in general or the g factor, and furthermore people who are skeptical about the heritability of that construct, are surprised that one could build such a strong predictor. I would guess that with even better data, higher quality data than what Herasight used, one could do even better.
One could possibly get to something like a correlation of 0.7 between the actual underlying cognitive score and the prediction. Already, in simulations, Herasight claims that if you have something like 10 embryos to choose from and you pick the one with the highest polygenic score for cognitive ability, you could increase the expected score of that embryo, relative to the average or a randomly selected embryo from the 10, by something like five to 10 IQ points, which is getting to the point where it's pretty significant. With height, if you were selecting the embryo predicted to be tallest, say, among 10 male embryos, you could get something like two or three inches of increase on average in the height of your child. So the gain from selection on a reasonable number of embryos is at the place where people would care about it. You know, a few inches of height, five or 10 points of IQ: it's getting to the point where, for a family that's already going through IVF, it sort of becomes a no-brainer to do this kind of screening.
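Here is a minimal Monte Carlo sketch of the gain-from-selection logic being described: a predictor correlating r = 0.5 with the trait, 10 embryos, and an IQ standard deviation of 15 points, matching the numbers quoted above. The reduced between-embryo variance for siblings is handled with a crude 1/sqrt(2) factor, so treat the output as a ballpark illustration, not a reproduction of the Herasight analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

r = 0.5            # claimed correlation between predictor and actual trait
n_embryos = 10     # embryos to choose from
iq_sd = 15.0       # population SD of the trait in IQ points
n_trials = 100_000

# Crude adjustment: sibling embryos differ from one another less than random
# members of the population, so the usable between-embryo spread is reduced,
# roughly by 1/sqrt(2). This is a simplification.
sibling_factor = 1 / np.sqrt(2)

# Jointly draw (predictor score, true trait deviation) for each embryo in a
# batch, with correlation r, in within-family standard deviation units.
cov = np.array([[1.0, r], [r, 1.0]]) * sibling_factor**2
draws = rng.multivariate_normal([0, 0], cov, size=(n_trials, n_embryos))
scores, traits = draws[..., 0], draws[..., 1]

# Select the embryo with the highest polygenic score in each batch and compare
# its true trait value to the batch average (i.e., a random pick).
best = np.take_along_axis(traits, scores.argmax(axis=1)[:, None], axis=1).ravel()
gain_sd = best.mean() - traits.mean()

print(f"expected gain from picking top of {n_embryos}: "
      f"{gain_sd * iq_sd:.1f} IQ points (~{gain_sd:.2f} SD)")
```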
So far, the topics that I've been discussing are purely scientific. Let me talk a little bit about embryo selection in IVF and what has been going on, not from a purely scientific point of view, but in terms of competition in the space and also the sociology of families going through IVF and IVF clinics. Here I have on the screen a subway ad from New York City from a company called Nucleus Genomics, which is also in this space, and we have in the past been a collaborator with Nucleus. The advertisement, if you can't see my screen, says "have your best baby" and has a picture of three babies. And so Nucleus has been advertising very aggressively.
Herasight has been writing white papers aggressively stating the gains in traits like height and IQ, which are quite controversial. So we're starting to see possibly a breakthrough in the Overton window, or in public consciousness, about what is possible through embryo selection in IVF. Now, our company, Genomic Prediction, which is actually the original company that first genotyped embryos for embryo selection and computed polygenic scores for embryos, has been deliberately conservative in this area. We have never offered cognitive ability prediction. We have never offered height prediction.
We have never offered cosmetic trait prediction, even though we can do all of these things. The main reasons are, one, we thought society was really not ready for it, and also that individual IVF clinics and individual IVF doctors were quite nervous about it. To date, I believe we've worked with something like 300 IVF clinics around the world, and we've genotyped something like 200,000 embryos. So I think we're at least one order of magnitude, maybe one and a half orders of magnitude, beyond any of the competitors in the space. But in order to get there, we had to be relatively conservative in the polygenic scores that we offered, and to date we have only offered prediction of disease risk.
So major diseases, things like heart disease, diabetes, breast cancer, prostate cancer, but also some psychiatric conditions like schizophrenia: that was sort of the limit of what most IVF doctors could tolerate. Some of these new companies that are coming into the space are gambling that the Overton window has shifted. Now, there are academic studies that suggest this is the case: surveys of the general population, of IVF families, and also of IVF physicians.
All of those show that since we founded the company Genomic Prediction in, I believe, 2018, so seven years ago, there's been a significant shift in acceptance of this technology, to the point where I would say a pretty strong majority now approve of embryo screening for polygenic disease risk, and a reasonably large minority approve of screening for traits like intelligence. If you aggregate the set of people who are in favor of allowing embryo screening for intelligence with those who are neutral or at least not strongly opposed, that is the majority of the population. So it's a minority of the population now that's strongly opposed to intelligence selection. And so it could be that 2025 marks the beginning of an inflection point where you start to see public acceptance of screening on traits like intelligence. We'll just have to see; time will tell.
Now, as I mentioned earlier, one of the breakthroughs of 2025 was good polygenic predictors for people of Asian ancestry. In that population, I believe there's always been very strong approval, even for selection on intelligence. And so, as we begin to be able to serve that population with better and better predictors, I think the overall fraction of IVF users who are comfortable screening embryos for traits like intelligence is gonna move into the majority. So it is gonna become, I think, accepted. Whether that takes just a year or two or another five years, I don't know.
But I think one can't deny that it's gonna happen. For the people who have been following Genomic Prediction for the last seven years, you can see that we've been moving very, very carefully along these lines. If at some point we begin operating in a particular society or culture that is strongly in favor of, or approves of, intelligence or height selection, things like this, then we may decide to offer it ourselves. So stay tuned for that kind of development. Now, another sign of increasingly aggressive use of these technologies is among super-high-net-worth elites.
And it's been reported throughout the year that lots of super-high-net-worth Silicon Valley types have been using polygenic embryo selection in reproduction. Just recently, the Wall Street Journal reported on Chinese billionaires coming to the US and doing aggressive IVF, often using donor eggs and surrogacy. So, for example, there was a super wealthy Chinese guy, Xu Bo, who was profiled in this Wall Street Journal article. He reportedly has something like 100 or more sons who have been born through surrogacy and IVF. Furthermore, it was reported that he prefers eggs from Jewish women.
So he is a philo-Semite, not an anti-Semite. He wants eggs from Jewish women. This news got a lot of attention. If you can see my screen, you can see a little capture from a video.
I think in that video there are about a dozen sons. They're little kids, little toddlers. Very cute. We'll see whether he really has 100; I'm not sure that's really true, but that was what was reported in the Wall Street Journal. This is a picture of Xu Bo from the Wall Street Journal article.
Here's another snapshot, if you can see my screen, of some of his sons. They're all very cute. They all seem to be pretty similar in age, and no doubt he used donor eggs and surrogates. And who knows? Possibly polygenic prediction, polygenic selection of embryos, to produce these sons.
So I think 2025 marks a number of important changes in the landscape for polygenic selection of embryos and advanced reproductive technologies. We will have to see how this evolves in the future. Since it's the end of the year, I guess it's appropriate for me to make some predictions, and I think I can make several. One is that we will continue to see more and more validations of the core technology.
So I think it'll become increasingly untenable for some bioethicist or ideologue who just doesn't like embryo selection to claim that polygenic risk scores or polygenic prediction of traits like height or IQ just don't work. I think it's already scientifically untenable, but people can still get away with it when talking to journalists. It's gonna become increasingly untenable to hold that position if you wanna have any level of scientific credibility. We're gonna continue to see improvement in the quality of predictors, to the point where the gain from doing embryo selection is going to be very, very obvious. The size of the gains is gonna be something that people can't ignore.
A second prediction is that the Asia-Pacific, or generally the East Asian and South Asian, markets are going to grow very fast, because there's no cultural resistance in those parts of the world to using this technology, and now, finally, the technology is at a point where we can do pretty strong polygenic prediction across a variety of traits and disease risks for those populations. So my prediction is that that market will grow very fast in coming years. And then a third prediction is that elites will continue to be the leaders, the most aggressive in using this technology. I suspect that the general population will start to appreciate that super rich people, people like Elon Musk (I'm not saying Elon specifically is doing this) or Xu Bo, the guy whose babies are still on my screen, are doing this.
They're doing it, and I think this is gonna change the attitudes of average people. They'll go from "oh, this is some icky, weird thing that we don't understand, and I'm afraid to say publicly that I approve of it because some woke scold is going to yell at me that I'm a eugenicist or a Nazi." I think that's gonna gradually go away. Instead, it will be replaced by a kind of FOMO, fear of missing out, so that someone who isn't super wealthy, not ultra high net worth but merely high net worth or merely affluent, is gonna ask themselves as they go through IVF: hey, what are we missing out on?
What is my family missing out on, in terms of ensuring the health and well-being of my children, that someone like Peter Thiel or Elon Musk is actually engaging in? And so 2025 might mark that inflection point. Maybe next year at this time, I'll report back on what happened in 2026. Let's move on now to the second topic for this year-end episode. We just completed our discussion of what happened in genomics, particularly in polygenic prediction of complex human traits and embryo selection.
Now we'll talk about AI in 2025. And, of course, AI is probably the single biggest topic in all discussion and media of 2025. It's already starting to change our lives. It's changing the way that professors like myself teach our courses. It's changing the way that scientists do research, and it's created what some people call an enormous investment bubble.
And in the remarks I'm about to make, I'm not gonna talk about the things which are most commonly covered, like the AI bubble, NVIDIA, the hype cycle. I'm gonna focus on an area that's, I think, maybe best described as AI research, and talk about the improvements in the highest-level capabilities of the models that happened in 2025. I think information about this is not really available to the average person. The average person is stuck more or less listening to a bunch of hype that comes from self-motivated AI founders, or possibly looking at some benchmarks, but not necessarily being able to interpret very well or intuitively what those benchmark scores mean. I'm going to try to shed some light on the question of how much the models really improved in 2025.
So if I take the best-performing models available right now, December 2025, and compare them to what was available at the beginning of 2025, I would claim there's a very, very significant qualitative difference in the performance of those models. The last episode of Manifold that I released was about theoretical physics with generative AI. That was Manifold episode 101; this one that I'm recording right now is 102. Let me just briefly review what was said in 101, or what happened with theoretical physics research and generative AI.
I actually published a paper of original physics research which was largely driven by AI. The core idea in that paper is about state-dependent, or nonlinear, modifications to quantum mechanics. I published that paper in Physics Letters B, and it might be the first physics paper in which the core idea came from an AI, in this case from GPT-5. I wrote a companion paper to go with the actual physics research paper. The companion paper is maybe of more interest both to AI researchers and to theoretical physicists.
And then I had a discussion with two other theoretical physicists who are interested in exactly this topic and who actually wrote, either on their Substack or blog or in the form of an actual scientific preprint or paper, a critique of the work that I had done. On the screen, you can see a link to the previous episode of Manifold where I discussed this. This is my own Substack page. If I scroll down here, you can see the posts on X that I made on this topic. The actual paper that I wrote, published in Physics Letters B, is called "Relativistic covariance, nonlinear quantum mechanics, Tomonaga-Schwinger analysis."
So that's a pretty esoteric title. I won't get into the physics content of this paper; I'm gonna focus more on the AI here. Here's a link to the companion paper, in which I described the AI model's contribution to the actual work that was accomplished. And here we have a YouTube recording.
This is, I think, about eighty minutes of discussion between myself and two other professors, both of whom wrote reactions, in some sense, to the work that I had done. This is an earlier episode of Manifold, number 97, in which I interviewed Professor Lin Yang of UCLA, who has a background both in computer science and physics. In one of his research publications, Lin shared a version of, I would say, scaffolding around an AI, where the underlying AI could be any of the leading models available, say, in mid-2025: for example GPT-5, Gemini 2.5, Grok, and I think perhaps Claude as well. Using this scaffolding, which I'll describe a little bit, he was able to get that model to perform at the gold medal level on the International Math Olympiad.
He took the most recent problems from the IMO immediately when they were released, right when the competition happened, so those problems were presumably not in the training data for the model. And he showed that by building this scaffolding around any of these off-the-shelf commercial models, he could get them to the point where they could write correct proofs for five out of six of the IMO problems. The architecture that he used to elicit this level of performance, which I call, and I think he also calls, a generator-verifier pipeline, is the same one I used in the physics research that I performed. Now, interestingly, since that work was done, since he wrote his paper and since I wrote my paper, just in the intervening weeks, DeepSeek released a version of their model, 3.2 Speciale.
It's a funny name, 3.2 Speciale. It has a very large token budget, and without scaffolding, without this generator-verifier structure that both Lin Yang and I used, that newest version of the model is actually able to perform at the gold medal level just by itself. Okay? This isn't true of any of the other off-the-shelf models. Typically, those off-the-shelf models would get maybe one problem correct out of six, and certainly not five out of six. But now we have examples like DeepSeek 3.2 Speciale, and also the scaffolded models, the models that are run through this generator-verifier pipeline, which can perform at that level.
And that is a level which is extremely high. At this sort of highest level of model performance on contest math, they're performing at the gold medal level, so that's something like maybe one in 100,000 humans. And on a more sophisticated set of competition problems from, say, the US Putnam exam, which include higher-level, undergraduate-level mathematics, not just high school mathematics, these models are also performing at, I wouldn't say a completely superhuman level, but comparably to really the best human problem solvers who are trying to do these very difficult contest math problems. So it's an example of what the peak performance of these models is: able to solve these extremely difficult competition problems, able to assist humans in coding. Maybe the most economically impactful use of models right now is in the software industry, using them to actually write code and debug code.
That's become a very big thing. And in my case, the model was able to actually produce some interesting ideas, and analysis of those ideas, for theoretical physics. So that's what I wanna talk about for maybe the next twenty minutes: this highest-level performance of the models and where I think that is gonna go in the future. One of the things that I want to emphasize is that this dramatic improvement in this high-level capability happened on a time scale of basically one year. The models were not particularly good at this kind of thing a year ago.
Now they're extremely good to the point where very, very few humans can compete with them. And this is in an area which, in the case of software, writing code, software development, is very economically impactful. In the case of solving math problems or understanding physics papers, very, very impactful for the progress of frontier science. Okay? So most professors, if they had a grad student that could do math as well as an IMO gold medalist, they would be extremely happy.
That would be a great find, a great addition to their research team, to have a student like that in their group. But now you can have access to that if you just turn on DeepSeek 3.2 Speciale, or you rig up this generator-verifier scaffolding around an off-the-shelf commercial model. I should also mention that in my work in this area, I worked with a team at DeepMind who had built something called co-scientist, which is also a kind of scaffolded, enhanced version of their best Gemini model, with a very large token budget. That thing, co-scientist, I believe is also becoming quite useful to research scientists. I'm not saying that the models are at a point now where they could actually replace a highly experienced research scientist.
The experienced scientist is still a necessary ingredient in producing novel or important research results. But I would also make the case that AIs are becoming very useful for this kind of activity. I mentioned this conversation that I had: pictured here is Professor Jonathan Oppenheim, who's at University College London, along with the interlocutor who led the discussion between Jonathan and myself, Nirmalia, who's at IIT. I don't recall whether this was in the actual discussion that was recorded; it might have been something we discussed afterward.
But I think Nirmalia and I were a little bit more positive about the use of AI in research than Jonathan, and we continued to discuss this after we stopped recording. I think that's right. I said something like: I think within a year or two, or at least a relatively short amount of time, something like 70% of all the younger physicists will be regularly using AI to assist them in a nontrivial way in their research. To my surprise, Nirmalia, who's younger than both Jonathan and myself, said something like, oh, I think that's already true. In other words, he said that among, say, physicists under 35, probably 70% are already using AI quite aggressively in their research.
I think most average people who aren't research scientists at the frontier find this to be pretty shocking. If you were an attorney or an accountant or some kind of white-collar knowledge worker, you might have seen only the one-shot performance of models: I just give the model a prompt, I maybe upload a legal case or some spreadsheet, and I ask the model to do some stuff. I think there would be some fraction of people who would see the potential and say, yeah, this thing could be extremely powerful and replace a lot of human labor already.
But there would be other people who would say, oh, I see a lot of mistakes. This thing makes mistakes. I think it's still, you know, not ready for prime time. You know, it could be useful in some very narrow situation, but not broadly speaking, and it's not gonna replace that many hours of human labor. I think the key issue in those claims or those remarks is that these people only have access to the one shot performance of the model.
One-shot performance of the model means you just, maybe, upload some information to it, then you write the prompt, and then you get the answer. But then there's this generator-verifier pipeline, or more involved reasoning capability like Speciale has, or what people in the industry are calling agentic workflows, where multiple instances of the model are collaborating on and critiquing the response before it's shown to a human. I think most of the people who have the kind of reaction I described are not aware of the significant improvement in model performance when you scaffold it, or embed it, in this kind of generator-verifier or agentic pipeline, because very few people have actually seen output from models used in that way. Again, if you wanna see a dramatic example of this, look at Lin Yang's paper.
There's a whole GitHub repository in which he shows you, I think, about three pages of prompting to put the model into the mode of a verifier, or into the mode of a generator of a proof, or a refiner of a proof in response to verification. So there are different roles that the model plays in the pipeline, and definitely a lot of tokens are burned through the process. But there's a dramatic difference: instead of being able to solve, say, one out of six problems, it's able to write a correct proof for five out of six. That's a huge delta based on this extra scaffolding, which most people who purport to give you some opinion about AI capabilities have never seen and have not really looked at carefully. Okay?
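For a concrete picture of what this scaffolding looks like, here is a bare-bones sketch of a generator-verifier loop. The `call_model` function is a hypothetical stand-in for whatever chat API you use, and the role prompts are drastically shortened relative to the multi-page prompts in Lin Yang's repository; the point is only the control flow: generate, verify, refine, and return nothing to the user until a proof passes verification.

```python
# Minimal sketch of a generator-verifier loop around an off-the-shelf LLM.
# `call_model(prompt)` is a hypothetical stand-in for your chat API of choice;
# the real role prompts in Lin Yang's setup run to several pages each.

def call_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("wire this up to your preferred model API")

GENERATOR_PROMPT = "You are a careful mathematician. Write a complete, rigorous proof.\n\nProblem:\n{problem}"
VERIFIER_PROMPT = ("You are a strict grader. Check every step of the proof below. "
                   "Reply 'ACCEPT' if it is fully correct, otherwise list the flaws.\n\n"
                   "Problem:\n{problem}\n\nProposed proof:\n{proof}")
REFINER_PROMPT = ("Revise the proof to fix the flaws listed by the verifier.\n\n"
                  "Problem:\n{problem}\n\nProof:\n{proof}\n\nVerifier report:\n{report}")

def generate_verified_proof(problem: str, max_rounds: int = 8) -> str | None:
    proof = call_model(GENERATOR_PROMPT.format(problem=problem))
    for _ in range(max_rounds):
        report = call_model(VERIFIER_PROMPT.format(problem=problem, proof=proof))
        if report.strip().startswith("ACCEPT"):
            return proof                 # only verified output reaches the user
        proof = call_model(REFINER_PROMPT.format(problem=problem, proof=proof, report=report))
    return None                          # nothing passed verification; give up
```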
So one of my main take-home points is that people who have experimented with this kind of generator-verifier pipeline have already seen a qualitative bump in what models can do. But that's a very small fraction of people: those who are working with the models at that level and also have the core expertise to tell the difference between "the model is only as good as a first-year grad student at solving these math or physics problems or understanding these physics papers, or maybe a first-year law school student" and "the model is actually doing very nontrivial things if I scaffold it in this way." Okay? That gain in quality is going to make its way relatively soon into the off-the-shelf models, and Speciale is the first example of that. DeepSeek 3.2 Speciale has IMO gold medal capability without any of this extra scaffolding. I haven't tested it on physics knowledge, but I suspect it's qualitatively a jump beyond what the other off-the-shelf models will do in terms of analyzing a physics paper or doing some symbolic calculation.
So I think we can already say with confidence that most people who are commenting on the current capabilities of AIs have not actually themselves seen or appreciated the peak performance of these models, which is already available in 2025 and will, for sure I think, become available in off-the-shelf models in 2026. Okay? Now, beyond that, I think the baseline capabilities of the models, through pretraining and through RL, are going to continue to increase. We are not seeing a slowdown in the rate of improvement of these models. And furthermore, we will make them better and better at agentic collaboration in a pipeline, so that you break the problem up into small pieces.
You have different agents, generator and verifier agents if you wanna call them that, attacking different parts of the problem. The verifier is checking the solution, and none of that is shown to the human user until it's been processed through a potentially huge number of tokens, you know, millions of tokens of inference, and then you're shown the final answer. We don't know how much of an improvement we'll see in that capability just in the next year or so, and it could be quite dramatic. I wouldn't be that surprised if by the end of 2026 the models are extremely good at math and physics and general science, and possibly extremely good at analyzing legal documents, combing through spreadsheets, looking at financial statements, and such. I think it's definitely within reach to have a qualitative jump over the next year in the peak capabilities of these models.
It could be very expensive in terms of inference costs: it could be that this kind of deep processing or deep research uses an order of magnitude more tokens than a typical one-shot response you're getting right now. Even compared to a very good thinking model, it might be another order of magnitude more in inference tokens used. But I believe there will be a significant quality jump corresponding to that additional inference and additional scaffolding. So my prediction for 2026 is that we're gonna see continued improvement. Now, if you dig down into model training and you ask, well, how are they actually getting this improvement?
I can make a couple of comments. Let me take a very specific case, which is the use of models to do symbolic math. Okay? Here, I don't necessarily mean solving a tricky IMO problem. What I mean is: you're a physicist or an engineer, and you have some symbolic math you need done.
Like, you need the AI to do an integral for you, fold it in with some other calculations, maybe solve some algebraic equation, maybe make a plot. Those are things that the best models have made huge strides in over the last year. A year ago, if you asked a model to do some symbolic calculation, and it was a very obvious textbook calculation it had seen before, literally that calculation or something very similar, then it would have a decent shot at giving you a result, but it would also potentially make a mistake. But I believe what's happened in the last year is that all the labs have prioritized making the models better at math and science, and they have focused a lot on reinforcement learning. The models already had a decent understanding of the underlying concepts: if you ask what an electron is, or a photon, or a derivative, or an integral, or matrix multiplication, the models have, somewhere within their trillion-parameter structure, some understanding, or at least an encoding, of those core concepts.
And I believe that through RL, one can give the models, as they go through their reasoning steps, eval functions, training evals in which a symbolic math problem is given to them. Again, not an intentionally tricky problem, but one in which a set of manipulations needs to be done. Maybe each individual manipulation is relatively straightforward, but there's a chain of them, and then the thing comes back with a result.
One can generate synthetic data for reinforcement learning on that set of problems by just using symbolic engines like Mathematica. Okay? For non-scientists: there are already existing programs like Mathematica, the product of Steve Wolfram's company, that are heavily used by quantitative scientists. So if you need to do an integral, make a plot, or even numerically solve some simple differential equation,
a lot of people are doing it in Mathematica or in other open-source symbolic math packages that are similar to Mathematica. Generally, the error rate there is close to zero, because when Mathematica is doing an integral for you or simplifying some algebraic expression, it is following known algorithms. It's not guessing answers or anything like that; it's following procedures to get the answer. And, generally, if it does succeed in giving you an answer, the answer is correct.
But using something like Mathematica, or just some symbolic math package, I can produce an almost infinite amount of training data. The large language model already has, baked into its connections through pretraining, a pretty good understanding of what an integral is, what a derivative is, what a vector is, what a matrix is, or at least a compressed representation of those concepts. I can then use RL with synthetic data from symbolic calculations to make it really good at symbolic calculations: to get it into, in a sense, a habit, an RL-induced habit, of doing those symbolic computations step by step, doing them carefully, chaining together five steps to actually reach the right result. Okay? And I think that's something that happened in the last year.
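Here is a small sketch of the kind of synthetic data generation being described, using the open-source SymPy library as the symbolic engine in place of Mathematica. It emits (natural-language problem, engine-verified answer) pairs plus a toy reward function of the sort an RL loop could use; the problem distribution and the reward are my illustrative guesses, not any lab's actual recipe.

```python
import random
import sympy as sp

x = sp.symbols("x")

def random_integrand(rng: random.Random) -> sp.Expr:
    """Build a simple random integrand from a few basic building blocks."""
    pieces = [x**rng.randint(1, 4), sp.sin(x), sp.cos(x), sp.exp(x)]
    return sum(rng.randint(1, 5) * rng.choice(pieces) for _ in range(3))

def make_example(rng: random.Random) -> dict:
    """One synthetic training example: a natural-language prompt plus a
    machine-verified reference answer produced by the symbolic engine."""
    f = random_integrand(rng)
    answer = sp.integrate(f, x)
    return {
        "prompt": f"Compute the indefinite integral of {sp.sstr(f)} with respect to x.",
        "reference": sp.sstr(answer),
    }

def reward(model_output: str, reference: str) -> float:
    """Toy RL reward: 1 if the model's expression matches the engine's answer
    up to an additive constant, else 0."""
    try:
        diff = sp.simplify(sp.sympify(model_output) - sp.sympify(reference))
    except (sp.SympifyError, SyntaxError):
        return 0.0
    return 1.0 if diff.is_constant() else 0.0

rng = random.Random(3)
dataset = [make_example(rng) for _ in range(5)]
for ex in dataset:
    print(ex["prompt"], "->", ex["reference"])
```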
So if you're a physicist or an engineer and you're using the language models a lot, and you're using them to actually do calculations, there's been a tremendous improvement in their ability to do symbolic math. Again, not super hard Olympiad math, not necessarily generating a proof, although they have also independently improved in those areas. I'm talking about pretty prosaic things: a human could go through these calculations and do them, but it might take hours, and now the LLM is capable of doing them very fast, within a matter of seconds. Okay?
So that's an example of a very specific capability, but a capability which is central to progress, or just day-to-day research activity, where the models, in the course of basically a year, went from pretty unreliable to pretty reliable. I wouldn't say they necessarily have 99% accuracy at this stage yet, but I think the accuracy went up substantially, and it's to the point now where it's quite useful to the researcher. If you just ask it to do some symbolic calculation, sure, you still need to check and look at the results to see whether it made a mistake, but it's quite likely that it didn't. Now you could reply, well, you could have already done that with Mathematica. But in the case of Mathematica, you have to enter the calculation that you want the program to do in a very rigid formal syntax.
So you have to say Integrate, open bracket, integrand, measure, and so on. You have to do all of this very precisely; it's almost like writing a little program. Whereas the models, LLMs, can understand the context of the task that you're trying to give them. They can pretty much figure out what you want them to do, and so it's much easier just to write to them, not necessarily completely in English, but in the same kind of way that you would talk to a grad student or a research collaborator.
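To make the syntax contrast concrete, here is a sketch using SymPy again as the stand-in symbolic engine. The first form is the rigid, program-like entry a traditional symbolic package requires; the string below it is the kind of loose, collaborator-style request you can now hand to an LLM. The particular integral is arbitrary.

```python
import sympy as sp

# Rigid formal syntax: you must spell out the integrand, the variable, and the
# limits in exactly the form the symbolic engine expects, like a tiny program.
x = sp.symbols("x")
result = sp.integrate(x**2 * sp.exp(-x), (x, 0, sp.oo))
print(result)   # -> 2

# Versus the kind of natural-language request you can now hand to an LLM,
# phrased the way you'd ask a grad student or research collaborator:
llm_prompt = (
    "Integrate x^2 times e^(-x) from 0 to infinity, then tell me how the "
    "answer generalizes to x^n, and sanity-check it against the Gamma function."
)
```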
The model will understand and then do the symbolic calculation properly. Okay? So I'm just giving that as an example of a capability in which, on a time scale of a year, the models made tremendous improvement. There's no reason not to expect continued improvement like that: say there's some standard kind of analysis that's done in physical chemistry or molecular biology or something like this. Through the efforts of people in the labs trying to source high-quality data, create high-quality evals, and subject the model both to better pretraining and to post-training reinforcement learning to give it that specific set of skills, all of a sudden the research utility of the models just gets that much better.
And I'm sure that similar things are happening in coding; I think even more energy is going into making them better at writing software, debugging software, understanding libraries, things like that. So I guess my prediction is that 2026 could be the tipping point, through these kinds of training efforts, both pretraining and post-training, but also additional scaffolding. That scaffolding will increasingly look like agents: different instances of the model that have been prompted differently, or maybe even are themselves qualitatively different from each other, collaborating in a pipeline. Okay? That's the generalization of the generator-verifier pipeline that I was talking about.
That aggregate agentic capability, I think, will also potentially improve dramatically in 2026. So for a year from now, I am predicting continued rapid advancement in the capabilities of these models. I'm not predicting a slowdown, even though there may be lots of challenges, like not being able to easily increase the pretraining dataset by an order of magnitude, etcetera. But with synthetic data, and with human input to generate good evals, I think there's still substantial progress that these labs can achieve in the models.
So I think a year from now, we're gonna be amazed at how good the models are. I've gone on now, I think, just over an hour. I wasn't trying to cover the whole AI space; I just wanted to cover one particular thing that I noticed. I could come back later, perhaps, and talk about US-China competition, what's going on with semiconductors and NVIDIA, how this AI bubble is gonna play out, what's gonna happen with data centers.
That's not my purpose here. Maybe I'll come back and do an episode, perhaps with a guest, to talk about those other topics. Thank you very much for being a Manifold listener in 2025. It's been a great year for me. So many fascinating things are happening in the world.
It's a great time to be alive. I hope that all of you are doing well. Thanks so much. Have a wonderful holiday and a happy New Year.