The Transformer architecture (which powers ChatGPT and nearly all modern AI) might be trapping the industry in a localized rut, preventing us from finding true intelligent reasoning, according to the ...
I was involved in inventing the transformer, so arguably no one's been working on them as long as I have, with maybe the exception of the other seven authors. So I actually made the decision earlier this year that I'm gonna drastically reduce the amount of research that I'm doing specifically on the transformer, because of the feeling that I have that it's an oversaturated space. Right? It's not that there are no more interesting things to be done with them, but I'm gonna make use of the opportunity to do something different, right, to actually turn up the amount of exploration that I'm doing in my research.
We just released the Continuous Thought Machine. It's a spotlight at NeurIPS 2025 this year. You should care about it because it has native adaptive compute. It's a new way of building a recurrent model that uses higher-level concepts for neurons and synchronization as a representation, which lets us solve problems in ways that seem more human by being biologically and nature inspired.
The atmosphere in AI research was actually quite different back during the transformer years, because it doesn't feel like something similar could actually happen right now, given the reduced amount of freedom that we have. Right? The transformer was very, very bottom up. It's not that somebody had this grand plan that came down from on high that this is what we should be working on.
It was a bunch of people talking over lunch, thinking about what the current problems are and how to solve them, and having the freedom to have, you know, literally months to dedicate to just trying this idea and having this new architecture fall out. We've spent hundreds of millions of dollars. The biggest sort of evolution-based search is probably in the tens of thousands. We have all this compute. What happens?
What happens if you scale up these search algorithms? I'm sure you will find something interesting, you know, when someone eventually does bite that bullet and really scale up these evolutionary sort of artificial life experiments. Because I pitched it in an environment where people were just going all in on this one technology, I got zero interest. So now I have my own company, and I can pursue those directions.
This podcast is supported by CyberFund.
Hey, folks. I'm Omar, product and design lead at Google DeepMind. We just launched a revamped vibe coding experience in AI Studio that lets you mix and match AI capabilities to turn your ideas into reality faster than ever. Just describe your app, and Gemini will automatically wire up the right models and APIs for you. And if you need a spark, hit I'm feeling lucky, and we'll help you get started.
Head to ai.studio/build to create your first app.
Tufa AI Labs is a research lab based in Zurich. They've got a team of amazing ML engineers and research scientists, and they're doing some really cool stuff. If you look at their website, for example, you can see what their approach was for winning the ARC-AGI-3 public competition, which closed out a few months ago. And they are hiring amazing ML engineers and research scientists.
They also care deeply about AI safety. So if any of that is a fit for you, please go to Tufa AI Labs and give it a go. The audience will know I'm a huge fan of Kenneth Stanley's ideas. His book, Why Greatness Cannot Be Planned, changed my life. It was absolutely insane.
And what he was speaking to is that we need to allow people to follow their own gradient of interest, unfettered by objectives and committees and so on, because that is how we do epistemic foraging. When you have too many agendas involved in the mix, you kind of end up with a grey goo, and you don't discover, you know, interesting novelty and diversity. And I suppose that's basically the thesis of your company, Sakana, is to lean into those ideas.
Yes. Exactly. At the company, we're massive fans of that book. We're hoping to have him come and talk at our company next week, actually. And it's a philosophy that we do talk about internally.
Right? We have copies of the book, including the recent Japanese translation. As, you know, one of the cofounders, one of my main jobs, one of the main things that I have to keep doing for this company, is making sure that we protect the freedom that the researchers currently have. Right? Because it's a privilege, really, that we have the resources to be able to do that.
And inevitably, as I've seen happen, as the company grows, more and more pressure comes in, and it narrows the freedom. But I think because, you know, we believe in this philosophy so strongly, I'm hoping that we can give people all the research freedom that we do now for as long as possible.
And what are those processes that curtail freedom as a company matures? I mean, how would you describe that?
It's great that there's never been so much interest and people and talent and resources and money in the industry. But, unfortunately, that just increases the amount of pressure people have in order to compete with all the other people working on it and trying to get the value out of this technology and make money. And I think that's what just happens. Like, as a startup, you have a feeling of, you know, excitement and trying something new. And right at the beginning, you have a bit of a runway, so you have the freedom to try different things.
But inevitably, people start asking for returns on their investments, or they're expecting you to churn out some product. And this just, unfortunately, reduces the creativity that the researchers have, because, you know, the pressure to publish or the pressure to create technology that's actually useful for the products that we have goes up. And so the feeling of autonomy, I think, starts to go down. But, you know, I literally tell people when they start working for the company, I want you to work on what you think is interesting and important, and I mean it.
There is, I mean, on YouTube, there's a phenomenon called audience capture.
Right.
And I think there might be a phenomenon called technology capture, which is that in the early days of Google, it was quite open ended. And, I mean, transformers are now the ubiquitous backbone of all AI technology, and it's a huge achievement that you're involved in that. But there's a similar story with OpenAI. They're now starting to see all of these commercialization opportunities. They're gonna become LinkedIn.
They're gonna become an application platform. They're gonna become a search platform. They're gonna become a social network. And I guess this could happen to you guys, that there's a very strong chance, especially with your new paper that we're gonna talk about today, this Continuous Thought Machine. It could be a revolutionary technology, but then it will become obvious how it could be commercialized.
And that's how those pressures come in.
I like the audience capture analogy. I think there's definitely been some kind of capture by large language models. Right? They worked so well that everyone wanted to work on them. And I'm really worried that we're kind of stuck in this local minimum now.
Right? And we sort of need to try to escape it. So we spoke about the transformers, but there's a time just before the transformers that I'd like to talk about, because I think it's quite illustrative. Of course, the main technology before transformers was recurrent neural networks. Right?
And there was a similar feeling. Right? When recurrent neural networks came in and we, you know, discovered this new sort of sequence-to-sequence learning, that was also a massive breakthrough. The translation quality went up massively.
Right? Voice recognition quality went up massively. And there was a similar sort of feeling then of, like, okay, yes, we've found the technology, and we just need to sort of perfect this technology.
And back then even, my favorite task was character-level language modeling. So every time a new RNN-based character-level language modeling paper came out, I got quite excited. I'd wanna, like, quickly read the paper, like, okay.
How did they get the improvements? But the papers were always these just slight modifications on the same architecture. Right? It was LSTMs and GRUs, and maybe initializing it with the identity matrix so you could use the ReLU function, or maybe if you put the gate in a different place, or if you layered them in a slightly different way, or if you had gating going upwards as well as sideways. And I remember one of my favorites was this, like, hierarchical LSTM where it would actually decide to compute or not compute at the different layers.
And if you trained on Wikipedia and you looked at the structure of when it decided to compute or not compute, it kind of looked like the structure of the sentences was actually being picked up by the model. And I used to love that sort of stuff. Right? But the improvements were always, like, 1.26 bits per character, 1.25 bits per character, 1.24. That was the result.
That was publishable. Right? That was exciting. But then after the transformer, on the team that I went on to afterwards, we applied, for the first time, very deep transformer models, decoder-only transformer models, to language modeling, and we immediately got something like 1.1.
So, something that was so good that people would actually come to our desk and politely tell us, like, I think you've made an error, like a calculation error. Do you think it's nats, not bits per character? And we're like, no, no, it really is the correct number.
What struck me later is that all of a sudden, all of that research, and to be clear, very good research, was suddenly made completely redundant.
Yes.
Right? All of those endless permutations to RNNs were suddenly seemingly a waste of time. We're kind of in the situation right now where a lot of the papers are just taking the same architecture and making these endless different tweaks of, like, you know, where to put the normalization layer, and slightly different ways of training them. And we might be wasting our time in exactly the same way. Right?
Like, I personally don't think we're done. Right? I don't think that this is the final architecture, and we just need to keep scaling up. There's some breakthrough that will occur at some point, and then it will once again become obvious that we're kind of wasting a lot of time right now.
Yeah. So we are a victim of our own success. And this basin of attraction, there are so many basins of attraction. Sarah Hooker spoke about the hardware lottery, and this is a kind of architecture lottery. And it actually made me think of the agricultural revolution, which is that this kind of phase change happened, and all of the folks that had these skills that were so necessary, these diverse skills for living and surviving, they died out.
And that's actually quite paradoxical, because we need those skills to take the next step. And so we're now in this regime. We've got the term foundation model, and the implication is that you can do anything with a foundation model. In the corporate world, we used to have data scientists.
You know, there were ML engineers doing these architectural tweaks even in, you know, midsize enterprises. And now we just have AI engineers who are just doing prompt engineering and so on. So you're saying that the fundamental skills that we need to be diverse, to think of new solutions and new architectures, they're dying out?
I think I'm gonna disagree with that. I think the problem is we have plenty of very talented, very creative researchers out there, but they're not using their talents. Right? For example, you know, if you're in academia, there's pressure to publish. Right?
And if there's pressure to publish, you think to yourself, okay, well, I have this really cool idea, but it might not work. It might be too weird. Right? It might be difficult to get it accepted because I have to sort of, like, sell the idea more, or I can just try this new positional embedding.
Right? The problem is that the current environment, both in academia and in companies, is not actually giving people the freedom that they need to do the research that they probably want to do.
I mean, there's also this interesting thing that, even in spite of great new research, I mean, I was speaking to Sepp Hochreiter, and he's got all of these new architectural ideas, and OpenAI aren't implementing them. I mean, Google are doing this diffusion language model, which is quite cool. And I'd like to know your opinion on why that is. So there's a few philosophies floating around, like this concept of a universal representation, that there are universal patterns and the transformer representations resemble those in the brain. And it's rather led to this idea of, well, we don't need to use different architectures, because if we just have more scale and more compute, then all roads lead to Rome.
So why would we bother doing it any differently?
There are actually better ones. Right? There are actually already architectures that have been shown in the research to work better than transformers. Okay? But not better enough to move the entire industry away from such an established architecture, where you're familiar with it, you know how to train it, you know how it works, you know how the internals work.
Right? You know how to fine tune them. You have all this software that's already set up for training transformers, fine tuning transformers, inference. So if you wanna move the industry away from that, being better is not good enough. It has to be obviously, crushingly better.
Transformers were that much better than RNNs. Okay? Transformers were such that you just applied them to a new problem, and they were so much faster to train, and you just got such higher accuracy, that you just had to move. And I think the deep learning revolution is also another example of that, right, where you had plenty of skeptics, and people were pushing neural networks even back then. And people were going, no.
We think symbolic stuff will work better. But then they demonstrated it as being so much better that you couldn't ignore it. This fact makes finding the next thing even harder. Right? That's the gravitational pull always pulling you back to, oh, okay.
But a transformer's good enough. And, yeah, you made a cool little architecture over here, and, yeah, it looks like it's got better accuracy, but OpenAI over here just made it 10 times bigger, and it beats that. So let's just keep going.
May I also submit that there could be an additional reason, which is, you know, I love that fractured entangled representation paper. There's this shortcut learning problem. Mhmm. And I think that there's a little bit of a mirage going on here, and there might be problems with these models that we're not fully aware of. And there's also this thing that we're seeing, that we are starting to bastardize the architecture.
So we know we need to have adaptive computation for reasoning. We know we want things like uncertainty quantification. And what we're doing is bolting these things on top, rather than having an architecture which intrinsically does all of these things that we know we need.
Yeah. And I think our Continuous Thought Machine is an attempt at addressing those more directly, which Luke will be able to tell you more about later. There's something still not quite right with the current technology. Right? I think the phrase that's becoming popular is jagged intelligence.
Right? The fact that you can ask an LLM something and it can solve literally, like, a PhD-level problem, and then, you know, in the next sentence, it can say something just so clearly, obviously wrong that it's jarring. So I think this is actually a reflection of something probably quite fundamentally wrong with the current architectures, as amazing as they are.
The current technology is actually too good. Okay? That's another reason why it's difficult to move away from them. They're too good in the following sense.
You spoke about the fact that we have these foundation models, the idea that they're the foundation and we can do anything with them. Yes, I think current neural networks are so powerful that if you have enough patience and enough compute and enough data, you can make them do anything.
But I don't necessarily think that they want to. Right? We're sort of forcing them. Like, they're universal approximators. But I think there is probably a space of, you know, function approximators that will more want to represent things in the way that a human represents them.
So there's actually quite an obscure paper that is my poster child for this. It's called Intelligent Matrix Exponentiation. And I think that it was actually rejected. So, you know, you can probably project the image of figure one, but there's an image of it solving, you know, the classic spiral dataset, where you need to separate the two classes in the spiral.
Yes.
And it has the decision boundary for both a classic ReLU multilayer perceptron and a tanh multilayer perceptron. And you can see they both solve it. Right? Technically, they both solve the problem, because they classify all the points correctly and get a very good test score on this very simple dataset. And then they show you the decision boundary for the M-layer that they build in this paper, and it's a spiral.
The layer represented the spiral as a spiral. And, you know, if the data is a spiral, shouldn't we represent it as a spiral? And then if you look back at the decision boundaries for the tanh and the classic ReLU multilayer perceptrons, it's clear that you just have these tiny little piecewise linear separations. And that's what I mean.
Yes, if, you know, you train these things enough and you push these little piecewise linear boundaries around enough, they can fit the spiral and get a high accuracy. But there's no feeling, when I look at that image, that the ReLU version actually understands that it is a spiral. Right? And when you represent it as a spiral, it actually extrapolates correctly, because the spiral just keeps going out.
You're touching on something fascinating there, because, you know, we were talking about the need for adaptivity and adaptive computation. I'm really inspired by Randall Balestriero's spline theory of neural networks, and we've had him on many times. And you can look on the TensorFlow playground. You can look at what happens when you have a ReLU network on, you know, this spiral manifold. And, you know, you'd be forgiven for thinking that these things are basically a locality-sensitive hashing table.
Right? Because they do. They partition the space, and they can predict the spiral manifold. Right? But we wanna do something a little bit different from that.
And it also comes into this impostors thing, because there's a big difference between just tracing the spiral manifold and continuing the pattern. From an impostor perspective, just tracing the pattern is not learning it abstractly or constructively. Right? If we learned it constructively, so, you know, you speak about this in your paper, this complexification, the abstract building blocks, and you can do adaptive computation, you understand the spiral.
That means that with adaptive computation, you can continue the spiral, and then you can update the model's weights so it has adaptivity, because that's so important for intelligence. So we know that we need models that can do these things. But for some reason, they're so sycophantic. They're almost better than an adaptive intelligent system, because they tell us exactly what we want to hear. They seem so intelligent, but we know that they're missing these fundamental properties.
I'm still fairly skeptical when I see video generation models. You know, we went through a phase where you could detect them because of the number of fingers on somebody's hand. Right? And, yes, with more data, with more compute, with better training tricks, okay, they've fixed it, and now they usually do have five fingers. But did we fix the problem, or did we just use more brute force to, you know, force the neural network to know it's five fingers, rather than something that actually had a much better kind of representation space? It's almost mad that it's controversial to say that we should represent a spiral like a spiral.
But, you know, something that could do that generally, that if it represented a human hand the way that, you know, maybe I represent a human hand, then maybe it would be much easier to count how many fingers are on a hand. It's unfortunate that they work so well. It's unfortunate that scaling works so well, because it's too easy for people to just sweep these problems under the carpet.
You guys have possibly created what I think might be the best paper of the year. This could actually be the innovation which takes us to the next step. And did you get the spotlight at NeurIPS as well? Yeah. This year?
And congratulations on that. So I think that's testament to how amazing this paper is.
The CTM, the Continuous Thought Machine, is actually not that far outside of the local minimum that we're stuck in. Right? It's not as if we went and found this completely new technology. We took quite a simple biologically inspired idea, right, the fact that neurons synchronize, and not even necessarily in a biologically plausible way.
Right? Brains don't literally have all their neurons wired together in a way that they work out their synchronization. But it's the sort of research that I wanna encourage people to do, and the way to sell it is quite easy, I think. At no point did we have to worry about being scooped. Right?
That stress was taken away from us completely. So there was no pressure to sort of rush out with this idea, because it's not like, well, there's probably somebody else working on exactly this. And I think the reason that we were able to get a spotlight is because we were able to create such a polished paper. We took the time to do the science properly, to get the baselines that we wanted and do all the tasks that we wanted to try. And I want to encourage researchers to take a little bit more of a risk, right, to try these slightly more speculative long-term ideas.
The sad thing is, I don't think it's necessarily a very difficult thing to sell. And I want to have the CTM as, like, a poster child of: it works. Right? It was a bit of a risk. We didn't know if we were gonna find something interesting.
But, you know, it was our first shot, and we did find something interesting, and it became a successful paper.
If we do find a system which can acquire knowledge, design new architectures, do the open-ended type of science that you're speaking to, can you see a future where, at some point, the locus of progress will be mostly driven by the models themselves?
I think so. Whether or not that's going to replace us completely, I go back and forth on. Powerful algorithms are already helping us do research. Right? So I think it might just end up being a more powerful version of that.
Right? So, you know, with the AI Scientist that we released, we showed that you could actually go end to end, right, go from seeding the system with an idea for a research paper, and then just take your hands off and let it go. Think about the idea, write the code, run the code, collect the results, and write the paper, to the point that we were actually able to get a 100% AI-generated paper accepted to a workshop recently. But I think we did that to show that you could do it, right, as a sort of demonstration.
In a real system, I think I would want it to be much more interactive. I would wanna be able to seed it with an idea and then have it come back with more ideas, have a discussion with me, then go away and write the code. I wanna look at the code and check it, and then discuss the results as they're coming out. So that's the sort of near-term future that I would envision, or how I would like to do research with an AI.
And could you introspect on that? Is it because you feel we need supervision because the models don't yet understand? You know, there's this path dependence idea. So we need to do supervision because we have the path dependence so we can guide the generation of the language models. Maybe in the future, the language models will just understand better themselves.
But there's also the output dimension, which is that we want to produce artifacts that extend the phylogeny of human interest. We want it to be human relevant.
Yeah. I think it's more that, you know, in that initial seed idea, it's probably impossible to actually describe exactly what you want. It's exactly the same with, you know, when I have an intern. I can't just have an intern come into the company, and go, I have this mad idea, and then just explain it to them and leave them alone for four months. There's a back and forth, because I have a particular idea that I want to explore, and I need to keep steering them in the direction that, you know, I had in my mind originally.
So I think it's more like that, basically.
You have such a deep understanding. You have this rich provenance and history and path dependence, and that means you can take creative steps. Intuitive steps, for you, respect the phylogeny. They respect all of this deep abstract understanding that you have, and interns don't yet have that. Right.
But maybe AI models in the future will have that.
Yeah. Sure. If they get to the point where my input becomes detrimental, then, yeah, that'll be a thing. It's kind of like chess, right? There was a point at which a chess-engine-and-human fusion actually beat chess engines.
That's not true anymore. Right? Adding a human into the mix actually makes the bots worse.
Oh, interesting. I wasn't aware of that.
Yeah. So what to do when that day comes for AI scientists is a broader discussion, I think.
I think now is a good segue to talk about this paper in a little bit more detail, this Continuous Thought Machine you were just pointing to before. Luke, first of all, mate, introduce yourself and set this thing up for us.
My name's Luke. I'm a research scientist at Sakana AI, and my primary area of research is the Continuous Thought Machine. It took us somewhere in the region of about eight months working on this project with the whole team. I did a lot of the work, but we also had a lot of people in different areas doing different parts of it. I think an eight-month life cycle for a paper seems a bit long for AI research at the moment. But yes, to the actual technical points of the paper.
So we call it the Continuous Thought Machine. It originally had a different name; we called it the Asynchronous Thought Machine before. But every single time people asked us what the asynchronous part was, it became a bit confusing. So the Continuous Thought Machine basically depends on three novelties.
The first one is having what we call an internal thought dimension. And this is not necessarily something new. It's related conceptually to the ideas of latent reasoning, and it's essentially applying compute in a sequential dimension. And when you start thinking about ideas and problems in this domain and in this framework, you start understanding that many solutions to problems that look intelligent are often solutions that have a sequential nature. So, for instance, one of the primary tasks that we tested the Continuous Thought Machine on was this maze-solving task.
Solving mazes with deep learning is quite trivial. It's really easy to do if you make the task easy for machines. And one of the ways to do this is you give an image of a maze to a neural network, like a convolutional neural network, and it outputs an image the same size as the maze: zeros where there isn't a path and ones where there is a path. And there's some really brilliant work showing how you can train these in a careful way and scale them up essentially indefinitely. And this is fascinating and a really interesting idea of how to solve this.
However, when you take that approach out of the picture and you ask what is a more human way to solve this problem, it becomes a sequential problem. You have to say, well, go up, go right, go up, go left, whatever the case may be, to trace a route from start to finish. And when you constrain that simple problem space and you ask a machine learning system to solve it like that, it turns out to actually be much, much more challenging. So this became our hello-world problem for the CTM. And applying an internal sequential thought dimension to this is how we went about solving it.
There are two other novelties that we can touch on and talk about. We sort of rethought the idea of what neurons should be. There is a lot of excellent research in cognitive neuroscience, particularly exploring how neurons work in biological systems. And then on the other side of the scale we have how deep learning neurons work, for which the quintessential example is a ReLU: it's off or on, in a sense. And this very, very high-level abstraction of neurons in the brain feels a little bit myopic.
So we approached this problem and said, well, on a neuron-by-neuron basis, let's let each neuron be a little model itself. And this ended up doing a lot of interesting work in how to build dynamics into the system. The third novelty here is, as I said before, we have this internal dimension over which thinking happens. We ask the question, well, what is the representation? What is the representation for a biological system when it's thinking?
Is it just the state of the neurons at any given time? Does that capture a thought, if you wish? If I can be controversial and use the term thinking and thought. And my philosophy with this is no, it doesn't. That the concept of a thought is something that exists over time.
So how do we capture that in engineering speak? Instead of measuring the states of the model that is recurrent, we measure how it synchronizes, how neurons synchronize in pairs along with other neurons. And this opens up the door to a huge array of things that we can do with this type of representation.
You were talking about this sort of sequential nature of reasoning. And, devil's advocate, I mean, there was that Anthropic biology paper, and they were talking about planning and thinking. They were saying that this thing is planning ahead. But I think your system, where we can actually say it does planning, is actually different computationally. Can you explain that?
Yes. I think the boundary in terms of computation, from a Turing machine perspective, if you wish, is really interesting, because the notion of being able to write to your tape, read from that tape, and then write again, to be in a Turing-complete system, is obviously an incredible idea that has completely changed the world. And I think the primary difference, if we talk about transformers versus what we're trying to do with the CTM, is that the process that the CTM thinks in, we can apply that internal process to breaking down a problem. The problem itself can have a single solution, and you could produce that in one shot.
You could, as I explained with the maze, just process that in one shot. But there are certain phrasings of problems, of real problems, where doing so becomes exponentially more challenging. So in the maze task, a really good example is that if you try to predict 100 or 200 steps down the path in one shot, no model that we could train, not even our model, could do that. We needed to actually build an auto-curriculum system where the model first predicted the first step, and then, when it could predict the first step, we started training it on the second and third and fourth steps. And the resultant behavior of this is where it gets interesting.
The way that I like to do research, and that I encourage people who work with me to do research, is to understand, if you wish, the behavior of a model. We are getting to a point now where the models that we build are demonstrably intelligent in ways that keep surprising us. And breaking that down into a single set of metrics, or even a finite single metric about performance, seems maybe not to be the right way to do it, for me. Understanding the behavior and the actions that those models take when you put them in a system and train them in a certain way seems to reveal more about what's actually going on under the hood.
Very cool. And I think I didn't pick up on this. So you're doing a fixed number of steps, so you have, like, a context window. And did you say that you've set that at around 100 steps?
So for the maze task, the model always observes the full image. At every step, the CTM will observe the full image. For argument's sake, those inputs could be tokens from a language model, the output of a language model. Those inputs could be numbers that the model has to sort, whatever the case may be. It should be agnostic to data.
That's how we've tried to build it. But in the maze task, the model can continuously observe the data. It can look at the whole image simultaneously, but it uses attention to retrieve information from the data. And it has, let's call it, 100 steps that it can think through. And what we do is we pick up the point where, say, the model solves three steps through the maze.
So it says, I'm going to go up, up, and right, and it's correct, but then it makes the wrong turn. At that point, we stop supervision. We only train it to solve the fourth step, so one more than what it could do. In practice, we do five, but the principle holds.
And when you do that, it's a self bootstrapping mechanism. And I think the intuitive listener will understand how that extends to other domains, other sequential domains, for instance, like language prediction, many tokens ahead, that sort of thing.
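To make that self-bootstrapping supervision concrete, here is a minimal sketch in PyTorch of the idea as described: find the length of the correct prefix the model already predicts, then apply the loss only a few steps beyond it. The names (curriculum_loss, LOOKAHEAD) and shapes are illustrative assumptions, not the paper's code.

```python
# A minimal sketch, assuming a toy setup, of the auto-curriculum described above.
import torch
import torch.nn.functional as F

LOOKAHEAD = 5  # the speakers mention supervising ~5 steps beyond the correct prefix

def curriculum_loss(logits, target_route):
    """
    logits:        (T, num_actions) action logits for each step along the route
    target_route:  (T,) ground-truth action indices
    """
    pred = logits.argmax(dim=-1)                 # current greedy prediction per step
    correct_prefix = 0
    for p, t in zip(pred, target_route):         # length of the already-correct prefix
        if p.item() != t.item():
            break
        correct_prefix += 1

    # Supervise only up to (correct prefix + LOOKAHEAD) steps; later steps are ignored
    horizon = min(correct_prefix + LOOKAHEAD, target_route.shape[0])
    return F.cross_entropy(logits[:horizon], target_route[:horizon])
```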
So I'm really interested in this idea of adaptive computation. So I guess the first question is, how sensitive was the performance to the number of steps? And then the next question would be, could you have an arbitrary number of steps? Which means that perhaps based on uncertainty or some kind of criterion, you could do fewer steps? And then the final question is, could you have potentially, like, an arbitrary or unbounded number of steps?
Yeah, really super question. I'll answer the uncertainty question first, about the sensitivity to steps. So a very good example of this is when we trained the model on ImageNet classification, and our loss function is quite simple. What we do is we run it for, for example, 50 steps, and we pick up two points, two distinct points.
The first one is where it is performing the best, i.e. where the loss is the lowest. And the second one is where it is most sure, where it is most certain.
And those give us two indices between zero and forty-nine inclusive. And we apply cross entropy at both of those points; we just make the loss the average of the cross entropy at those points. So what this does is it induces a behavior where easy examples are solved almost immediately, in one or two steps, whereas more challenging examples will naturally take more thinking. And it enables the model to use the full breadth of time that it has available to it in a natural fashion, without having to force it to happen.
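A minimal sketch of that two-point loss, under my reading of the description: compute a per-tick cross-entropy, pick the tick with the lowest loss and the tick with the lowest entropy (most certain), and average the losses at those two points. Shapes and names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ctm_loss(logits_per_tick, target):
    """
    logits_per_tick: (T, num_classes) predictions at each of T internal ticks (e.g. T=50)
    target:          scalar class index (long tensor) for this example
    """
    T = logits_per_tick.shape[0]
    targets = target.expand(T)                                              # same label at every tick
    losses = F.cross_entropy(logits_per_tick, targets, reduction="none")    # (T,)

    probs = logits_per_tick.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)            # (T,) certainty proxy

    t_best_loss = losses.argmin()        # tick where the loss is lowest
    t_most_certain = entropy.argmin()    # tick where the model is most sure

    return 0.5 * (losses[t_best_loss] + losses[t_most_certain])
```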
So you've decided to model every neuron as an MLP, which is really fascinating, and we'll talk about that, but there's also this notion of synchronization. And I think you use the inner product to determine the extent to which the neurons are synchronized, and this kind of unfurls over time as the driving force. Can you explain that in a bit more detail?
Absolutely. I think it's a good point to explain the neuron-level models, as we call them in the paper, or NLMs, first, because it ties into this. So you can imagine a recurrent system as a state vector, a state vector that is being updated from step to step. We track that state vector as it unfolds, and for each individual neuron, each i-th neuron in the system, we have an unfolding time series. It's discrete in time, but continuous in value.
And those time series define what we call the activations over time. Synchronization is quite simply just measuring the dot product between two of these time series. So you have a system of d neurons, and essentially you have on the order of d squared over two different synchronization pairs. So neuron one can be related to neuron two by how they synchronize, and neuron one can also be related to neuron three, etcetera, etcetera. The neuron-level models function by taking in a finite history, like a FIFO queue, of activations coming in, and instead of being just a ReLU activation, they use that history as information to produce a single activation out.
And that is what moves us from what we call pre-activations to post-activations. And the principle here is that this might seem rather arbitrary, and does it help for performance? It turns out it does, but that's not really the catch-all solution here. That's not what we're after. What we're after here is trying to do something biologically plausible.
Find the line somewhere between biology, which is how the brain implements things in the biological substrate that we have, versus deep learning, which is highly parallelizable, super fast to learn, back prop amenable, all of the nice properties of that that have got us this far, and find a line somewhere where we can take some sprinkling of biological inspiration but still train it with deep learning. And it turns out that neuron level models is a nice interim that we can do this with. The concept of synchronization is applied on top of the outputs of those neuron level models.
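Here is a minimal sketch of the neuron-level model idea as described above: every neuron keeps a short FIFO history of its incoming pre-activations, and a tiny private MLP maps that history to one post-activation. The batched-weight implementation and all names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class NeuronLevelModels(nn.Module):
    def __init__(self, d_neurons: int, history_len: int, hidden: int = 16):
        super().__init__()
        # One small per-neuron MLP, stored as batched weights: (d, in, out)
        self.w1 = nn.Parameter(torch.randn(d_neurons, history_len, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(d_neurons, hidden))
        self.w2 = nn.Parameter(torch.randn(d_neurons, hidden, 1) * 0.1)
        self.b2 = nn.Parameter(torch.zeros(d_neurons, 1))

    def forward(self, history):
        # history: (batch, d_neurons, history_len) FIFO of recent pre-activations
        h = torch.einsum("bdm,dmh->bdh", history, self.w1) + self.b1
        h = torch.relu(h)
        out = torch.einsum("bdh,dho->bdo", h, self.w2) + self.b2
        return out.squeeze(-1)               # (batch, d_neurons) post-activations
```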
So on the scaling, I think the time complexity is quadratic with respect to the dimension of the synchronization matrix. Right? And in your paper, you were talking about subsampling to improve the performance. But how did that affect the stability? And, you know, were there any things that that cost you, doing that?
Yeah. It's a neat question. I think in terms of stability, what we found was kind of fun, and this was a sentiment that we had throughout the experiments that we ran with this paper: no matter what we tried it on, it just kinda worked, with all spreads of hyperparameters. And the problems that you have with backprop through time, typically with recurrent models like RNNs and LSTMs, are a challenge. You run for many internal ticks with the RNNs or the LSTMs and the learning seems to break down.
But the fact that we use synchronization in some sense touches all of the neurons through all of the time, so it really helps with gradient propagation. A nice interesting point that's maybe a bit oblique to what you asked about synchronization is that we have a system of d neurons and, like I said earlier, there are on the order of d squared over two possible combinations. This essentially means that our underlying state, our underlying representation of the system, is quite a lot larger than what you would get by just taking those d neurons. And as to what that means in terms of downstream computation and performance and the things that we can do with this, that is what we're actively exploring right now.
You guys used an exponential decay rate?

You have the system that unfolds over time. It would be maybe a little bit too constrained if the synchronization between any two neurons depended on the same time scale. So, for instance, there are neurons in your brain that are firing over very long time scales and very short time scales. The way that they fire together impacts other neurons and causes those neurons to fire. But everything in biological brains happens at diverse time scales.
It's why we have different brain waves for different thinking states, for instance. But beside that point, what we do with the exponential decay in the Continuous Thought Machine is it allows us, with a very sharp decay, to say that for these two neurons that are pairing together, what really matters is only how they fire together right now. Right? Whereas if we had a very long and slow decay, essentially that's capturing a global sense of how those neurons are firing over an extremely long period of time. So this was essentially a way of us capturing this idea that different neurons could maybe fire together very quickly and other neurons can fire together very slowly, or not at all.
And this lets that representation space that I spoke about, that d-squared-over-two representation space, again become more rich, and we can enrich that space with more subtle tweaks to how we compute those representations.
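A rough sketch of decayed synchronization under the description above: for a sampled subset of neuron pairs (since there are roughly d squared over two of them), take an exponentially weighted inner product of the two activation traces, with a per-pair decay rate controlling whether only recent ticks matter or the whole history counts. Shapes, names, and the omission of any normalization are simplifying assumptions, not the paper's exact formulation.

```python
import torch

def synchronization(post_acts, pairs, decay_rates):
    """
    post_acts:   (batch, d_neurons, T) activation traces over T internal ticks
    pairs:       (P, 2) long tensor of sampled neuron index pairs (i, j)
    decay_rates: (P,) non-negative rates; 0 weights all ticks equally,
                 large values mean only the most recent ticks matter
    """
    B, D, T = post_acts.shape
    ages = torch.arange(T - 1, -1, -1, dtype=post_acts.dtype)     # age of each tick, newest = 0
    weights = torch.exp(-decay_rates[:, None] * ages[None, :])    # (P, T) exponential decay

    zi = post_acts[:, pairs[:, 0], :]                             # (B, P, T) trace of neuron i
    zj = post_acts[:, pairs[:, 1], :]                             # (B, P, T) trace of neuron j
    return (weights[None] * zi * zj).sum(dim=-1)                  # (B, P) decayed inner products

# Example: 512 neurons, 2048 sampled pairs out of ~131k, 50 internal ticks
acts = torch.randn(8, 512, 50)
pairs = torch.randint(0, 512, (2048, 2))
rates = torch.rand(2048)                       # could be learnable parameters
rep = synchronization(acts, pairs, rates)      # (8, 2048) representation vector
```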
So we were speaking about this yesterday, Luke, that when folks apply transformers to things like the ARC challenge, or things that need reasoning, we need to do lots of domain-specific hacks. So the ARChitects, who were the winners of last year's challenge, they did depth-first search sampling. And some folks have been experimenting with using language representations or using DSLs. And some part of this is to do with the reachability of language. And language is quite dense, which means you can kind of monotonically increase.
But if I understand correctly, your system might have some interesting properties for reasoning, and for discrete and sparse domains, and also for sample efficiency. Because we want to build a system that can actually do well on things like the ARC challenge. But can you explain in simple terms why you think this architecture could be significantly better than transformers for doing those things?
I think a lot of the really fascinating work in the last few years in the literature of language models has been related to what one can actually call a new scaling dimension. I in some sense see chain-of-thought reasoning as a way of adding more compute to a system. That's obviously just one small part of what that really is and what that really means, but I think it's quite a profound breakthrough in some sense. Now what we're trying to do is have that reasoning component be entirely internal, yet still running in some sort of sequential manner. And I think that that's rather important.
And you spoke earlier about Gemini's diffusion language modeling, and I think that there are a lot of different directions exploring this right now. I do think that the Continuous Thought Machine, with the ideas of synchronization and multi-hierarchical temporal representations, gives a certain flexibility in that space that other people are not yet exploring. And the richness of that space, being able to project the next step to solve the ARC challenge, and the next 100, the next 200 steps, to break that down into a process that a model can then very quickly search in its high-dimensional latent space, becomes something that feels like a good approach to take.
Do you see any relationship between this architecture and, you know, Alex Graves' Neural Turing Machine?
Yes. That's really interesting. I do. I think that one of the most challenging parts about working with a Neural Turing Machine is the concept of writing to memory and reading from memory, because it is a discrete action. And that has its own challenges associated with it.
And, yes, I wouldn't go so far as to say that the Continuous Thought Machine is definitively Turing complete, but there is the notion of doing reasoning in a space that is latent, and letting that space unfold in a way that is rich across a different set of tasks. And this actually brings me to a point that I find quite interesting that I'd like to share with you. Consider again the ImageNet task, or any sort of classification task. It's a nice test bed. There are many images that are really easy and there are many images that are really difficult.
When we train, for instance, a ViT or a CNN to do this task, it has to nest all of that reasoning in the same space. It has to put all of its decision-making process, for a very simple obvious cat versus some complex, weird, underrepresented class in that dataset, and nest it all in parallel, in a way where we get to the last layer and then we classify. I think breaking that down, where you have different points in time where you can say, now I'm done, I can stop, versus now I'm done, I can stop, lets you take a dataset or take a task and actually naturally segment it into its easy-to-difficult components. And I think we know that curriculum learning, and learning in this continuous sense, again, seems to be a good idea.
It's how humans learn. And if we can get at that architecturally and just have that fall out in a model, again, this seems like something worth exploring. I'm not sure if you know much about model calibration and how neural networks tend to be poorly calibrated.
Oh, go for it. Tell me.
It's a bit of an old finding, but if you train a neural network for long enough and it fits really, really well, and you've regularized it really, really well, you'll find that the model is uncalibrated, which essentially means that it is very certain about some classes where it's wrong and uncertain for some classes where it's correct. Essentially, what you want for a perfectly calibrated model is: if it predicts that this is the correct class with a probability of 50%, then 50% of the time you want it to be correct about that class, and so on and so forth. So for a well-calibrated model, if it's predicting a probability of 0.9 that it is a cat, then 90% of the time it should be correct. And it actually turns out that most models that you train for long enough get poorly calibrated. And there are loads of post hoc tricks for fixing this.
We measured the calibration of the CTM after training and it was nearly perfectly calibrated, which is, again, a little bit of a smoking gun that this actually seems to be probably a better way to do things.
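For readers who want to check calibration themselves, here is a generic sketch of expected calibration error, a standard way to quantify the gap Luke describes between confidence and accuracy. This is not the paper's evaluation code; shapes and names are assumptions.

```python
import torch

def expected_calibration_error(probs, labels, n_bins: int = 10):
    """
    probs:  (N, num_classes) predicted class probabilities
    labels: (N,) ground-truth class indices
    """
    confidences, predictions = probs.max(dim=-1)
    accuracies = predictions.eq(labels).float()

    ece = torch.zeros(())
    bin_edges = torch.linspace(0, 1, n_bins + 1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # Gap between mean confidence and accuracy in this bin,
            # weighted by the fraction of samples falling in the bin
            gap = (confidences[in_bin].mean() - accuracies[in_bin].mean()).abs()
            ece += in_bin.float().mean() * gap
    return ece   # 0 for a perfectly calibrated model
```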
The flavor of this kind of research is such that we didn't actually go out and try to create a very well-calibrated model. Right? And we didn't even try to create a model that was necessarily going to be able to do some kind of adaptive computation time. I was a very big fan of that paper. Yeah.
Adaptive Computation Time, that was Alex Graves, wasn't it? But that paper had a massive amount of hyperparameter sweeps in it, because in that paper, he needed to have a loss on the amount of computation that was being done. Because any time you try to do some sort of adaptive computation time research, what you're fighting is the fact that neural networks are greedy. Right? Because, obviously, the way to get the lowest loss is to use all the computation that you have access to.
So unless you had, like, an extra loss with a penalty that said, okay, actually, you're not allowed to use all the computation, and a very, very carefully balanced loss at that, you didn't get the interesting dynamic computation time behavior falling out of the model in that paper. But what was really gratifying to see with the Continuous Thought Machine is that, because of the way that we set up the loss that Luke described earlier, adaptive computation time seems to just fall out naturally.
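For contrast, this is roughly what the ACT-style setup being described looks like: an explicit ponder-cost term with a coefficient that has to be balanced carefully, which the CTM's two-point loss (sketched earlier) does not need. The names and the exact form of the ponder cost are illustrative assumptions, not Graves' implementation.

```python
import torch

def act_style_loss(task_loss, ponder_cost, tau: float = 0.01):
    """
    task_loss:   scalar task loss (e.g. cross-entropy)
    ponder_cost: scalar measuring how much computation was used
                 (in ACT, roughly the number of pondering steps plus a remainder)
    tau:         penalty weight; too small and the model uses all its compute,
                 too large and it stops thinking too early
    """
    return task_loss + tau * ponder_cost
```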
So that's more the way that I think research should go. Okay? Because we don't actually have, like, a specific goal, a specific problem we're trying to fix, or something we're trying to invent. It's more that we have this interesting architecture and we're just following the gradients of interestingness.
Yes. And on that point, I think maybe the most exciting thing about your paper is, you know, we were talking about path dependence and having this understanding which is built step by step, this process of complexification. And, I mean, maybe this is apropos in the theme of world models in general, and also active inference. And I say active inference in big quotes because it's not Karl Friston's active inference, you know; maybe adaptive inference or something like that. But we wanna build agents that can continue to learn, that can update their parameters, and, most importantly, can construct path-dependent understanding.
Because that's completely different to just understanding what the thing is. How you got there is very important. And this architecture potentially allows these agents, using this algorithm, to explore trajectories in spaces, find the best trajectories, and actually construct an understanding which carves the world up at the joints.
Yeah. That's a really neat perspective. I haven't actually thought about it like that. But yes, I think that particular stance becomes really interesting when you think about ambiguous problems, because carving the world up in one way can be as performant as carving it up in another way.
You know, perhaps hallucination in language models is carving the world up in some fine way; it's just not performant in our measure, where we say this is hallucination and actually that's not true. But in some other trace down the path of wanting to carve the world up, through autoregressive generation of tokens, you end up with a different carve-up of that world. And being able to train a model that can be implicitly aware of the fact that it is actually carving up the world in a different way, and can explore those descents down the carve-up, is something that we're after. I think it's quite an exciting approach to take this stance of: let's break up this problem into small solvable parts and learn to do it like that. And how can we do this in a natural way without too many hacks?
Yeah. It's something I've been thinking about, because for Chollet, as much as I love his measure of intelligence ideas, adapting to novelty is getting the right answer. And the reason why you gave that answer is very, very important. And in machine learning, we have this problem that we come up with this kind of cost function that rather leads to this shortcut problem. Now, we could just build a symbolic system, we could go full GOFAI, and we could say, okay, we need to do this principled kind of construction of knowledge maintaining semantics.
Well, we're not doing that. We're doing a hybrid system. But there must be some natural way of doing reasoning where, in spite of the end objective being this cost function, because of the way that we traversed these open-ended spaces, we can actually have more confidence mechanistically that we're doing reasoning which is aligned to the world.
I think that's a great way of seeing this particular avenue of research. And, obviously, we're not the only people thinking like this, and we're not the only ones trying to do this. What we have is an architecture that's amenable to it. And surprisingly so; again, it wasn't the goal. It's not the goal to do this type of research.
It's not the goal to be able to break the world down into these small chunks that we can actually reason over in a way that seems natural. Instead, what we did was pay respect to the brain, pay respect to nature, and say, well, if we build these inspired things, what actually happens? What different ways of approaching a problem emerge? And then, when those different ways of approaching a problem emerge, what big philosophical and intelligence-based questions can we then start to ask? And that's where we're at right now.
So it might feel, at times, especially for me, like there are too many questions and too few hands to answer those questions. But I think the fun and exciting thing, and the thing I can try to encourage in other younger researchers out there, is, you know, do what you're passionate about, figure out how to build the things that you care about, and then see what that does, see what doors that opens up, and see how to explore deeper into those domains.
We were talking about this yesterday, weren't we? That you can think of language as being a kind of maze. Yes. Like, what is to stop us from taking this architecture and building the next generation language model with it?
I mean, that's honestly, as you know, something that I am actively trying to explore right now. And, yeah, I think the maze task gets really interesting when you add ambiguity to it, when there are many ways to solve the maze. Honestly, this isn't something I've tried yet, and maybe it's something I should try next week. But essentially you can imagine an agent, or the CTM in this case, observing the maze and taking a trajectory. And surprisingly, we saw this.
We have a section in our recently updated paper on arXiv, the final camera-ready version of this paper, where we added an extra supplementary section that is not in the main technical report. And that supplementary section is basically: hey, we saw this cool stuff happen. We list, I think, 14 different interesting things that happened while we were doing the research that obviously didn't make it into the paper, but we wanted people to know about these strange things that happened. And this is one of the strange things, where we watched what was happening during training. At some time during training, maybe halfway through the training run, we could see that what the model would do is start going down one path in the maze, and then suddenly it would realize, oh, no.
Damn. I'm wrong. And it would backtrack and then take another path. But eventually it gets really good, and it does some sort of distributed learning in this, because it's got an attention mechanism with multiple heads, so it can actually figure out how to do this pretty well and refine its solution. But sometime early on in the learning, it descends multiple paths and comes back and backtracks.
We have a really fascinating set of experiments that also showed, and we actually have some supplementary material online showing this, where, and I don't really know what this says, it's kind of a deep philosophical thing, but if you're trying to solve a maze and you don't have enough time, it turns out that there's a faster algorithm to do it. And this blew my mind when I saw it. So if we constrain the amount of thinking time that the model has, but still get it to try to solve a long maze, instead of tracing out that maze, what it does is it quickly jumps ahead to approximately where it needs to be and traces backwards.
And it fills in that path backwards. And then it jumps forward again, leapfrogs over the top and traces that section backwards, then leapfrogs again. And it does this fascinating leapfrogging behavior that is based on the constraint of the system. And again, this is just an observation we made, and what that means in a deep sense, how it's related to giving a model time to think versus not, and what different algorithms the model learns when you constrain it in this way, I find that quite fascinating and an interesting thing to explore. Does it tell us something about how humans think?
Does it tell us something about how we think under constrained settings versus open ended settings? There's a number of cool questions you can ask on this front.
You guys are both huge fans of population methods and collective intelligence, and we can scale this thing up and we can scale it out. What would it mean to scale this thing out? Not just in a kind of, what do they call it, trivial parallelization, but in terms of having some kind of weight sharing between parallel models and so on. What would that give you, potentially?
This is a fun area of research. So one of the active things that we're trying to explore in our team is concepts of memory, long-term memory, and what this means for a system like this. So an experiment that one can construct, for instance, is to put some agents in a maze and let them try to solve it, not how we did it in the paper, but in a very constrained setting where an agent can only see maybe a five-by-five region around it. And we give that agent some mechanism for saving and retrieving memories. And the task, if you wish, is to solve that maze, find your way to the end.
And the model needs to learn how to construct memories such that it can get back to a point it's seen before and know, I did the wrong thing last time, and go a different route. And you can then do this with parallel agents in the same maze with a shared memory structure and see what actually happens when they can all access that memory structure and have a shared global, almost like a cultural memory, that they can use to solve this global task, with many agents all drawing on that memory system. And I do think that memory is going to be a very key element of what we need to do in the future for AI in general.
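None of the following is Sakana's actual code; purely as a sketch of the kind of setup being described, with every name hypothetical, a toy version might give each agent a five-by-five local view of the maze and one shared record of where any agent has already been:

```python
# Hypothetical sketch of the setup described above, not the team's real code:
# each agent sees only a 5x5 window of the maze, and all agents read and write
# one shared memory, here simply the set of cells any agent has visited.
import random

class SharedMemory:
    """Crude stand-in for a learned memory store shared by all agents."""
    def __init__(self):
        self.visited = set()

    def save(self, pos):
        self.visited.add(pos)

    def seen(self, pos):
        return pos in self.visited

class Agent:
    def __init__(self, maze, start, memory):
        self.maze, self.pos, self.memory = maze, start, memory
        memory.save(start)

    def observe(self):
        """The agent's 5x5 local view; cells outside the maze count as walls.
        The random policy below doesn't use it, but it shows the constraint."""
        r, c = self.pos
        return [[self.maze.get((r + dr, c + dc), 1)
                 for dc in range(-2, 3)] for dr in range(-2, 3)]

    def step(self):
        r, c = self.pos
        moves = [m for m in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
                 if self.maze.get(m) == 0]          # 0 = open cell, 1 = wall
        # Prefer cells *no* agent has explored yet: a shared, "cultural" memory.
        fresh = [m for m in moves if not self.memory.seen(m)]
        self.pos = random.choice(fresh or moves)
        self.memory.save(self.pos)

# Usage: two agents jointly covering one small open maze via shared memory.
maze = {(r, c): 0 for r in range(5) for c in range(5)}
memory = SharedMemory()
agents = [Agent(maze, (0, 0), memory), Agent(maze, (4, 4), memory)]
for _ in range(30):
    for agent in agents:
        agent.step()
print(f"cells explored collectively: {len(memory.visited)} / {len(maze)}")
```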
So the subject of reasoning came up just a second ago, and I think there's a perception that recently we've made a lot of progress in reasoning. Right? Because it's actually one of the main things that I think people are working on. We released a dataset recently called Sudoku-Bench, and I was actually quite happy to see it come up organically on your podcast a few weeks ago.
Chris Moore.
Right. Yes. So I wanted to tell you a little bit about this benchmark, because I think I've had a little bit of an issue promoting it: it doesn't, on the surface, sound particularly interesting, because Sudoku has a sort of feeling that it's already been solved. Right?
So how interesting can a collection of Sudokus be for reasoning, exactly? We're not talking about normal Sudokus. We're talking about variant Sudokus. And what variant Sudokus are, usually, is normal Sudokus. Right?
So put the numbers one to nine in the row, the column, and the box. But then literally any additional rules on top of that. And they're all handcrafted. They all have extremely different constraints. Constraints that actually require very strong natural language understanding.
So for example, there's one puzzle in the dataset where it tells you the constraints of the puzzle in natural language and then says, oh, by the way, one of the numbers in that description is wrong. Right? So you have to be able to meta-reason about the rules themselves even before you start solving the puzzle. There are other puzzles where you have a maze overlaid on the Sudoku, and a rat has to work out a way through the maze by following a path to the cheese, but there are constraints on the path that it takes, like what the numbers on it can be and what they add up to. It's difficult to really describe how varied these variant Sudokus are.
And I think they're so varied that if anyone were actually able to beat our benchmark, they would necessarily have had to create an extremely powerful reasoning system. Right now, the best models get around 15%, and that's only on the very, very simplest and smallest Sudoku puzzles in the set. We're gonna be putting out a blog post about GPT-5's performance, and it is a jump, but it's still completely unable to solve puzzles which, you know, humans can solve. And what I really like about this dataset, and what was actually the catalyst for me creating it in the first place, was a quote by Andrej Karpathy saying, okay, so we have all this data. It's from the Internet.
But what do you really want? Right? If you wanted AGI, you wouldn't want all of the text that humans have ever created. You would actually want the thought traces in their heads as they were creating the text. Right? If you could actually learn from that, then you would get something really powerful.
And I thought to myself, well, that data must exist somewhere. My first thought was maybe philosophy, like, you know, there's a type of philosophy where you just write down your thoughts without thinking, like a stream of consciousness. I thought maybe that could work. But then, when I wasn't thinking about it and I was, you know, in my leisure time, I was watching a YouTube channel called Cracking the Cryptic. Yes. Where these two British gentlemen will solve these extremely difficult Sudoku puzzles for you.
Right? Sometimes their videos are four hours long, and they're professionals. Like, this is their job. And what was perfect, I realized, is that they tell you in agonizing detail exactly what reasoning they used to solve those particular puzzles. Right?
So we, with their permission, took all of their videos, which represent thousands of hours of very high-quality human reasoning, like thought traces, and scraped them and made that available for imitation learning. Right? We did try to do this internally. Turns out that I did a little bit too good a job of creating a very difficult benchmark. Right?
So we're still trying to get that stuff working. We'll publish it if we have some success. Yeah. I wanna really sell the fact that this reasoning benchmark really is different. Right?
Not only do you get something that's super grounded, like, you know exactly if it's right or wrong, so you can do RL to your heart's content, but you can't generalize very easily. Each puzzle is deliberately designed by hand to have a new and unique twist on the rules, called a break-in, that you have to understand. And right now, despite all the progress we've made, the current AI models can't take that leap. They can't find these break-ins. Right?
They'll fall back to, okay, I'll try... no, I'll try five. I'll try six. I'll try seven. Right?
The reasoning becomes really boring and nothing like what you see in the transcripts that we've open-sourced from this YouTube channel. So I just wanna put the challenge out there, right, that this is a really difficult benchmark, and I think progress on this benchmark will really mean progress in AI generally.
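To make the "grounded" point concrete: under the base Sudoku rules, a proposed grid either satisfies every constraint or it doesn't, so a reward signal is trivial to compute (the handcrafted variant rules in Sudoku-Bench add further checks on top, and the benchmark's official tooling is not shown here). A minimal, generic checker might look like this:

```python
# Minimal, generic checker for the *base* Sudoku rules only: digits 1-9 must
# appear exactly once in every row, column, and 3x3 box. Variant puzzles in
# Sudoku-Bench layer extra handcrafted constraints on top of this; those are
# not modelled here. This is a sketch, not the benchmark's official scorer.
def is_valid_solution(grid):
    """grid: 9x9 list of lists of ints. True iff the base Sudoku rules hold."""
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(grid[r][c] for r in range(9)) for c in range(9)]
    boxes = [set(grid[br + r][bc + c] for r in range(3) for c in range(3))
             for br in (0, 3, 6) for bc in (0, 3, 6)]
    return all(unit == digits for unit in rows + cols + boxes)

# A binary, fully grounded reward for RL: 1 if the proposed grid is correct.
def reward(proposed_grid):
    return 1.0 if is_valid_solution(proposed_grid) else 0.0
```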
Could you reflect a bit? So after watching this Cracking the Cryptic YouTube channel, how diverse were the patterns? Because Chris was saying to me, oh, you know, these guys, they go on Discord servers. They get these creative, crazy ideas. And I'm obsessed.
Maybe I'm just being idealistic, but I love this idea of there being a deductive closure of knowledge. Right? That there's this big tree of reasoning, and we're all in possession of different parts of the tree to different depths. So the smarter and the more knowledgeable you are, the deeper down the tree you go. But in this idealized form, there is one tree, and all knowledge kind of, you know, originates or emanates from these abstract principles.
And we could, in principle, build reasoning engines that could just reason from first principles. And it might be computationally irreducible. So you'd say you have to perform all of the steps. And it feels like because we're not in possession of the full tree, what we need to do is kind of fish around. We fish around to find the Lego blocks.
Oh, that's a good Lego block. I can apply that to this problem. And maybe that's just what we need to do in AI for the time being: we need to just acquire as much of the tree as possible. But could we just do it all the way down?
Yeah. Fascinating question. That tree is probably massive. Right? And as a human is solving these puzzles, they're definitely learning in real time and discovering new parts of this tree.
And it's sort of a meta-task. Right? Because it's not just reasoning. You're reasoning about the reasoning. And I don't think we have that in AI right now.
Because if you watch the videos, they'll say something like, okay, this looks like a parity task, or this is a set-theoretic problem, or, you know, maybe I should get my path tool out and trace this around. And, of course, the professionals do have this already massive collection of reasoning Lego blocks, as you say, in their heads. So they'll recognize, okay, that type of rule usually needs this kind of Lego block.
It's actually fascinating to watch how good they are at just intuitively knowing where to look, whereas someone like me, who hasn't solved as many, needs to spend a lot of time looking around, like, okay, maybe I should try this or maybe I'll try this one. But even they're not perfect, so you can watch them take a certain kind of reasoning and start building up, okay, maybe we should solve it like this, and then go, no.
That doesn't disambiguate it enough, and then backtrack and go down another path. Again, something that we do not see current AIs doing when they're trying to solve this benchmark.
The tree is very big. And I guess the phylogenetic distance between many of these motifs in the tree is just so large. So it's so difficult to jump between them. And I think that's why, as a collective intelligence, we work so well together, because we actually find ways to jump to different parts of the tree. Right.
And I think that's probably why the current state of the RL algorithms that we're trying to apply to this just isn't working. Because in order to learn how to get these breakthroughs, to understand the sort of nuanced reasoning needed to crack these puzzles, you have to sample them. And it's such a rare space. You know? It's such a specific kind of reasoning that's required to get to the specific breakthrough that this kind of technique doesn't work.
Right? And there's definitely a feeling in the community like, okay. This is how you just solve things now. Like, we have RL. Yes.
We can get these language models to do what we want. It doesn't work for this dataset.
Guys, it's been an absolute honor having you on the show. Just before we go, are you hiring? Because we've got a great audience of ML engineers and scientists, and I think working for Sakana would be the dream job.
That's very kind of you. Yes. We are definitely hiring. And as I said earlier in this interview, I honestly want to give people as much research freedom as possible. I'm willing to make that bet.
Right? I think things that are very interesting will come out of this, and I think we've already seen plenty of interesting things coming out of this. So if you want to work on what you think is interesting and important, come to Japan.
And Japan just happens to be the most civilized culture in the world.
Alright.
It might be the opportunity of a lifetime, folks. So, yeah, get in touch. Guys, seriously, thank you so much. It's been an honor having you both on the show.
Thank you very much.
Thank you so much. It's been great.
"I Co-Invented the Transformer. Now I'm Replacing It." & Continuous Thought Machines - Llion Jones and Luke Darlow [Sakana AI]