What if everything we think we know about AI understanding is wrong? Is compression the key to intelligence? Or is there something more—a leap from memorization to true abstraction? In this fascinatin...
In the past ten years, I think the question of intelligence, or artificial intelligence, has captured people's imagination. I'm one of them, but it took me about ten years to really understand whether we can make understanding intelligence a truly scientific or mathematical problem, to formalize it. You will probably get some of my opinions and also the facts about it, and it will probably change your view of what intelligence is, which has also been a very soul-searching process for me. How do we clarify some of the common misunderstandings about intelligence?
Through this journey, maybe we'll gain an entirely new view of what we have really done in the past ten years of the practice of artificial intelligence: the mechanisms we have implemented, what the deep networks and the large models truly are, their true nature, and hence their limitations, and also what it takes to truly build a system that has intelligent behaviors or capabilities. I think we have reached the point where we can address what's next for understanding even more advanced forms of intelligence. What's the difference between compression and abstraction? The difference between memorization and understanding? I think, for the future, those are the big open problems for all of us to study.
Professor Ma, it's amazing to have you on MLST. Welcome.
Thank you for having me. So normally, I
ask guests to introduce themselves, but given your stature in the field, I think it's best that I give you an introduction. So Yi Ma
is a world leading expert in deep learning and artificial intelligence.
He's the inaugural director of the School of Computing and Data Science and director of the Institute of Data Science at the University of Hong Kong. He's also a visiting professor at UC Berkeley, where he previously served as a full professor in electrical engineering and computer science. He's an IEEE Fellow, ACM Fellow, and SIAM Fellow whose pioneering work on sparse representation and low-rank structures has fundamentally shaped modern computer vision and machine learning. His recently published book, Learning Deep Representations of Data Distributions, proposes a mathematical theory of intelligence built on two principles, parsimony and self consistency. This framework has led to white-box transformers known as CRATE architectures, where every component can be derived from first principles rather than empirical guesswork.
So, professor Ma, tell me about your book.
You know, about seven or eight years ago, deep networks, deep learning, had pretty much become the practice of machine learning and artificial intelligence over the past decade. About eight years ago, I had a chance to get back to Berkeley, and that gave me a chance to look into these topics more deeply, to try to understand them from a more principled approach. Hence the book is a kind of summary of the progress we've made over the past eight or so years, myself, my group, as well as many colleagues, trying to understand the principles behind deep networks and explain them from first principles. And in that journey, we also seem to have gone a little bit beyond that and found something probably more general behind it, which is intelligence, at a certain level of intelligence. Hence, when I got back to join Hong Kong U about two years ago, I had a chance to design or redesign some of the curriculum to reflect some of this progress, the rapid progress in our field. So my students and colleagues decided that maybe it's time to systematically organize this body of knowledge and reflect it in a textbook, as well as a new course, which I'm teaching this semester.
And it will likely be offered at Berkeley next semester as well. So this is actually probably the first time we have tried to provide a more principled approach to explain deep networks, as well as some principles of intelligence.
And these principles are parsimony and self consistency. So it's an ambitious idea that these principles could explain natural and artificial intelligence. What do you mean by that?
Intelligence, artificial or natural, or whatever adjective you add to intelligence, we have to be very specific. It's a very loaded word, right? I mean, even intelligence itself may have different levels, different stages, right? So it's high time we clarify that concept scientifically and mathematically, so that we'd be able to talk about and study intelligence, and the mechanism behind it, at each level.
There are some unified principles behind even the different stages of intelligence. There's something in common. There are also things that are different. So, it's high time we do that. One level of intelligence is the one that is common to animals and humans, right?
Humans, we are animals. That level of intelligence is what we think is very common to all life, which is memory: how we learn knowledge about the external world, then memorize it as part of our memory, and use that to predict, to react to the world, to help us make decisions, to predict and make better decisions for survival, and so on and so forth. That's very, very common. And this is very much the level of intelligence we're talking about in the book as well, right? And hence, for this level of intelligence, how our memory works, today we also have a fancy word for memory.
We call it a world model, right? And it's how we develop such a memory, such a world model, how the model evolves, and how we use it. That's the level we talk about. So we actually believe that for this level of intelligence, for how our memory is formed and how it works, precisely these two principles are incredibly important. And we believe they are necessary: memory or knowledge is precisely trying to discover what's predictable about the world.
Hence, everything we understand, all such information, intrinsically has a very low degree of freedom. We call these low dimensional structures. And hence, the way to pursue such knowledge is precisely by trying to find the simplest representation of the data. So compression, denoising, dimension reduction are actually all just different words for pursuing such knowledge, such structure. And that's what is captured by the word parsimony.
Finding, explaining, making things as simple as possible, but not any simpler, right? That is the sentence Einstein used to describe science. Actually, that is also what intelligence, at least at this level, is doing, precisely the same thing. The second part of the sentence, not any simpler, precisely means consistency. Make sure your memory is actually consistent, able to recreate, to simulate the world just right, not any simpler.
If it is any simpler, you may lose part of the predictability, and also the ability to predict well. So those two actually coexist, we believe. These two principles, parsimony and consistency, or self consistency, are the two characteristics of how our memory works.
So we want to have understanding which carves the world at the joints, which represents the important invariances in the world. And the thesis is, I think, that compression might be necessary for understanding. My possible concern with that is that what we are doing with machine learning is representing extant examples of a long phylogenetic tree of evolution.
Mhmm.
So to what extent does knowing their representation now help us? Do we also need to know how they evolved and where they might go in the future?
The process of acquiring knowledge, of gaining information about the outside world, is compression. Find what is compressible. What has order? What phenomena have order, have low dimensional structures that allow us to predict, to rule out variability, to predict the world tomorrow, or predict the world better, in essence.
So in a sense, that is the ability we believe is really what intelligence is all about, at least the common intelligence we're talking about, right? We can talk about higher levels of intelligence later. And if you look at the history of life, how life developed: we know the mechanism, the laws that govern the physical world, and we call that physics, right? But what is the mechanism that governs the evolution of life?
I think it's intelligence, right? Even in the process you mentioned, through evolution life evolves, and precisely it learns more and more knowledge about the world and encodes it in DNA to pass on to the next generation. That is compression; it's a process of compressing the knowledge learned about the world into our DNA. But the mechanism for updating it is very brutal, very brute force: random mutation and natural selection. Yes, it does evolve.
It does advance, but at a huge cost of resources and time, and it's also very unpredictable. If you're acute, you probably observe there's some similarity with how the current big models evolve, right? Many, many groups try without principle, by trial and error, empirically. And the lucky ones survive and get adopted everywhere and become very, very popular, right? They dominate the practice.
So in a sense you can make an analogy, right? Students ask me at which stage our artificial intelligence is today. There's already an analogy in nature, right? We are very much at the early stage of the life form. And so, that is a compression process.
It's a process that also gains knowledge about the world. But of course, later on, individual animals developed brains, developed neural systems, developed senses, including vision and touch and so on. So we start to use a very different mechanism to learn, to compress our observations, to learn knowledge, and to build memories of the external world. Even individuals start to have that ability, rather than just inheriting knowledge from their DNA. So those are different stages, and that part of the knowledge is no longer encoded in our genetics, in our genes, but in our brains.
And that's actually the level of intelligence we talk about most of the time these days, the one that is common to animals and to humans: the knowledge or the intelligence we talk about when we talk about brain function.
Yeah, I mean, I think we would definitely agree with the statement that intelligence as a system produces artifacts. So Chollet's example is a road building network. It produces roads, and the system has adaptivity because it can create new routes where they weren't there before. And then there's the question of, well, there are many ways to compress a thing.
Uh-huh.
So some ways of compression represent the world at a deep abstract level, and some don't. So we might argue that LLMs today, even though they do compress the data, they only compress it in a superficially semantic way.
Mhmm.
And then there's this notion of, well, maybe we agree that intelligence is about the synthesis of new knowledge. So it's the acquisition of new knowledge. But we can only do that if the knowledge we already have represents the world at a deep abstract level. So rather than it being random mutations as in evolution, it's very, very structured, because the processes are physically instantiated, which means rather than just doing something completely random, it's guided by the process which created them.

There are a lot of confusions about what knowledge is, and also about what the process to gain knowledge is.
Right? For example, many people ask, what are all these language models doing with language? Which is, by the way, very different. Don't forget, our language is itself the result of compression. Our language is precisely the code we learned to represent the knowledge we gained through our physical senses about the external world, over billions of years of evolution, as our brains evolved. It's a result of that.
That is what language actually represents: knowledge. And hence language, the written text, is what we use to encode the knowledge common to all people. Now, we're using another model, another compression process, to memorize it. In a sense, you can argue that what those large language models are doing is treating that text as raw signal and, through compression, identifying its statistical structures, its internal structures. What that actually achieves is not very clear.
Maybe it will just help us memorize the text as it is, and regenerate it, right? It's not going through the process by which our natural language was actually developed, which was very long: our language is grounded in our physical senses, in the world model we know as memory, right? Our language precisely tries to describe that. It's an abstraction of the world model we have in our brain. It's a small fraction of the knowledge that we're sharing with each other, right?
Far smaller than the actual model we have. We have many things in our senses, in our memory, that there's no way we can express in words, right? Don't forget, only a small fraction of our brain processes natural language.
The majority is processing visual and motion sensor data. So that says something about what the role of language really is. Now, people very much confuse the processes. With a large language model, we reprocess natural language, that is, the knowledge as we know it, the human knowledge common to human society, and we treat this fraction of knowledge very much as if it were raw data, right?
Hence, you can see that even the mechanism, transformers or whatever architecture we use, is the same one we use to reprocess raw data, for example videos, visual data; we're treating language as if it were visual data. We tend to confuse the mechanism by which we memorize or extract knowledge from our senses with what a large language model is doing, and conclude that the large language model is actually understanding the written text. So there is a very fundamental difference, right?
Using that process to compress raw data, to extract statistical correlations, to extract low dimensional structures, works for the senses, right, to build knowledge. But we're applying it to our knowledge, to our natural language, and we pretend that is understanding. It's probably not.
You cited Max Bennett in your talk. By the way, folks at home, you should watch Professor Ma's talk on intelligence and some of the ideas in his book, and I'll put a link in the video description. But you cited Max Bennett, and I read his book, A Brief History of Intelligence. And he also had this really interesting idea that language is basically a set of pointers, and we're actually sharing simulations and second order simulations. And in a sense, those are just pointers to the simulations, and that's where a lot of the semantic content is.
But you also spoke about levels of intelligence. So it's a wonderful idea, isn't it, that we have this phylogenetic information accumulation, which is very slow, you know, every single physical generation. And then we have this ontogenetic accumulation in our lifetimes, and then we have social accumulation. So we have a big hard drive in the sky where we kind of store these pointers. And then we also do science, which is very abstract, where we, well, we just hypothesize about things.
And what's very interesting about science is, you know, there's always been this division between empiricism and rationalism. So empiricism says if the idea is in the data, then it's in my mind. And this other idea is that sometimes we just conjure things which are not in the data. So that science thing is very interesting.
Exactly. You just mentioned the four stages that we elaborate a little bit in the book, to help people. Those four stages have something in common. What's common is precisely trying to extract structures from the data and record them, record what's predictable, through compression, through denoising, through dimensionality reduction, capturing the correlations in the signals. And we use that for predictability, for prediction, and so on.
That's what's in common. Hence, for all four stages of intelligence, the principle of parsimony and the principle of self consistency, or consistency, are at work. However, through different mechanisms, right? Different code books, different mechanisms for updating, for optimizing the information, or even for acquiring the information. That's actually very important to know.
Now, the main thing you mentioned, which is also a point of confusion, is precisely that for animals and humans, and even for society at the stage before science appeared, almost all the knowledge we gained was through an empirical approach. Kind of passive. We observe. We try, and we make some errors.
We learn from mistakes, and then we record that, right? For example, Chinese medicine, Indian medicine: for many, many years it worked, right? It's very similar, it even became similar to how DNA evolves. We search in a less organized way, somewhat by chance, some by accident, and we accumulate that knowledge, and it becomes very, very useful. We understood how the weather changes, how the planets move, in a very empirical way, for many, many years.
And we share that. We write it down in language, in text, and pass that knowledge down to the next generation, similar to what DNA does with the knowledge it learned through a different process. Now, there's a huge transition. I place it around 3,000 years ago. We don't know what happened, to be honest, right?
Maybe the whole process of acquiring empirical knowledge is just compression. But suddenly, somehow, we were able to do abstraction, right? We started to develop knowledge that is far beyond empirical observation. Think about, for example, the notion of numbers, natural numbers. We count up to 500, but suddenly we realize this process goes to infinity, right?
Kids start to do that, right? When they're in middle school, most people start to get that point. And there's the amazing thing about, you know, six thousand years ago, when geometry was being formulated. One of the assumptions, I don't know if people know this, is not really empirical: it says two parallel lines never intersect. The word 'never' actually implies infinity, something you could never observe empirically, right?
How did we come up with that idea? What happened to our brain? We started to jump from compressing empirical knowledge, identifying correlations, to something we could actually formalize, something abstract, right? Hence, towards the last chapter of my book, I ask: is there a difference? Is abstraction just compression, or is there something more?
Probably compression is not all there is, right? Abstraction is definitely related to compression, but there seems to be something different, something more. You know, Karl Popper, the famous philosopher of science, said that science is the art of oversimplification, right? The ability to abstract lets us hypothesize things. And there is a distinction between hallucinating and hypothesizing, I believe.
But what is it? We know that through compression we can memorize, we can learn the data distribution. We can find a very good representation. We can even use that representation to regenerate data with the same distribution.
We call that memorization. But is there a difference between memorizing a data distribution and understanding it? We can emulate how we conduct logical deduction, right? Just like now we can supervised fine tune, or use chain of thought, or use reinforcement learning to force the large language model to emulate, to memorize, how we solve logical problems, how we solve mathematical problems. But is that solution based upon understanding logic, understanding the necessity of logic, mastering the mechanism of logic and applying it, or is it just emulating the process?
We don't know. So we have a lot of questions now: what is the difference between compression and abstraction? What is the difference between memorizing and understanding? It's kind of similar to when Turing was faced with the question of what is computable and what is not computable. We know there is a difference, but how do you crystallize that?
Or, as we now ask, is P equal to NP? Right? We know there might be a distinction. Can we formalize that question? If we believe there's no difference, prove it.
Or if there is a difference, how can we qualitatively or quantitatively see what lies beyond compression that will take us to the level of abstraction? Hence, I believe this is what we might call a phase transition, right? Going from developing empirical knowledge to scientific knowledge. And what is the distinction? To me, sometimes I call that last stage of intelligence the true artificial intelligence.
If people ever bother to read the proposal laid out by the folks in 1956, you'll find that that's actually the level of intelligence they truly meant to work on. But yet, from all we understand about the practice in the past decade, we are very much reproducing the kind of mechanism at the level of memory, of how empirical memory is formed. In fact, I believe that even the large language models are precisely memorizing a large volume of text using the same mechanism by which we form empirical knowledge, memorizing the knowledge encoded in natural language. Whether that is equivalent to understanding, that's a big question mark.
It is tantalizing, isn't it, thinking about where these new theories that don't seem to come from the data actually come from. You could be a Platonist and say they are just a gift from God, or a nativist and say somehow they're in our brain. Or maybe we could subscribe to the idea that there is a kind of deductive tree. Right? So there is the tree of all possible conceivable knowledge, and that represents our cognitive horizon.
And if only we could build systems that could acquire that tree very abstractly, and if we could design a compositional system that could creatively explore that tree, then somewhere in that tree, we would be able to discover these abstract theorems. But then there's the question of, well, why don't large language models do this now? So we see the ARC challenge, for example. And what we see is that models are very, very bad at doing abstract compositional reasoning, where you need to take abstract things and combine them together to adapt to novelty. They don't do that very well.
And in a sense, one kind of optimistic view is that these models are learning lots of factored representations, and it is conceivable that something like a large language model could do that. But the other school of thought is that it's just not possible because they don't have abstract enough understanding. What do you think?
I believe, at least from our current understanding of what large language models, at least the current architectures, are doing: they are precisely using the same mechanism by which we extract empirical knowledge from data, the correlations, the low dimensional structures within the data, to memorize natural language, which to some extent represents the knowledge we have. So the large language model is precisely using the same process by which we acquire our empirical knowledge to process a large volume of natural language. From the mechanism alone, I don't think it actually needs to understand anything. But in order to make that statement conclusive, you need to know what the extra thing is, which leads to your question, right?
What do we mean by understanding? What do I mean by really having that deductive structure, right? This is a big question, to be honest. I think that's a question the scientific community truly needs to answer now, or at least to address, to discuss now. What do we mean by that?
Right. So, you know, in modern science there have always been two schools of process that allow us to propel science forward. One is inductive: we observe experimentally. The other is deductive: once we have accumulated enough inductive observations, experiments, empirical observations, we start to make hypotheses, assumptions, axioms, and then we go through a very rigorous deductive process to derive the implications of those assumptions.
Then we reach conclusions that are testable, that are actually measurable. Then through verification, through experiment, we can measure, and we can either verify or falsify the original assumptions, right? That's a very powerful process. But that process relies precisely on having a very rigorous logical deductive system. Without that, we cannot overthrow the original assumptions, right?
So, that is a very fundamental, deep, abstract process. The way we have that in our brain: we can argue whether or not animals have it, or at what point we reached the critical moment where we were able to develop that ability. What did our brain evolve into? Maybe it went through some phase transition in our brain structure that allows us to identify those structures, Platonic or whatever you call them, right?
And not only is the original scientist able to discover the logical, causal abstraction; other people also gain the ability to understand it. That is actually quite amazing if you think about it, right? Different human beings, communicating through language, somehow all develop at a certain stage the ability to understand, to be able to learn mathematics. Even if it was discovered somewhere else, I can reach a similar level of understanding as the original discoverer.
I may also be similarly convinced that the proof is rigorous, that the logical deduction is something necessary, right? It's not like natural language, which has ambiguity in it, right? So this is something up for debate. Is there really a god there, is there truly some knowledge out there, some ground truth there?
We don't know. But this ability, how do we reach that level of ability? I think that is truly the next thing we need to understand. What are the mechanisms, implementable, reproducible mechanisms, that allow us to build a system able to gain that kind of ability?
I think that will be the next stage of intelligence, so that we'll be able to have artificial systems reach the stage of an educated, enlightened human, beyond humans as just animals.
And do you believe in practice that's possible?
I truly believe that it is a part of our brain, or a function of it, that can be discovered, understood, and even reproduced. But for now, what exactly that mechanism is? I don't think we have much clue, right? We only know the after-the-fact results, just like your example. We know the road it built, right?
The road building company or network, they can build the road. So even today, we can learn logic, we can learn mathematics. But what is the mechanism that allows us to create a new mathematical theorem, create a new scientific theory, conduct logical deduction, and understand it? What is that mechanism?
We don't understand it. We only know the after-the-fact results, right? The logic makes sense. But why does it make sense? That's still open.
What is the mechanism in our brain that we share, such that all this deductive process makes sense to all of us? That is still quite unclear.
The reason I asked the question is that a lot of cognitive science folks say that we understand because we're causally embedded. And what they mean by that is we evolved with the world, which means all of the representations in our brain co-evolved with other things in the world, and it's deeply rooted. So the implication is that intelligence is quite specialized. You could say intelligence, oh, it's just the efficient search of the space of Turing machine algorithms. And, yeah, that would be correct, but it's describing the what, not the how. So it's almost trivial to say; it doesn't describe how we would implement intelligence.
Whereas, if intelligence is the acquisition of knowledge, it must be domain specific, right, because the road building company that we spoke about, they can't build any type of road. They are restricted in terms of the materials that they have access to, and presumably different types of knowledge are quite different to others. So an intelligent process might be able to acquire knowledge over here, but not over there. How specialized do you think intelligence is?
Remember, the road building company, that's just an analogy. It's a special case. That exactly echoes what we mentioned earlier. Intelligence has different stages, or even different forms, but there is always something in common, right? Even at the early stages: how DNA evolves, how our memory evolves, how human society evolves, or how our scientific knowledge evolves, right?
Those are different. You can think of them as different builders. Some are building roads. But the mechanisms are common. They're using something similar, for example, the concept of compression to discover.
The mechanism that's common behind all of this is to discover what is structured, what is not random. And that is common: under the principle of parsimony, through compression or dimension reduction or denoising, to discover those structures. That is common. Although the operational mechanism might be different, and the domain, the knowledge, the distribution it is applied to might be different. Some are discrete.
Some are continuous. Some have a higher intrinsic degree of freedom, some lower. Some are simpler. Some can be formulated as mathematical or physical equations.
Some cannot, right? But they still can be learned, can be compressed, can be memorized through other mediums: in DNA, in our neural networks, not in differential equations, right? So there are things that are very common. The mechanism and the purpose behind them are common. Even the principles are common.
But the realization, the physical realization, could be different. The optimization mechanism could be different. The code book we learn could be different, right? So this is something we need to understand. Once you understand that, it can help you understand a lot of things around us, right?
What is common behind all intelligent behaviors? And also, each stage or each form may have its own domain specific characteristics.
I know you're a big fan of cybernetics, which came from the forties, with Norbert Wiener. And it described this cybernetic action loop, which is an agent which senses and acts in a closed feedback loop. And, based on what you just said, maybe the mechanism of intelligence is the same, but certainly the action space is different. So different embodied agents in this situation would be able to do a, b, and c here and something different over there. So could it be at least specialized in the action domain?
Those scientists, those pioneers of the 1940s, were interested in intelligence. But the level of intelligence, I think their interest was mostly at the animal level and the human level. It's about how our brain works. It's not about DNA; although they made some analogies, they very much focused on how our brain acquires memory, builds a world model quickly through perception and action, interacting with the world to predict, make mistakes, and learn from that process, right?
So, I think that's the stage they were studying. They may not quite have gotten to the advanced stage we just talked about, science. They cared mostly about how our memory works, at the animal level, in order to build autonomous systems, autonomous machines emulating our ability at that level, right? Hence we call it the cybernetics program. By the way, my point is that there are two things.
Of course, the word cybernetics was abused a little bit later on. Just like artificial intelligence, it became a very bad word for a certain period of time, right? And also, many people understand cybernetics in a very narrow-minded way. They think it's just about control. Actually, it's not.
If you read the book by Norbert Wiener, it actually characterizes at least the necessary characteristics that an intelligent system, a system at the animal level, should have. How to record information? That leads to information theory. How do we correct errors? That's feedback control. How do we improve our decision making?
He actually mentioned game theory, right? And so on and so forth. He even discussed the necessity of addressing nonlinearity, which actually explains why our brain has waves and so on. It's very clear he was interested in what the necessary characteristics are for a system to be intelligent. Although he may have fallen short on how they should be put together, he definitely captured some of the essential characteristics an autonomous intelligent system should have.
Which, surprisingly, got forgotten in our past decade of practice of building artificial intelligence systems, right? And don't forget, those are necessary characteristics that at least those pioneers were convinced intelligence should have. So this is something where I think we should probably learn a little lesson from our history.
Can you sketch out the journey from information theory to your maximal coding rate reduction framework?
This is actually a very interesting question. Honestly, to be serious: I was trained as a control theorist. But early on, when I was a graduate student, I wanted to do communication. I took a lot of communication, random process, and information theory courses, although I didn't end up doing that. So for many years, this was something I never really practiced.
Until, about a few years ago, I was studying what the network is doing, and also came around to what intelligence is all about. I realized that maybe the common mathematical problem behind intelligence, at least behind learning knowledge at the level of memory, is pursuing a low dimensional structure, a low dimensional distribution, of high dimensional data. Okay. So, once that became clear to me, it became a very soul searching process. Now, if low dimensionality is the only prior, the only so-called inductive bias or assumption we can make, can we deduce everything from there?
No. But if this is the only thing we can use, then what's special about it? The dimension is very low. Hence, the volume of the data should be very low. And then come a lot of technical challenges: how do you differentiate?
If I have a set of data and there are two models, both low dimensional, that can equally support the data, explain the data, then which one do I choose? The challenge is that if two models are both low dimensional, then their volumes are both zero. I have an example in my book, right? It troubled me a lot.
We have eight dots on a line. In one case the eight dots are evenly distributed; in another, four of the dots are clustered together. And the interpretation is very ambiguous if you think about it. What's wrong with any of them?
I could treat them as just eight dots, each occurring with probability one eighth, right? Nothing wrong with that. Or when should I say they all lie on a straight line? But a line already has one degree of freedom, one dimension more than dots, right?
And then, what's wrong with saying those eight dots lie on a plane, right? But they all have zero volume, zero dimension, right? So the question is, how do we measure the volume of the data? If you want to compress, you have to have a more generalized notion of volume to measure the space spanned by the data. That forced me back to the concept of entropy.
But entropy is also limited, precisely because it does not differentiate those kinds of interpretations. If you think about it, they are all degenerate zero or one dimensional distributions, so the differential entropy is negative infinity. You are comparing one infinity with another infinity, or zero with another zero. So we came across lossy coding, just as Shannon developed after information theory. He described rate distortion: when we actually do coding, we do lossy coding.
We do noisy coding, right? That actually became a sort of magic ingredient. It gives us a much more general measure of volume for data in arbitrary spaces, and the support of the measure can actually be degenerate. You can differentiate one low dimensional model against another, right? And that becomes a measure by which we can weigh the data to differentiate one model from another.
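To make that concrete, here is a minimal numerical sketch, assuming the lossy coding rate in the form commonly used in this line of work, R(Z, eps) = 1/2 log det(I + d/(n eps^2) Z Z^T): at any finite quantization scale eps it assigns a finite, comparable "volume" to degenerate point sets, so it can tell the eight collinear dots from eight dots spread over the plane, where differential entropy would be negative infinity for both.

```python
import numpy as np

def coding_rate(Z, eps):
    """Lossy coding rate R(Z, eps) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T).
    Z has shape (d, n): n samples living in R^d; eps is the quantization scale."""
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T)
    return 0.5 * np.linalg.slogdet(gram)[1]   # log-determinant of a positive-definite matrix

# Eight dots evenly spread on a line in R^2 (intrinsically 1-D) ...
t = np.linspace(-1.0, 1.0, 8)
on_line = np.vstack([t, np.zeros(8)])

# ... versus eight dots scattered over the plane (intrinsically 2-D).
rng = np.random.default_rng(0)
scattered = rng.uniform(-1.0, 1.0, size=(2, 8))

for eps in (0.5, 0.1, 0.01):
    print(f"eps={eps:>4}:  line {coding_rate(on_line, eps):6.2f}   plane {coding_rate(scattered, eps):6.2f}")
# At every scale eps the collinear set needs fewer bits, and the gap grows as eps
# shrinks: the lossy coding rate acts as a generalized volume that can still
# compare degenerate, low-dimensional configurations.
```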
Now, if we compress this coding rate, it allows us to pursue those distributions in very high dimensional spaces. In fact, the very popular diffusion-denoising process is doing precisely this: the denoising process precisely reduces the entropy, so that we pursue a representation that is lower dimensional, lower entropy. In the end, it converges to the distribution of the data. That is the first stage. Now, for memory, it's not just about finding where the distribution is, right?
I use the analogy: it's not just finding where the distribution is, it's also about organizing it, right? A lot of people say, oh, learning is about compression, so why don't we go all the way to Kolmogorov complexity?
No, you don't want to do that. First of all, Kolmogorov complexity is not computable. Second, we all know that if you really managed to compress your data down to its Kolmogorov complexity, the code would be random. Basically, the program that specifies the data would itself look random. We don't memorize a bunch of random numbers in our brain, right?
Instead, our memory in the cortex is highly structured, right? Different types of objects are very well organized in the IT cortex. And our spatial understanding is very well organized in the hippocampus, right? It's highly structured because the structure allows us to access it, because we want to access the knowledge repeatedly.
We want to use it very efficiently, under very, very different conditions. As I said, our brain is very much doing Bayesian inference. Once we learn the distribution, we organize it; we transform the distribution into a very structured and organized form. So, the maximal coding rate reduction precisely reflects that necessity.
You do not just reduce the coding rate to find where the distribution is. You also want to transform it into a new representation such that the rate reduction gets maximized. Hence, the resulting representation becomes structured and organized, which facilitates efficient access. You can access the memory under all types of conditions, which allows us to do all types of conditional prediction, generation, and estimation, right?
That's the essence. Reduce the entropy, reduce the coding rate, and then maximize that reduction of coding rate. They reflect two related processes for building a good memory.
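As a toy illustration of that "expand the whole, compress the parts" picture, here is a rough sketch of the rate reduction quantity in the form it usually takes in the MCR² (maximal coding rate reduction) line of work; the exact normalization below is my assumption, not a quote from the book. Two classes placed on orthogonal lines score higher than the same classes collapsed onto a single line.

```python
import numpy as np

def coding_rate(Z, eps):
    # R(Z, eps) = 1/2 * logdet(I + d/(n*eps^2) * Z Z^T): the lossy "volume" of Z.
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps=0.5):
    """Delta R = R(whole) - sum_j (n_j/n) * R(part_j): the rate of the union
    minus the sample-weighted rates of the parts."""
    n = Z.shape[1]
    whole = coding_rate(Z, eps)
    parts = sum((np.sum(labels == j) / n) * coding_rate(Z[:, labels == j], eps)
                for j in np.unique(labels))
    return whole - parts

t = np.linspace(-1.0, 1.0, 20)
line_x = np.vstack([t, np.zeros(20)])          # class 0 on the x-axis
line_y = np.vstack([np.zeros(20), t])          # class 1 on the y-axis
labels = np.array([0] * 20 + [1] * 20)

organized = np.hstack([line_x, line_y])        # parts orthogonal: the whole spans 2-D
collapsed = np.hstack([line_x, line_x])        # parts overlap: the whole stays 1-D

print("orthogonal parts:", round(rate_reduction(organized, labels), 3))
print("collapsed parts :", round(rate_reduction(collapsed, labels), 3))
# The orthogonal arrangement wins: each part stays compact (low rate) while the
# whole expands (high rate) -- the structured, organized memory described above.
```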
First of all, the manifold hypothesis comes to mind, this idea that all natural data falls on some low dimensional structure, you know, with a low intrinsic dimension. And the other thing that springs to mind is, I mean, I'm a fan of geometric deep learning, which is this idea that we should imbue the system with inductive priors which represent symmetries and geometric structures in the world. And I think as a principle, that's deeply embedded in this idea.
Exactly. If you look at my whole career, I have written four books, right? My early interest was in computer vision, and the first book is on 3D vision. From that work, I studied multi-view geometry.
From that work on, all four books are really about one theme I came to realize, which is structure in the data, and it's especially reflected in that first book on 3D vision. In the last chapter, I realized precisely what an important role symmetry plays in our perception, right? When we perceive an object, we naturally use that structure; as people studying vision, we realized this a long time ago, right?
These days, people always say vision is about recreating 3D. Absolutely not, right? People say: we just take multiple images and recreate a whole point cloud, a mesh, a signed distance function, a NeRF, a Gaussian splat, right? We recreate a scene; see, I can look at it from multiple angles.
Is that 3D understanding? Or you create some videos, like Sora, and they look good, so is that understanding? Absolutely not, right? That is not the representation or understanding of a world model. Our understanding is far beyond getting a whole bunch of point clouds or Gaussian splats we can view from different angles, right? Have you noticed that when we see something, we get excited because we understand the 3D?
We understand the content. We parse it already in our brain. But the machine has no idea what the heck is in there, okay? It's just a bunch of point clouds. It's a depth map, okay?
When we see the angle change, we see 3D. We already automatically recognize: this is a hand, this is a body, this is a cup, this is an apple, right? We do that.
We fill in that information with our brain. We think machines, once they can reconstruct 3D, understand all of that. This is completely wrong, okay? Many, many works say: we are building a 3D model by creating something for people to look at. That's completely missing the purpose.
So look at our vision, our visual system, right? We have the hippocampus. We have the IT cortex. It's highly structured. We understand the relationships between view-centric, object-centric, and allocentric representations, right?
Neuroscientists understand this very well, right? But not computer scientists, not computer vision scientists. Some do. Let me give you an example, right?
To probe spatial understanding, we actually did a test about a year ago. We tested all the top multimodal models, you know, the huge models, GPT, Gemini, right, with very simple tests. The title of the work is Eyes Wide Shut, right? It's a very simple test: it just gives images to those language models, big multimodal, highly trained, highly commercialized models, and asks whether they understand spatial reasoning. What is on the left of something?
How many objects are there in the space? What is behind something? What's on top of something? Very simple spatial questions, requiring not even very deep spatial understanding.
But all the models fail miserably. The majority of them are actually even worse than random guessing, right? I think only Gemini and GPT are a little bit above random guessing, okay? Far below human understanding. So that's the status, right?
If you see that, it means that this kind of 3D understanding is very, very difficult for them. But humans do this effortlessly, right? I can easily say: please hand me the bottle to your left. Or if you want to find, say, a shopping center: go through the door, turn right, and once you get outside the building, head south, right? Remember, just in these simple sentences, we already switch from view-centric to object-centric to allocentric, right?
So if we don't have this kind of model, this kind of highly structured 3D model, forget about what people call embodied AI or a world model, right? We cannot conduct even these very simple spatial references and interactions. We have a world model not to visualize; we build a 3D model to interact, to manipulate, to influence, right? We're not building a 3D model just so we can change our view and look at it from this angle or that angle.
Just to visualize, to turn it three hundred and sixty degrees and look at it? No, we don't do that. That's not our purpose, okay? Unfortunately, the field gets distracted by that kind of visualization.
It looks cool, but honestly, if you really work on robotics, on navigation, locomotion, manipulation, the usage is pretty limited. I won't necessarily say they're useless, but they are actually pretty limited.
We should introduce the coding rate formula. I did have a question about that, which is that there's an epsilon in it. So there's a bit of a question of how we tune that and what it means. We should also bring in, we've been talking about this a little bit, this concept of an LDR, a linear discriminative representation. And just more broadly, with these inductive priors, there's always the question: when we do abstraction to model regularities in the universe, there's always a little bit left over, isn't there?
So to what extent can we think of these things as natural?
Actually, with the epsilon, you touched upon a very, very deep question. It took me almost thirty years to understand it, to be honest. We did mention it early on, when we tried to differentiate different measures, different volumes.
It turns out that lossy coding is necessary. It's not just something hacky; it actually turns out to be necessary to do lossy coding. In fact, we recently started to realize that noise actually plays very different roles. And yet it's very confounded, very confusing to many people.
This is something my students and I have started to realize; we will probably have some papers about it. I can elucidate it a little bit. If you think about the whole diffusion-denoising model that is so popular right now: why do we add noise to the data, to the whole space? Because we don't know where the distribution is, right?
There is a phrase everybody knows: all roads lead to Rome, right? Why is that? Has anyone given a thought to why all roads lead to Rome? Very simple: at some point in history, Rome built the roads to reach the whole world, right? That's a diffusion process.
Then if you want to find Rome, you denoise. You follow the same roads back, and you get to where Rome is. That's the low dimensional structure. That's where the knowledge is, right?
That's where the oasis is, right? So it's a very natural process: adding noise is precisely building the roads, and denoising brings us back, remembering where we came from, and so on. And that's a big epsilon.
We have to add noise to reach the whole space. And there's actually another role for noise, right? Remember, we only have isolated samples. Even when we talk about manifolds, how many points do we have on the manifolds?
How many points do you observe? They're always finite, right? So why do you call it a continuum? Why do you connect dots into lines, planes, surfaces? When do you do that?
Hence, noise plays another role within the manifold. Even if you have finite samples, if you allow lossy coding, if you allow packing spheres around the samples, things start to connect. Noise is very important for connecting the dots, right? We all know the phenomenon of percolation, right?
We see raindrops on the floor. You only ever see two phases, right? One phase is where all the drops are isolated. The other phase is where everything gets wet. You never see anything in the middle, because there's a sharp phase transition.
Once the density of the drops gets high enough, everything connects, right? Maybe that's a phase transition we reach: we realize a connected plane is a better solution to explain all the data, more parsimonious, more economical. The cost of memorizing all the dots versus memorizing the plane starts to switch. Maybe abstraction has something to do with that.
I don't know. But from a compression point of view, this can really allow us to explain when we go from zero dimensional samples to preferring a low dimensional manifold, and also how we go from that low dimensional manifold to reach the rest of the space. So you can see, even in this process, noise, the epsilon, is already playing different roles.
And at some point they get connected around the surface. That's what we're still trying to figure out. But for these two big phases we already know that the epsilon plays different roles, right? And I think in the past several years, our understanding of this subject, of how we compress, how we pursue low dimensional structure from finite samples, has truly advanced dramatically. I'm very happy.
Honestly, this is a question that baffled me when I was a graduate student. You can see that even my early work on lossy coding, lossy compression, reflected my bafflement about it. And I really feel very thrilled that I have recently started to understand these things in a more unified way, not only theoretically but also algorithmically.
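Here is a small sketch of that roads-to-Rome picture, under simplifying assumptions (the denoiser below is the idealized posterior-mean denoiser that gets to see the clean samples, standing in for a learned score or denoising network): noising spreads samples from a 1-D line out into the plane, and a denoising step pulls them back, collapsing the off-manifold spread that the noise added.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Where the knowledge lives": clean samples on a 1-D structure (a line) in R^2.
n = 200
t = rng.uniform(-1.0, 1.0, n)
clean = np.stack([t, 0.3 * t], axis=1)                 # shape (n, 2), rank-1 data

# Diffusion step: isotropic noise so the samples reach the whole plane
# ("building the roads that go everywhere").
sigma = 0.3
noisy = clean + sigma * rng.normal(size=clean.shape)

def denoise(y, data, sigma):
    """Posterior-mean (MMSE) denoiser for the empirical distribution of `data`:
    a Gaussian-weighted average of the clean points, i.e. walking the road back."""
    d2 = ((y[None, :] - data) ** 2).sum(axis=1)
    w = np.exp(-(d2 - d2.min()) / (2.0 * sigma ** 2))  # shift for numerical stability
    return (w / w.sum()) @ data

denoised = np.array([denoise(y, clean, sigma) for y in noisy])

def off_line_spread(X):
    # smallest singular value of the centered data ~ spread away from the line
    return np.linalg.svd(X - X.mean(axis=0), compute_uv=False)[-1]

print("spread of noisy samples   :", round(off_line_spread(noisy), 2))
print("spread of denoised samples:", round(off_line_spread(denoised), 2))
# Denoising shrinks the off-manifold direction: the step reduces the entropy /
# coding rate and moves the samples back toward the low-dimensional structure.
```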
Yeah. It's so fascinating that we can look out the window and we ignore so much detail. We don't look at the leaves on the roads. We just find the structure. And that's why, when I watched your presentation, I was very intrigued when you said that iterative denoising is a form of compression.
I wanted to mention your ICML 2024 paper, so last year, in Vienna, right, with Wang. And you found that the loss surfaces, using this technique, are dramatically different. They're very smooth. There are no kind of harsh local minima and so on. What's the intuition for that?
In fact, our understanding of those phenomena goes back to the early days when we studied sparsity. You know, when your data lies on very low dimensional sparse supports, low dimensional planes, orthogonal planes, or low rank matrices, right? And there we learned a very big lesson. The objective functions used to evaluate sparsity or low dimensionality are highly nonlinear, non-convex. And yet, traditionally, the orthodox understanding of non-convex optimization is that it is always hard, right?
The general class is NP hard, and there are lots of spurious local minima. You get stuck at local minima, or at stagnant critical points, flat surfaces. Basically, the worst-case picture is very bad, right? It's a nightmare.
But through the study of those low dimensional, sparse structures, which is what was featured in my previous book on low dimensional structures in high dimensional data, we realized that for a lot of non-convex problems, even when the optimization problem has a non-convex landscape, if those problems or those measures arise from nature, from very natural sources, the structures are highly regular, highly symmetric, and the landscape is actually extremely benign, right? Quite contrary to our common understanding of nonlinear optimization. This is a complete 180 degree flip of views. In fact, even the high dimension helps.
The higher the dimension, the better. We call it a blessing of dimensionality. So that regularity, that symmetry, tells us the landscape of these objective functions is actually beautiful, right? First of all, they're highly regular. There's no stagnation.
There's no flat surface. There aren't too many spurious local minima. And even the local minima already have very clear geometric and statistical meaning. Hence, those landscapes are very amenable to very simple algorithms, such as gradient descent, finding the optimal solution. Which almost indirectly explains why, even when training modern neural networks, where we're searching for low dimensional distributions in very high dimensional spaces, gradient descent somehow always ends up somewhere nice.
Okay, fine, you may have to run a long time, but somehow those landscapes are not that hard to traverse, right? It could be precisely because those objective functions are highly regular. Now, getting back to the rate reduction objective: if you look at that objective function, it's not something arbitrary, right?
It's counting the volume of the whole minus the volumes of the parts, right? It's something extremely objective. It's not like a loss function people come up with randomly: oh, add this term, take a weighted sum, add different weights, use some empirical or ad hoc penalty. All the terms describe the volumes of the data, right?
Hence, you should expect these to be quantities arising in nature. And from our earlier lessons, we realized that indeed those objective functions have very benign landscapes. Not only the global minima, whose solutions give you orthogonal subspaces, but even the local minima look a lot like the global optima; they have similar geometric structures. And there are no other weird critical points that slow down the search for those minima.
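For readers who want the quantity being discussed made concrete, here is a minimal NumPy sketch of the rate reduction objective, the log-det "volume of the whole minus the parts" following the published rate-reduction formulation. The toy data, the epsilon value, and the function names are illustrative, not taken from the book's code.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Lossy coding rate of Z (d x n): log-volume of the data at precision eps."""
    d, n = Z.shape
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * Z @ Z.T)[1]

def rate_reduction(Z, labels, eps=0.5):
    """'Volume of the whole minus the parts': R(Z) - sum_j (n_j / n) R(Z_j)."""
    n = Z.shape[1]
    parts = sum((np.sum(labels == c) / n) * coding_rate(Z[:, labels == c], eps)
                for c in np.unique(labels))
    return coding_rate(Z, eps) - parts

# Toy check: two classes on orthogonal directions give a larger rate reduction
# than two classes squeezed onto the same direction.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
Z_orth = np.concatenate([np.outer([1.0, 0.0], rng.normal(size=50)),
                         np.outer([0.0, 1.0], rng.normal(size=50))], axis=1)
Z_same = np.concatenate([np.outer([1.0, 0.0], rng.normal(size=50)),
                         np.outer([1.0, 0.0], rng.normal(size=50))], axis=1)
print(rate_reduction(Z_orth, y), ">", rate_reduction(Z_same, y))
```

Every term is a log-det volume of actual data, which is the sense in which the objective is "not arbitrary."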
So that's actually quite interesting. You can see this revelation allows us to understand that maybe intelligence is precisely about exploiting and harnessing those structures. And as we understand intelligence more and more, I think there is a very big misunderstanding about it, right?
When you study machine learning theory, we have a tendency to believe that intelligence, especially intelligence in nature, is designed to solve the hardest problem, the worst case. I beg to differ. Intelligence is precisely the ability to identify what is easy to address first, what is easy and natural to learn first. Only once that has been done, and resources permit, does it start to get into more and more advanced tasks.
Not everybody needs to learn all of mathematics to survive. Animals don't, right? Nature finds the easiest things, with minimal energy, almost minimal effort, to learn the most knowledge so as to survive the best. Again, this is the principle of parsimony at play; there's another level of resource parsimony at play here, right?
Hence, once you realize this, you realize that understanding intelligence should really be about understanding what is most common: the low-dimensional structures, the easy ones, the smooth ones, the benign distributions, the ones you can learn from a few samples, the ones that are very easy to formulate. In fact, that's how science progressed. A lot of the physical models, Newton's laws, are very simple. The simple ones came first.
Then we gradually reached general relativity and then quantum mechanics; those equations get more and more complicated later, right? So it's the same process. We identify what is most common first, what is the easiest task first.
Much of machine learning theory tends to derive bounds for the worst cases. I think we should probably think twice about that. Right.
I love that characterisation. It's similar to the least action principle in physics. Exactly. In a sense, we solve problems by taking many, many steps in different directions. I think we still leave a little bit of entropy open.
We don't do pure hill climbing. Yeah. But collectively, we acquire these stepping stones, and the totality of that process is that we solve very complex problems. But I wanted to touch on a very interesting point you raised, which is that when we have very large deep learning models, they tend to almost self-regularize and they learn better. And there's this phenomenon of double descent and all of this.
Tell me about that.
Fascinating question. This question really brings me back to the early days when I tried to understand deep learning. When deep learning arose, there were a lot of phenomena people tried to understand, and I was one of those people.
There's something good about dropout, something about thresholding, something about normalization. Then it also came to this: somehow the models are very big, with a lot of parameters, yet the deep networks do not have a tendency to overfit; somehow they still generalize okay, right? And then, of course, people realized that, rather unlike the classical bias-variance trade-off, there tends to be this double descent.
I actually wrote a couple of papers about it, about normalization, about everything. Around 2020, or late 2019, I told my students we should stop trying to explain those isolated phenomena. We were like the blind men and the elephant: each one sees a little piece, each theory tries to explain a little bit.
I think there should be a total explanation if we get the big picture; all these are just consequences or implications of that. At that time, we started to touch upon the idea that maybe deep networks are optimizing something, that they are realizing optimization layer by layer, optimizing an objective that promotes parsimony, promotes low dimensionality.
Once we realized that, I was quite thrilled. So I told my students: from now on, we will no longer write any papers about overfitting. Why? Because if the neural network's operators are trying to compress, to realize a certain contracting map, to compress volume, they will never overfit, right?
Even if I overparameterize, it will never overfit. A simple example: if my data lies on a straight line, a one-dimensional curve, whatever, I can embed this one-dimensional line in two dimensions, three dimensions, or a million dimensions. But if my operator, layer by layer, at each iteration, is always just shrinking my solution towards the line in all directions, I will never overfit. Even if I overparameterize and embed the line
into billions of dimensions and have billions of parameters, collectively all those billions of parameters are shrinking my solution, pushing it, denoising it, compressing it towards the line, like a power iteration, just like PCA, right? Power iteration, regardless of the dimension it's embedded in, computes the first singular direction; it converges at the same speed. So compression, by its nature, if the operators are performing compression or denoising, means this process will no longer overfit anything, right? If you conduct it right, if it converges, the solution will converge to the structure you desire.
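A small sketch of the power-iteration analogy: the contraction toward the leading direction is governed by the spectrum of the data, not by how many ambient dimensions the line is embedded in. The data generation, noise level, and dimensions below are purely illustrative.

```python
import numpy as np

def top_direction_by_power_iteration(X, iters=200, seed=0):
    """Power iteration toward the leading principal direction of X (n x d).
    Each step contracts the estimate toward that direction; the rate depends
    on the spectral gap, not on the ambient dimension d."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=X.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = X.T @ (X @ v)          # one step of the contracting map
        v /= np.linalg.norm(v)     # keep only the direction
    return v

# A noisy one-dimensional "line" embedded in more and more dimensions:
# the recovered direction stays aligned with the true one in every case.
rng = np.random.default_rng(1)
for d in (3, 100, 10_000):
    u = np.zeros(d); u[0] = 1.0
    X = rng.normal(size=(500, 1)) @ u[None, :] + 0.01 * rng.normal(size=(500, d))
    v = top_direction_by_power_iteration(X)
    print(d, abs(v @ u))           # close to 1 regardless of the embedding dimension
```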
That raises a natural question. We were interviewing Andrew Wilson from NYU, and he's got several papers about implicit biases, where you have a combination of hard biases like symmetries and everything in between. If what you're saying is true, then why do we need inductive biases at all? Could we not pare back a little bit and just have really big models?
No. I see. So this is the thing, exactly.
Early on, people didn't understand deep networks, and there was a lot of empirical trial and error. People tended to use the phrase inductive bias as some kind of magic sauce that explains either the failures or the successes of designing or training a neural network a certain way. To be honest, for a long time I never understood what the inductive bias was. Maybe some regularization; maybe some people were describing some structure of the network or of the data.
But nowadays, in my recent work, from what I understand, all the inductive bias should be formulated as first principles. At least we were able to deduce all the different architectures, including the recent white-box CRATE transformer-like, ResNet-like, or mixture-of-experts-like architectures, from the only inductive bias being the assumption that the data distribution you are pursuing is low-dimensional. Okay? From that you already get the form of the main architecture, the form of the operator for each layer: the ResNet structure, the mixture-of-experts structure.
And those per-layer operators are precisely conducting denoising, compression, or contrasting. Are there additional assumptions you can make? Yes. For example, if my job is not just to compress the data as it is: in object recognition, I also want my classification to be translation invariant, which is a symmetry, right? If my task should be invariant to a certain group action, I want to compress the orbits together.
Voila, what do you get? Because it is still compression, you get a convolution, naturally, as the structure of the compression operator. So convolution is not something we impose. It actually results from the first principles, the quote-unquote inductive biases we assume: you want to compress your data.
And you want your compression to respect translation invariance or rotation invariance. Convolution is the result; that is the characteristic the compression operator must have to achieve that task, right? So we don't want to build in inductive biases while we're searching for the solution. The inductive bias, in my understanding, should be the very assumption we make at the very beginning.
The rest should be deduction; there should be no more induction. Otherwise, we're doing trial and error, right? Basically, when we build a theory, we should have done all the inductive observations, experiments, and assumptions already. A good theory should start with very few inductive biases, assumptions, or axioms, and then the rest should be deductive. I call that first principles.
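A toy sketch of the convolution point above, not a derivation from the book: if you take an arbitrary linear "compression" operator and average it over cyclic shifts so that it respects translation, the resulting operator is circulant, i.e. it acts as a circular convolution. The size, the random operator, and the FFT check are all illustrative.

```python
import numpy as np

n = 8
S = np.roll(np.eye(n), 1, axis=0)        # cyclic shift: (S x)[i] = x[i-1]

rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))              # an arbitrary linear operator

# Group-average A over all cyclic shifts so that it commutes with translation.
A_equiv = sum(
    np.linalg.matrix_power(S, k) @ A @ np.linalg.matrix_power(S, (n - k) % n)
    for k in range(n)
) / n

# A shift-equivariant linear operator is circulant: applying it is the same as
# circularly convolving with its first column.
x = rng.normal(size=n)
kernel = A_equiv[:, 0]
conv = np.real(np.fft.ifft(np.fft.fft(kernel) * np.fft.fft(x)))
print(np.allclose(A_equiv @ x, conv))          # True: it acts as a convolution
print(np.allclose(A_equiv @ S, S @ A_equiv))   # True: it respects translation
```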
We've been speaking about parsimony, which is what to learn, and self-consistency, which is about how to learn. And we can sketch out a journey, I suppose, from control theory to learning. This methodology also has some interesting results, I think, around the continual learning problem. So let's sketch that out.
You can see, right? The compression, or even the rate reduction, tries to pursue the data distribution and transform it, and that's a one-way direction. There's almost no theoretical guarantee that your data is sufficient to identify the distribution. You may start with very, very few samples; there's no way the data is sufficient. I mean, maybe there are five types of apples.
Maybe I've only seen four types, right? But that process goes on. You compress what you have and reach a memory, right? And there's no guarantee for that process either; you may get stuck, maybe there are not enough iterations.
So the memory you get may not be accurate, may not be correct. Hence, how do we check? How do I further develop, evolve, and improve my memory, or make sure my memory can authentically predict, that this world model is accurate, right?
You actually have to decode it. You can think of memory formation as an encoding process. Then, from the memory, I want to decode. I want to predict what's going to happen next second from what I observe right now. Or at night, I may want to dream about what happened, right?
So it's the decoding that allows us to check whether my memory is right, how accurately I can predict the next step. Hence, this already forms a sort of autoencoding framework. Now, of course, with autoencoding, if I have access to both the observation and my memory, just like when training our big data models, I have control of both ends. I can just enforce the autoencoding end to end, right?
People like to talk about that. But in a natural setting, for animals and humans, we don't have control of both ends, right? We probably only have control of our own brain, what's inside our brain. We never really have access to measure our prediction against the 3D world itself. For example, the frame of a picture is rectangular.
Do I ever measure it? Right? You don't have to measure it. But somehow everybody believes the model is correct. How do we do that?
Hence, there's actually a self-correcting process. In fact, this idea probably goes back to Norbert Wiener, right? How can an animal correct its errors? See, cats can catch something very accurately, and even if they make a single mistake, they can correct it very quickly, right?
Somehow they're able to build a world model that is consistent, self-consistent with the world, without physically measuring their errors. So the idea is that you loop it back to your brain and close the loop. That allows you to constantly predict, and, based on your predictions and your observations, to check within your brain whether there is a difference between the prediction and the observation. And if there's an error, you use that error to correct.
It turns out, and this is work with my students, that you can do this. Of course, our observations lose information; they introduce noise, lose dimensions. But it turns out that as long as the distribution of the world's data is low-dimensional enough, even if your encoding, your observation and perception process, is noisy, this is still doable: precisely when the distribution of the data in the outside world has enough structure, is sufficiently low-dimensional.
And hence your brain has enough degrees of freedom to discern any differences. So this is quite an interesting revelation: low dimensionality is not just a technical assumption. It's actually necessary for this kind of closed-loop learning to be possible. And once you can close the loop, you can constantly observe, constantly predict, constantly use your memory to predict and correct it.
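A toy illustration of the closed-loop idea, not the construction from the papers: the prediction is only ever compared with the observation after both pass through the same perception map, and whether a wrong internal model can be detected this way depends on the encoder keeping enough degrees of freedom relative to the data's low dimension. The encoder, decoder, and data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, d))   # "world" data on a 2-D subspace

def in_loop_error(encoder, decoder, X):
    """Discrepancy measured entirely inside the loop, never in raw data space."""
    Z = X @ encoder                    # perceive the observation
    X_hat = Z @ decoder                # predict / reconstruct from memory
    Z_hat = X_hat @ encoder            # perceive the prediction through the same channel
    return np.linalg.norm(Z - Z_hat) / np.linalg.norm(Z)

# An encoder that keeps the data's two directions can catch a wrong world model.
good_enc = np.linalg.svd(X, full_matrices=False)[2][:2].T   # d x 2
good_dec = good_enc.T                                       # consistent decoder
bad_dec = rng.normal(size=(2, d))                           # a wrong world model

print(in_loop_error(good_enc, good_dec, X))   # ~0: consistent model, no error signal
print(in_loop_error(good_enc, bad_dec, X))    # large: the wrong model is detected
                                              # without ever measuring error in data space
```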
Hence continuous learning, even lifelong learning, right? Rome was not built in a day, and neither is our memory. We constantly improve it, constantly revise it. And this is the mechanism of intelligence.
This mechanism itself is already generalizable. Hence, you don't need to add the adjective general in front of intelligence; there's no point in saying general intelligence. If you implement the mechanism of intelligence correctly, it's already generalizable. The knowledge learned by this mechanism at any point in time may not be generalizable.
But the mechanism is, right? This is a very big confusion. We think that if we accumulate enough knowledge, it's generalizable. No, it's not. It never will be.
Any scientific theory, by definition of being scientific, is falsifiable, which means it's limited, right? It can only explain the world up to a certain point or a certain accuracy; there's always room for improvement. The scientific activity, our ability to revise our memory, to acquire new memory, that is the generalizable ability. That is intelligence, right?
Through natural selection in the early days, through feedback control and feedback correction, through the human history of trial and error accumulating empirical knowledge, through scientific discovery, it's all doing this, right? That is what's common behind intelligence, not the memory accumulated up to a certain point. So even if we manage to memorize all the knowledge in the whole world, we will not be able to apply it when we find ourselves in a new environment, a new situation, observing phenomena we have never seen before, right? That's the limitation of trying to gain general intelligence just by accumulating enough knowledge.
We should talk about your CRATE series of architectures. CRATE stands for coding rate reduction transformer, and you made some very interesting discoveries. For example, multi-head self-attention can be derived as a gradient step on the coding rate, and the MLPs as sparsification operators. And you were talking about how something like a transformer can be described in a principled way. So there's this interesting thing, isn't there, that we didn't even design them.
We kind of empirically tried lots of different things and happened upon the transformer. But something like that can actually come about from a first-principles approach.
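To make the two operations named in that question concrete, here is a generic sketch of the two kinds of steps being referred to: a gradient step on a lossy coding rate (the flavor of operation the CRATE attention layers are reported to be derived from) and an ISTA-style sparsification step (the flavor of operation that replaces the MLP). This is not the exact CRATE update; the dictionary, step sizes, and shapes are illustrative.

```python
import numpy as np

def soft_threshold(a, t):
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def compression_gradient_step(Z, alpha=1.0, step=0.1):
    """One gradient descent step on the lossy coding rate
    R(Z) = 1/2 logdet(I + alpha Z Z^T); descending it contracts the tokens."""
    d = Z.shape[0]
    grad = alpha * np.linalg.solve(np.eye(d) + alpha * Z @ Z.T, Z)
    return Z - step * grad

def ista_step(a, z, D, lam=0.1, step=0.1):
    """One ISTA (proximal gradient) step for sparse coding of z against a
    dictionary D: gradient step on the fit, then soft-threshold."""
    a = a - step * D.T @ (D @ a - z)
    return soft_threshold(a, step * lam)

# Toy usage on random tokens (d features x n tokens); D is a hypothetical dictionary.
rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 32))
Z = compression_gradient_step(Z)          # attention-like compression step
D = rng.normal(size=(16, 64)) / 4.0
codes = ista_step(np.zeros((64, 32)), Z, D)   # MLP-like sparsification step
```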
If you look at the past decade or so, the evolution of the big models has been a kind of natural selection process, right? From the early days, LeNet, AlexNet, VGG, then ResNet, then transformers. By the way, these are just the few survivors, right? As I said, it's just natural selection.
Remember, people shouldn't forget: there was a time when there was a very popular area called AutoML, right? People tended to do random search for better architectures. Somehow only a few survived. Why? There must be a reason, right?
They must capture certain structures. They must have done something right. From our understanding so far, ResNet actually captures the fact that each layer should be doing optimization; ResNet precisely reflects an iterative optimization architecture, right? And MoE precisely captures this fact:
we're trying to cluster and compress what's similar, and discern, classify, or contrast what's dissimilar, right? And you want to develop different experts. Call them experts, call them clusters, call them groups.
So be it, right? And the transformer, again, has captured the correlations. Self-attention precisely computes the correlations in the data, the covariance in the data, what's correlated, and uses that to further sparsify, further classify, to organize the distribution. They must be doing something; they're somehow close to something right.
So it was almost a belief for us: if we believe these architectures are doing something right, then we should be able to derive them, to derive CRATE, from first principles and have a very clear, unified understanding. I think we have sort of managed to do that, at least for the structures discovered so far, and to provide a rather unified explanation of what they do. To be honest, our earliest motive was just to explain and understand what we had done. But once we understood it, we realized we can go much further, right?
We realized that even in the current architectures there's a lot of room for improvement. Not only can we dramatically simplify them; you can see that over the last year and this year there's a series of work from my group that really showcases this. Once you understand what is being done and the principle behind it, you can dramatically simplify. You can even throw away the MLP layer if you only care about the compression,
if you don't care about the final representation. Or you can change the attention head. Since we know what it is optimizing, the rate reduction objective, we can find an equivalent variational form of that objective which is much easier to optimize. We end up with what we call ToST, right? There, the self-attention step, computing the token statistics, is only linear in the number of tokens, no longer quadratic like the current attention.
Of course, if you look at the literature, other people have tried to identify linear-complexity architectures, such as Mamba or, I think, RWKV, but empirically, again through trial and error. Now we derive this in a purely mathematical way, because we just find an equivalent variational form of the same objective function that has the same global optima but is much easier to optimize. This is a trick we do all the time in optimization, right?
All the tricks from two hundred plus years of developing better optimization algorithms, all those ideas can now help us design better operators, descent operators, or optimization architectures to improve current designs. Honestly, we have not gotten that far yet, right? There are many acceleration techniques, preconditioning, conjugate gradient, which exploit different landscapes. Once we understand the landscape, the type of objective function, better, there are gazillions of ideas we can use to further improve efficiency.
Honestly, we haven't gotten that far yet. That's actually got some of my students excited to pursue this, realizing how little we have done from an optimization perspective and how much room there might still be for improvement. So you can see, even within the last couple of years, we already have two or three generations of architectures. In the past that was almost unthinkable, because each new generation always came from a different group, right? It was like a random process.
Whoever gets lucky maybe discovers something that works, or tries hard enough to get something to work.
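As an illustration of the complexity point about ToST, and not the actual token-statistics operator: standard self-attention materializes an n-by-n matrix of pairwise scores, whereas an operator built from aggregated second-moment statistics of the tokens only ever forms a d-by-d matrix, so its cost grows linearly with the number of tokens. The shapes and the simplified statistics operator below are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2048, 64                          # number of tokens, feature dimension
X = rng.normal(size=(n, d))

def quadratic_attention(X):
    """Standard self-attention shape: an n x n matrix of pairwise token scores,
    so time and memory grow quadratically with the number of tokens n."""
    scores = X @ X.T / np.sqrt(X.shape[1])                 # n x n
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (scores / scores.sum(axis=1, keepdims=True)) @ X

def statistics_attention(X):
    """A statistics-based alternative (illustrating the complexity point only):
    aggregate a d x d second-moment matrix over all tokens once, then apply it
    to every token. Cost is O(n d^2), linear in n."""
    M = X.T @ X / X.shape[0]                               # d x d token statistics
    return X @ M

out_q = quadratic_attention(X)    # materializes a 2048 x 2048 score matrix
out_s = statistics_attention(X)   # only ever forms a 64 x 64 matrix
```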
It's a tantalizing idea, though, that through this principle of optimization there could be a convergent evolution towards the optimal architecture.
Then the search will no longer be random; it will actually be guided, right? Just like your earlier suggestion, this becomes intelligent search, guided search. We understand the structure of the problem now, and hence we can do science. We are no longer just running an empirical, inductive search process.
Why is OpenAI still using the transformer even though there are now superior architectures out there? And we should talk about this token statistics transformer. As you just said, it has linear time complexity, which means in principle it should scale dramatically better than the kind of transformers we're using now. So why aren't we using it?
Well, there are attempts to scale this up. Of course, when you try to scale, other factors come in, right, in terms of scalability and so on; it's all related to the design. And indeed, we have tried other things as well that are much more scalable.
We also scaled up with all the resources we have. But, unlike a company, we are very limited in resources. To verify that even our architectures scale, we can only go up to probably a couple of hundred cards or so.
That's about it with our academic resources, and hopefully that will be convincing. But one thing we did recently is to simplify the current practice in DINO, right? Meta has two versions, which pre-train the state of the art. Everybody talks about world models, visual world models, and that's sort of the best model; Meta put a lot of engineering effort into pre-training that visual representation model, and it is still the best.
They train on gazillions of images, and they were using contrastive learning. It's a very remarkable engineering feat, and now people are using it, right? It turns out, we found, that the system can be dramatically simplified
once we realize the purpose, what they're actually trying to do, right? We have a work called SimDINO, simplified DINO, version one and version two; we simplify both versions. The simplification is dramatic: we get rid of dozens of hyperparameters we don't need.
The architecture becomes ten times simpler, and the performance is better. We managed to scale up to a scale of a few hundred million. In apples-to-apples comparisons it is dramatically easier to train and much more efficient. Everything is sustainable.
I think that has drawn serious attention from the Meta team and also from the Google team. And I know there are currently serious efforts to scale these new architectures up.
Yes. We interviewed the DINO folks at the time, and we've spoken to people like Ishan Misra. There's a potential tangent there about them using this kind of non-contrastive self-supervised learning, and also the whole unsupervised question of how useful those representations are for downstream tasks. Maybe we could go there.
But I should say that Kevin Murphy, whom I'm interviewing soon, reviewed your book very carefully, and he asked me to give you this question. He said: coding rate reduction is great, but it must be subject to a prediction or reconstruction loss in data space. How would you go beyond token prediction, which seems especially weird for images? So that's what Kevin asked me to ask you.
This is actually a great question. In rate reduction, remember, the loss is encoded through the epsilon ball; we try to capture how the samples connect with one another. Right now, if we just optimize the coding of the representation through this lossy coding, the error is kind of controlled by the epsilon, but not enforced, right?
We respect the epsilon through this lossy coding process. Now, remember, everything could go wrong. It also depends on the number of samples you have; maybe the epsilon you choose is wrong because the data does not have that density, so you would not be able to percolate.
Hence, the representation learned can be very funky. So, in order to ensure that the representation, the distribution learned internally, authentically reflects the original distribution up to a certain precision, you have to decode, right? There is constant encoding and decoding; our brain actually does that all the time, predictive coding and so on. Hence that encoding and decoding, and verifying whether error remains in your prediction, in your reconstruction, matters a lot.
Knowing that, the question is, going back to our earlier discussion: do we really need to measure that error in the data space, in the original token space? If we have that option, so be it, do that; it makes the engineering simpler. But if we really want a system that, just like a human, learns by itself, that just goes out to observe with two eyes or with some sensors,
then we have to come up with a way to make sure our sensing process is accurate enough that we can do everything internally. We can predict, map the prediction back, observe, and compare what we predicted and what we observed through the same sensing channel; we compare that internally. In theory, we can prove, at least in idealized cases, that this is possible: we can minimize the error.
Once we correct that error, the error in the original token or data space will diminish as well, but under technical conditions. Under general conditions, we still don't know. We actually have a paper proving that when your data distribution is a mixture of subspaces, and the dimension of the subspaces is low enough compared to the capacity of the perception process, this is rigorously possible.
For general distributions, we believe this is true. This is actually how we will be able to learn the low-dimensional dynamical structures in natural data, in motion, in the predicted world. So I think this is something we can develop in the future. But end-to-end works if you have the option to do it that way. If you don't have that option, you have to figure out how to do
this autonomously, under what conditions you can do it autonomously and reduce the error to almost zero.
We spoke about DINO, but another example would be ViT. We interviewed Lucas Beyer in Switzerland earlier this year; he's one of the inventors of ViT. And if I understand correctly, CRATE is now very, very close to ViT, but it's so much more principled, it's explainable, and so on. How close are we to knocking ViT off the leaderboard, if you like?
In fact, I think in many of the comparisons we're already very close. It's hard to compare apples to apples, but if we use a similar number of parameters, we are very much on par. And by the way, we never really put much engineering effort into it; we just wanted to verify the concept. Indeed, one thing that came out of CRATE is that not only is the architecture design principled, but once we did the training, the internal structures learned are semantically, statistically, and geometrically very meaningful.
Indeed, each channel, each head, truly becomes an expert in a certain type of visual pattern: for example, legs of animals, ears of animals, faces of animals. We see that very clearly with CRATE, but we don't observe that in ViT. Of course, ViT may still learn it.
This is actually the interesting thing, right? I'm sure large models, with their redundancy, definitely learn things internally. But it's very hard to say which part of the network learned the correct channels, the correct operators, because it is embedded in a more redundant structure, right? In the early days, people called this the lottery, the lottery ticket hypothesis, right?
It's somewhere in there, right? Then people try to distill. That justifies why you should distill, why you should be able to compress. People even do this LoRA thing.
All that post-processing is justified as necessary. And you find that after the post-processing, not only does the network become smaller, the performance often gets better, right? And so on. Now, probably, we don't have to do that. At least the architecture does what it's designed to do, right?
And we can at least explain what each component is doing, something statistically and geometrically very meaningful. And the results show that if there's enough data, and the optimization is done, the training is successful, those structures pop up naturally. The structures do what they're designed to do.
Yeah. And final question. Many ML engineers and researchers watch the show. Given everything we've spoken about, how can they find out more about your work, and how can they get started building these kinds of architectures?
I think most of our architectures are open-sourced on GitHub, including CRATE and the early ReduNet. ReduNet is more conceptual, not very practical. CRATE and also ToST, all the code is available. By the way, these are sort of academic implementations; we never had the resources to scale them up.
Most are scaled up to GPT-2 or ImageNet-21K scale; that's what we can afford. SimDINO is the one we scaled the most. It exhausted a lot of resources, a little bit beyond that, but still no comparison to industrial scale at all. But I do believe that Meta and Google are doing something about DINO, about simplified DINO.
And the code is there. Also, for the methodology, this is one reason we bit the bullet and wrote the book over the past two years. Although there is a series of papers, we believe that for people to get the big picture, a more systematic introduction, it was worth putting the book together. We also open-sourced it, and we will post or link all the data and all the code as well.
We are also teaching a course, where students will practice most of the new architectures and methods, and all of that code will be made publicly available. So I think that might be a good entrance if people want to learn the methodology and understand the theoretical chain of evidence, and also the empirical chain of evidence. The book is an attempt to do that.
We have already started organizing this. We're not done yet, but if you look at chapter seven, we are already applying the theory seriously to real-world data and tasks, such as image classification, image segmentation, pre-training, and even language, GPT-2-scale language models as well. Yeah.
Professor Ma, it's been an absolute honor. Thank you so much for joining us today.
Yeah. Thank you very much. Yeah.