FAQ - Monty
Frequently asked questions around the algorithm implemented in Monty.
Below are responses to some of the frequently asked questions we have encountered. However, this is not an exhaustive list, so if you have a question, please reach out to us and the rest of the community at our Discourse page. We will also update this page with new questions as they come up.
General
What is the Difference Between the Thousand Brains Theory, Monty, and Thousand-Brains Systems?
We use these terms fairly interchangeably (particularly in our meetings), but they refer to different things. The Thousand Brains Theory (TBT) is the underlying theory of how the neocortex works. Thousand-brains systems are artificial intelligence systems designed to operate according to the principles of the TBT. Finally, Monty is the name of the first system that implements the TBT (i.e. the first thousand-brains system), and it is made available in our open-source code.
Neuroscience Theory
What is the Relationship of the Thousand Brains Theory to the Free Energy Principle?
The Free-Energy Principle and Bayesian theories of the brain are interesting and broadly compatible with the principles of the Thousand Brains Theory (TBT). While they can be useful for neuroscience research, our view is that Bayesian approaches are often too broad for building practical, intelligent systems. Where attempts have been made to implement systems with these theories, they often require problematic assumptions, such as modeling uncertainty with Gaussian distributions. While the concept of the neocortex as a system that predicts the nature of the world is common to the Free-Energy Principle and the Thousand Brains Theory (as well as several other ideas going back to Hermann von Helmholtz), we want to emphasize the key elements that set the TBT apart. These include, for example, the use of a modular architecture with reference frames, where each module builds representations of entire objects, and a focus on building a learning system that can model physical objects, rather than on fitting neuroscience data.
Do Cortical Columns in the Brain Really Model Whole Objects, Like a Coffee Mug in V1?
One important prediction of the Thousand Brains Theory is that the intricate columnar structure found throughout brain regions, including primary sensory areas like V1 (early visual cortex), supports computations much more complex than extracting simple features for recognizing objects.
To recap a simple version of the model (i.e. simplified to exclude top-down feedback or motor outputs):
We assume the simple features that are often detected experimentally in areas like V1 correspond to the feature input in layer 4 (L4) of a cortical column. Each column then integrates movement in layer 6 (L6), and uses features-at-locations to build a more stable representation of a larger object in layer 3 (L3), i.e. larger than the receptive field of neurons in L4. L3’s lateral connections then support "voting", enabling columns to inform each other’s predictions. Some arguments supporting this model are:
i) It is widely accepted that the brain is constantly trying to predict the future state of the world. A column can predict a feature (L4) at the next time point much better if it integrates movement, rather than just predicting the same exact feature, or predicting it based on a temporal sequence - bearing in mind that we can make movements in many directions when we do things like saccade our eyes. Reference frames enable predicting a particular feature, given a particular movement - if the column can do this, then it has built a 3D model of the object.
ii) Columns with different inputs need to be able to work together to form a consensus about what is in the world. This is much easier if they use stable representations, i.e. a larger object in L3, rather than representations that will change moment to moment, such as a low-level feature in L4. Fortunately this role for lateral connections fits well with anatomical studies.
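The two arguments above can be made concrete with a toy sketch. It is an illustration only, not Monty's implementation: the `ToyColumn` class, the `vote` function, the 2D integer locations, and the three object models are all hypothetical names and data chosen for the example. Each column keeps a set of (object, location) hypotheses, shifts them when the sensor moves, and predicts which features it should sense next. Voting is reduced here to intersecting the sets of candidate objects.

```python
class ToyColumn:
    """Hypothetical column: stores features at locations in object reference frames."""

    def __init__(self, models):
        # models maps object name -> {(x, y) location: feature}
        self.models = models
        # Initially every known location on every known object is possible.
        self.hypotheses = {(obj, loc) for obj, locs in models.items() for loc in locs}

    def _shift(self, loc, displacement):
        return (loc[0] + displacement[0], loc[1] + displacement[1])

    def sense(self, feature):
        # Keep only hypotheses whose stored feature matches the input.
        self.hypotheses = {
            (obj, loc) for obj, loc in self.hypotheses if self.models[obj][loc] == feature
        }

    def predict(self, displacement):
        # Features expected after the movement, given the current hypotheses.
        predictions = set()
        for obj, loc in self.hypotheses:
            new_loc = self._shift(loc, displacement)
            if new_loc in self.models[obj]:
                predictions.add(self.models[obj][new_loc])
        return predictions

    def move(self, displacement):
        # Path integration: update every hypothesised location by the movement.
        moved = {(obj, self._shift(loc, displacement)) for obj, loc in self.hypotheses}
        self.hypotheses = {(obj, loc) for obj, loc in moved if loc in self.models[obj]}

    def candidate_objects(self):
        return {obj for obj, _ in self.hypotheses}


def vote(columns):
    """Toy voting: columns keep only objects that every column still considers possible."""
    consensus = set.intersection(*(c.candidate_objects() for c in columns))
    for c in columns:
        c.hypotheses = {(obj, loc) for obj, loc in c.hypotheses if obj in consensus}
    return consensus


# Hypothetical object models shared by both columns.
models = {
    "mug": {(0, 0): "curved", (1, 0): "curved", (2, 0): "handle", (1, 1): "rim"},
    "bowl": {(0, 0): "curved", (1, 0): "curved", (1, 1): "rim"},
    "can": {(0, 0): "curved", (1, 0): "curved", (1, 1): "flat"},
}

col_a = ToyColumn(models)
col_a.sense("curved")                 # ambiguous: fits mug, bowl, and can
predicted = col_a.predict((1, 0))     # what could be sensed one step to the right
col_a.move((1, 0))
col_a.sense("handle")                 # only the mug has a handle at this offset

col_b = ToyColumn(models)
col_b.sense("rim")                    # ambiguous on its own: mug or bowl

consensus = vote([col_a, col_b])      # column B resolves its ambiguity via column A
```

Note that the columns exchange object identities, which stay stable while the sensor moves, rather than the moment-to-moment features themselves.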
We use a coffee mug as an illustrative example, because a single patch of skin on a single finger can support recognizing such an object by moving over it. With all this said, however, we don’t know exactly what the nature of the “whole objects” in the L2/L3 layers of V1 would be (or other primary sensory areas for that matter). As the above model describes, we believe they would be significantly more complex than a simple edge or Gabor filter, corresponding to 3D, statistically repeating structures in the world that are cohesive in their representation over time.
It is also important to note that compositionality and hierarchy are still very important even if columns model whole objects. For example, a car can be made up of wheels, doors, seats, etc., which are distinct objects. The key argument is that a single column can do a surprising amount, more than what would be predicted by artificial neural network (ANN) style architectures, modeling objects much larger than its receptive field would indicate.
Why is There no Hierarchical Temporal Memory (HTM) in This Version of Monty?
We are focused on ensuring that the first generation of thousand-brains systems is interpretable and easy to iterate upon. Being able to conceptually understand what is happening in the Monty system, visualize it, debug it, and propose new algorithms in intuitive terms is something we believe to be extremely valuable for fast progress. As such, we have focused on the core principles of the TBT, but have not yet included lower-level neuroscience components such as HTM, sparse distributed representations (SDRs), grid-cells, and active dendrites. In the future, we will consider adding these elements where a clear case for a comparative advantage exists.
Alternative Approaches to Intelligence
What is the Relationship of the Thousand Brains Theory to Robotics Algorithms That Use Maps?
There are deep connections between the Thousand Brains Theory and Simultaneous Localization and Mapping (SLAM), or related methods like particle filters. This relationship was discussed, for example, in Numenta’s 2019 paper by Lewis et al, in a discussion of grid-cells:
“To combine information from multiple sensory observations, the rat could use each observation to recall the set of all locations associated with that feature. As it moves, it could then perform path integration to update each possible location. Subsequent sensory observations would be used to narrow down the set of locations and eventually disambiguate the location. At a high level, this general strategy underlies a set of localization algorithms from the field of robotics including Monte Carlo/particle filter localization, multi-hypothesis Kalman filters, and Markov localization (Thrun et al., 2005).”
This connection points to a deep relationship between the objectives that both engineers and evolution are trying to solve. Methods like SLAM emerged in robotics to enable navigation in environments, and the hippocampal complex evolved in organisms for a similar purpose. One of the arguments of the TBT is that the same spatial processing that supported representing environments was compressed into the 6-layer structure of cortical columns, and then replicated throughout the neocortex to support modeling all concepts with reference frames, not just environments.
Furthermore, Monty's evidence-based learning module has clear similarities to particle filters, such as its non-parametric approximation of probability distributions. However, we have designed it to support specific requirements of the Thousand Brains Theory - properties which we believe neurons have - such as binding information to points in a reference frame.
So in some ways, you can think of the Thousand-Brains Project as leveraging concepts similar to SLAM or particle filters to model all structures in the world (including abstract spaces), rather than just environments. However, it is also more than this. For example, the capabilities of the system to model the world and move in it are magnified by the processing of many semi-independent modeling units, and by the ways in which these units interact.
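The localization strategy in the quoted passage, and the idea of representing uncertainty non-parametrically, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not Monty's evidence-based learning module: the circular 1D `track`, the integer feature IDs, the `localize` function, and the +1/-1 evidence scores are all hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical circular track: an integer feature ID stored at each of 8 locations.
track = np.array([0, 1, 0, 2, 1, 0, 2, 2])


def localize(track, observations, movements):
    """Accumulate evidence for every location hypothesis (toy example).

    observations: feature IDs sensed at each step.
    movements: integer displacement between consecutive observations.
    """
    evidence = np.zeros(len(track))  # one value per location hypothesis
    for step, feature in enumerate(observations):
        if step > 0:
            # Path integration: the hypothesis "I am at k" becomes "I am at k + movement".
            evidence = np.roll(evidence, movements[step - 1])
        # Add evidence where the stored feature matches, subtract it elsewhere.
        evidence += np.where(track == feature, 1.0, -1.0)
    return evidence


# The sensor actually starts at location 3 and visits 3 -> 4 -> 6 -> 7.
movements = [1, 2, 1]
observations = [2, 1, 2, 2]

evidence = localize(track, observations, movements)
best_location = int(np.argmax(evidence))
```

No Gaussian or other parametric form is assumed: the array simply holds one evidence value per hypothesis. Because evidence is accumulated rather than used to discard hypotheses outright, a single mismatched observation lowers a hypothesis's score without eliminating it.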
What is the Relationship of the Thousand Brains Theory to Swarm Intelligence?
There are interesting similarities between swarm intelligence and the Thousand Brains Theory. In particular, thousand-brains systems leverage many semi-independent computational units, where each one of these is a full sensorimotor system. As such, the TBT recognizes the value of distributed, sensorimotor processing to intelligence. However, the bandwidth and complexity of the coordination is much greater in the cortex and thousand-brains systems than what could occur in natural biological swarms. For example, columns share direct neural connections that enable them to form compositional relations, including decomposing complex actions into simpler ones.
It might be helpful to think of the difference between prokaryotic organisms that may cooperate to some degree (such as bacteria creating a protective biofilm), vs. the complex abilities of eukaryotic organisms, where cells cooperate, specialize, and communicate in a much richer way. This distinction underlies the capabilities of swarming animals such as bees, which, while impressive, do not match the intelligence of mammals. In the long term, we imagine that Monty systems can use communicating agents of various complexity, number and independence as required.
Why Does Monty Not Make Use of Deep Learning?
Deep learning is a powerful technology - we use large-language models ourselves on a daily basis, and systems such as AlphaFold are an amazing opportunity for biological research. However, we believe that there are many core assumptions in deep learning that are inconsistent with the operating principles of the brain. It is often tempting when implementing a component in an intelligent system to reach for a deep learning solution. However, we have made most conceptual progress when we have set aside the black box of deep learning and worked from basic principles of known neuroscience and the problems that brains must solve.
As such, there may come a time when we leverage deep learning components, particularly for more "sub-cortical" processing such as low-level feature extraction and model-free motor policies (see below). However, we will avoid this until they prove themselves to be absolutely essential. This is reflected in our request that code contributions are in NumPy, rather than PyTorch.
What is the Relationship of the Thousand Brains Theory to Reinforcement Learning?
Reinforcement learning (RL) can be divided into two kinds, model-free and model-based. Model-free RL can be used by the brain, for example, to help you proficiently and unconsciously ride a bicycle by making fine adjustments in your actions in response to feedback. Current deep reinforcement learning algorithms are very good at this (Mnih et al, 2015). However, when you learned to ride a bicycle, you likely watched your parents give a demonstration, listened to their explanation, and had an understanding of the bicycle's shape and the concept of pedaling before you even started moving on it. Without these deliberate, guided actions, it could take thousands of years of random movement in the vicinity of the bicycle until you figured out how to ride it, as positive feedback (the bicycle is moving forward) is rare.
All of these deliberate, guided actions you took as a child were "model-based", i.e. dependent on models of the world. These models are learned in an unsupervised manner, without reward signals. Mammals are very good at this, as demonstrated by Tolman's classic experiments with rats in the 1940s. However, how to learn and then leverage these models in deep reinforcement learning is still a major challenge. For example, part of DeepMind's success with AlphaZero (Silver et al, 2018) was the use of explicit models of game-board states. However, for most things in the world, such models cannot simply be built into a system in the way that the known structure of a Go or chess board can, but need to be learned in an unsupervised manner.
While this remains an active area of research in deep reinforcement learning (Hafner et al, 2023), we believe that the combination of 3D, structured reference frames with sensorimotor loops will be key to solving this problem. In particular, thousand-brains systems learn (as the name implies) thousands of semi-independent models of objects through unsupervised, sensorimotor exploration. These models can then be used to decompose complex tasks, where any given learning module can propose a desired "goal-state" based on the models that it knows about. This enables tasks of arbitrary complexity to be planned and executed, while constraining the information that a single module needs to learn about the world. Finally, the use of explicit reference frames increases the speed at which learning takes place, and enables planning arbitrary sequences of actions. Like Tolman's rats, this is similar to how you can navigate around a room depending on what obstacles there are, such as an office chair that has been moved, without needing to learn it as a specific sequence of movements.
In the long term, there may be a role for something like deep reinforcement learning to support the model-free, sub-cortical processing of thousand-brains systems. However, the key open problem, and the one that we believe the TBT will be central to, is unlocking the model-based learning of the neocortex.
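The office-chair example can be illustrated with a small planning sketch. It is an assumption-laden toy rather than Monty's goal-state mechanism: the room is a hypothetical 3x3 grid, `#` marks the chair, and breadth-first search stands in for whatever planning process a real system would use. The point is only that an explicit spatial model lets the same planner produce a new route as soon as the map changes, with no retraining on movement sequences.

```python
from collections import deque


def plan(grid, start, goal):
    """Breadth-first search over a grid map; '#' cells are obstacles.

    Returns the list of (row, col) cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through the parents to reconstruct the route.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


start, goal = (0, 0), (2, 2)

# Hypothetical room with the chair in the middle.
room_before = ["...",
               ".#.",
               "..."]
path_before = plan(room_before, start, goal)

# The chair is moved to the bottom-left corner: update the map and replan.
room_after = ["...",
              "...",
              "#.."]
path_after = plan(room_after, start, goal)
```

A model-free policy that had memorized one sequence of movements would have to relearn it after the chair moved. Here, only the map is updated.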
Can't Deep Learning Systems Learn "World Models"?
We believe that there is limited evidence that deep learning systems, including generative pre-trained transformers (GPTs) and diffusion models, can learn sufficiently powerful "world models" for true machine intelligence. For example, representations of objects in deep learning systems tend to be highly entangled and divorced from concepts such as cause-and-effect (see e.g. Brooks, Peebles et al, 2024), in comparison to the object-centric representations that are core to how humans represent the world even from an extremely young age (Spelke, 1990). Representations are also often limited in structure, manifesting in the tendency of deep learning systems to classify objects based on texture more than shape (Gavrikov et al, 2024), an entrenched vulnerability to adversarial examples (Szegedy et al, 2013), the tendency to hallucinate information, and the idiosyncrasies of generated images (such as inconsistent numbers of fingers on hands), when compared to the simpler, but much more structured drawings of children.
Instead, these systems appear to learn complex input-output mappings, which are capable of some degree of composition and interpolation between observed points, but limited generalization beyond the training data. This makes them useful for many tasks, but requires training on enormous amounts of data, and limits their ability to solve benchmarks such as ARC-AGI, or, more importantly, to be very useful when physically embodied. This dependence on input-output mappings means that even approaches such as chain-of-thought or searching over the space of possible outputs (e.g. the recent o1 models) are more akin to searching over a space of learned "Type 1" actions, rather than the true "Type 2" (Stanovich and West, 2000), model-based thinking that is a marker of intelligence.