Planning for AI without LLMs


LLMs are transforming many areas of technology and beyond with capabilities that appear to be endless. However, although language-based AI technologies are undoubtedly extremely impressive, key frontiers of AI such as physical intelligence (i.e. robotics) and independent scientific discovery are still ahead of us. Are LLMs the right paradigm to get us there? Here I examine computations known as planning, which have long been at the centre of AI research, are a core part of animal intelligence, and are widely believed to be a crucial pillar of the aforementioned milestones. I ask whether LLMs are really suited to this space of problems and what alternative paths the field has in mind.

Introduction

AI technologies, and those built on Large Language Models (LLMs) in particular, are at the centre of many polemical contemporary discussions across economics, politics and culture. One such discussion revolves around the trajectory of this technology: how advanced will it be in a year, a decade, a century from now? These analyses have swept up a range of ideas across economics, sociology, ethics and computer science. Conventional concepts such as Universal Basic Income have a long history of association with imagined techno-utopias; other longstanding observations, like Jevons paradox or Moore's law, seem to have found new analogues with an AI flavour; while the outright bizarre has also managed to find a home in this discourse, for instance in the adaptation of Pascal's wager to 'Roko's Basilisk'. For a long time these speculations were exclusively the domain of science fiction writers and subject-matter experts, but recent advancements, and the collective sense of imminence around the potential impact of AI on security, labour markets, scientific discovery and beyond, have propelled these discussions firmly into the mainstream.

Here, I am going to focus on one particular perspective on this set of questions, namely the assertion that LLMs are fundamentally limited: that while they are an incredibly useful and possibly transformative tool, LLMs as a strategy to achieve more general forms of intelligence are doomed to fail. Large Language Models operate in the domain of text, of words, and do so impressively. Ironically, in evolutionary terms, language is a late arrival to our collective cognitive abilities: humans began hunting prey around 1 million years ago, and were foraging for other foods long before that. Modern language, on the other hand, with abstraction and complex grammar, emerged only around 50-100 thousand years ago. Animals far simpler than humans understand, on an intuitive level, concepts like friction and gravity. Within the lifespan of a human, too, language abilities emerge relatively late. Although babies may splutter out their first words within a year, structured communication abilities are typically attained at around 4-7 years old. On the other hand, cognitive capacities like grasping object permanence and forming early memories are reached within months of birth. Yet true intelligence requires this understanding of physics, grounded in the real world; LLMs have skipped ahead, mastering language without being equipped with the right properties to exhibit other forms of intelligence that are second nature to humans and even many animals. The above arguments have been made by various researchers, most notably Turing award winner Yann LeCun, who recently left Meta (for reasons that include disagreements related to the topics discussed here) to start his own AI lab focused on a different machine learning paradigm known as World Models (more on this later).

This contrarian perspective might raise eyebrows among many following current AI hype cycles, but for those familiar with the history of these technologies, disagreement of this kind should come as no surprise. Each generation of AI progress since the field's formal inception at the now-famous 1956 Dartmouth conference has been accompanied by seemingly equal parts excitement and cynicism from within the research community. This typically coincides with a new wave of predictions concerning the timeline to passing increasingly specialized versions of the Turing test, and more broadly to meeting a wider array of intelligence definitions: narrow, general, super, universal. The perceptron led the way in 1957, followed by a wave of symbolic, rule-based algorithms like A*, and later by neural architectures such as the Restricted Boltzmann Machine and CNNs. More recently, fads centred around the likes of GANs (Generative Adversarial Networks) have sandwiched themselves between more resilient eras of RNN-led (Recurrent Neural Network) sequence modelling and efforts in deep reinforcement learning. The latest cycle can be traced back to the advent of the transformer architecture in 2017 and has given rise to by far the most disruptive of these technologies, Large Language Models. Your favourite VC or scaling-maximalist CEO (whose publicly expressed opinions clearly are not influenced at all by their shareholder obligations) might tell you that AI is now all but solved: if compute infrastructure projects could just secure another 100 billion dollars of debt, and we could all stop being so concerned about privacy, copyright, anti-trust, net-zero and other pesky grievances, we would be months away from an intelligence explosion in which we as mere mortals could sit back and watch a tech utopia (or dystopia, depending on your persuasion) materialize before us. However, the history of AI is too short and too full of hyperbolic seasonality of sentiment for such certainty to be universal; debates in academic circles around the best directions for research in AI continue with a tad more caution.

What is an LLM?

In order to understand the nature of the critiques of LLMs, it is necessary to establish a basic understanding of what an LLM is and what it is trying to achieve. It is not actually necessary to understand any of the maths behind these models, or even any of the inner workings (like transformers), to grasp the spirit of the purported limitations. As far as the user is concerned, the purpose of an LLM and cognate frontier models under the banner of "generative AI" is to generate some kind of information based on a prompt or context. For your chatbot (like ChatGPT), this is in the form of words, e.g. you ask a question and it produces (generates) a response. For image- and video-generation models like Midjourney or Sora, the prompt is usually text and the output is a set of pixels. Meanwhile, technologies such as Claude Code have proved to be extremely successful at writing code, in some sense a subclass of language. The capabilities of these models to generate meaningful information are underpinned by a surprisingly simple process in the so-called training phase, i.e. the process in which they are made to gradually perform better and better. In general, machine learning training consists of a few components: (i) there is the model, which in most cases is a parameterised function, that is, a mapping from an input to an output; (ii) there is a dataset consisting of many labelled examples; (iii) there is some criterion which, for a given instance of your data, tells you how well your model is able to match the ground truth (i.e. the label); (iv) finally, there is some mechanism by which we can modify the parameters of the model to improve vis-a-vis this criterion (in most modern ML, this mechanism is the back-propagation algorithm, for which Geoffrey Hinton, one of the founding fathers of ML, also shared in the Turing award).

ML Training Components: Model
(i) The model is typically a parametrised function that maps some input $x$ to some output $f(x)$. You can think of these parameters as lots of little knobs (denoted $\theta$) that can be turned to change the way $x$ is transformed. Taken together, these knobs can make for extremely complex functions that can capture very intricate relationships in data. For instance, modern transformer networks can have parameters numbering in the trillions.
ML Training Components: Dataset
(ii) The dataset consists of examples of input-output pairs where the output is some 'ground truth' label, often denoted $y$. Traditionally in supervised learning, these labels would come from manual processes. In language, the labels are often taken from the text data itself, which is part of what makes this process scalable.
ML Training Components: Objective
(iii) The objective, also known as the loss (which we can denote $\mathcal{L}$) or criterion, is a function that tells you how good your estimate is compared to the label. Given the mathematical notation established in (i) and (ii), this is a comparison between $f(x)$ and $y$. Since $f(x)$ is a function of the parameters (knobs) of the model, this tells us how good the current setting of knobs is for the given example.
ML Training Components: Optimisation
(iv) Armed with a loss on a given example or set of examples (the function $\mathcal{L}$ evaluated on those examples), the aim is to modify the parameters, i.e. adjust the knobs, to reduce the loss. This can be visualised as moving in a targeted direction across the loss landscape with the aim of getting to lower ground. Fortunately, there is a powerful algorithm known as backpropagation that can efficiently compute such an assignment of parameter changes.
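To make the recipe concrete, below is a minimal sketch of the four components in PyTorch. Everything in it (the toy model, the random stand-in data, the hyperparameters) is an illustrative assumption rather than any real pipeline.

```python
# A minimal sketch of the four training components in PyTorch.
# Model size, data and learning rate are all illustrative stand-ins.
import torch
import torch.nn as nn

# (i) Model: a small parameterised function f(x; theta).
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

# (ii) Dataset: input-output pairs (x, y); random placeholders here.
x = torch.randn(256, 10)
y = torch.randn(256, 1)

# (iii) Objective: a loss L(f(x), y) comparing prediction to label.
loss_fn = nn.MSELoss()

# (iv) Optimisation: backpropagation computes gradients of L with respect
# to the parameters; the optimiser nudges the 'knobs' downhill.
optimiser = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(100):
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()   # backpropagation
    optimiser.step()  # adjust parameters to reduce the loss
```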

This recipe has been around for a long time and has been applied to various classes of problems including image classification, time series modelling, event detection etc. Indeed the domain of language has been a key modality for ML researchers for decades. The main axis which has changed in that time has been the class of models used. When I first entered the field the pre-eminent architecture was a so-called LSTM (Long Short-Term Memory), though the simpler RNN was still in use and the Transformer architecture (which underlies today's models) was beginning to gain traction. Most of the rest of this pipeline has remained unchanged: in particular the core task is for a model to predict the next word in a sequence. That this simple task can underlie some very sophisticated tools is not a trivial fact. For a long time, it was not clear that this would be enough; however at this point it is empirically unquestionable that this paradigm can create machines that generate not only coherent language but genuinely useful tools built on language generation. I would argue that this is the core scientific insight of the most recent AI revolution. It is a powerful statement about the world that simply optimising to predict the next word in a sequence can give rise to such a rich understanding, even if only at the surface.
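For language specifically, the same recipe is instantiated as next-token prediction: the labels are simply the input sequence shifted by one position. Below is a sketch of this objective; the deliberately tiny recurrent model stands in for a transformer, and the vocabulary, sizes and random 'text' are made up.

```python
# Sketch of the core LLM training objective: next-token prediction.
# 'TinyLM' (a small GRU) is a stand-in for a transformer; all sizes
# and the random token data are illustrative.
import torch
import torch.nn as nn

vocab_size, d = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab_size)

    def forward(self, tokens):              # tokens: (batch, time)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                 # logits: (batch, time, vocab)

model = TinyLM()
tokens = torch.randint(0, vocab_size, (8, 32))   # stand-in token sequences

# The label for each position is simply the *next* token in the sequence.
logits = model(tokens[:, :-1])
targets = tokens[:, 1:]
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1))
```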

An additional ingredient of the current LLM pipeline, which is a significant complement to the basic recipe described, is the role of reinforcement learning in the 'post-training' phase. Reinforcement learning (RL), which stands in contrast to the supervised learning scheme outlined above, has been a key research area in ML and neuroscience for many decades, and underlies numerous breakthrough achievements such as superhuman gameplay. It is a learning paradigm in which an agent interacts with an environment through actions, after which it can receive reward. In solving an RL problem, the aim is for the agent to develop a strategy (known as a policy) that maximizes the reward it can attain from the environment. For both the purposes of explaining the use of RL for LLM training, and to discuss ideas related to planning later on, it will be useful to introduce the concept of a Markov Decision Process (MDP). This is a fundamental idea from decision sciences that emerged gradually with work from the likes of Andrey Markov and Richard Bellman in the first half of the 20th century and has come to define problems across multiple cognate disciplines. An MDP serves as an abstraction governing the interactions of an agent with an environment (i.e. a world). For instance, in AI, the world might be a game like Chess and the agent is some neural network solver; more practically the world might be a living room and the agent is a Roomba. When modelling animal behaviour in neuroscience, the world might be a maze and the agent is a rodent navigating the maze. Formally, it is defined as a tuple

$\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \gamma, \mathcal{R} \rangle,$

where

  • $\mathcal{S}$ is known as the state space, which contains all the possible configurations that the world can be in.

  • $\mathcal{A}$ is the action space, which specifies the possible actions an agent can take in the world.

  • $\mathcal{P}$ is the state-transition function, which determines how the world will evolve after a given action is taken from a given state.

  • $\gamma$ denotes the discount factor, a number between 0 and 1 that controls how heavily reward is weighted towards short time horizons. For instance, if $\gamma$ is 1, then rewards at all times are weighed equally, while if $\gamma$ is 0.1, rewards in the distant future are effectively ignored.

  • Finally, $\mathcal{R}$ is the reward function, which assigns a scalar number to the agent at each timestep. As the name suggests, it encodes the objectives of the agent, i.e. what is good and bad in the environment.
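Taken together, these components define what the agent is optimising for: the expected discounted return. Writing $r_t$ for the reward received at timestep $t$ and $\pi$ for the agent's policy, the standard formulation is

$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad \pi^{*} = \arg\max_{\pi} \, \mathbb{E}_{\pi}[G_0],$

which makes explicit how $\gamma$ trades off immediate rewards against distant ones.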

The evolution of the MDP and the agent's interaction therewith can also be visualized succinctly in the following loop:

RL Interaction Loop
Schematic depicting the interaction of an RL agent with an environment.
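In code, the loop in this schematic reduces to a few lines. The sketch below assumes a Gymnasium-style environment API (reset/step) and uses a random policy as a placeholder for whatever the agent has actually learned.

```python
# Sketch of the agent-environment interaction loop shown above.
# Assumes a Gymnasium-style API; the random policy is a stand-in.
import gymnasium as gym

env = gym.make("FrozenLake-v1")   # the 'world' (an MDP)
state, _ = env.reset()            # initial state drawn from S

done = False
while not done:
    action = env.action_space.sample()   # the policy pi(a|s); random here
    # The environment applies the transition function P and reward function R.
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```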

While RL has been a cornerstone of modern AI research, it took on a new mantle very recently as a pivotal component of post-training in LLMs. Roughly speaking, in this formulation the environment is word-based insofar as the states are given by the prompts to the model; the LLM is the agent, and the actions it takes are the responses it produces. The reward signal is given in two different ways depending on the nature of the query, both of which are schematized below. For so-called 'verifiable' domains such as coding or maths, where there are clear notions of a correct or incorrect answer, the reward signal can be constructed in a similar way to the win-loss(-draw) games above. This setup is relatively scalable because there are many labelled examples for maths problems (think textbooks) and we can use symbolic solvers to provide more, while for coding it can be as simple as whether the code compiles for certain problems. On the other hand, there are non-verifiable domains where there is no absolute sense of a right or wrong answer, such as general interactions with a chatbot. Here, a clever technique has been developed called RLHF (Reinforcement Learning from Human Feedback), in which human raters examine a set of answers given by the LLM and rank them to provide a reward signal to the model. In both of these cases, a class of RL algorithms called policy-gradient methods is used to change the weights such that the model is better at the respective task according to the specified reward signal.

RL Post-Training Loop
Schematic depicting RL post-training in both verifiable and non-verifiable domains (RLHF).
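A minimal sketch of the verifiable-domain case is shown below, using a plain REINFORCE-style policy gradient. The names `model`, `sample_with_log_probs` and `check_answer` are hypothetical stand-ins, and production post-training uses more elaborate policy-gradient variants; the shape of the computation, however, is the same: sample full responses, score them, and reinforce in proportion to the reward.

```python
# REINFORCE-style sketch of post-training on a verifiable task (e.g. maths).
# 'model', 'sample_with_log_probs' and 'check_answer' are hypothetical
# stand-ins; real systems use more elaborate policy-gradient methods.
def post_train_step(model, prompt_ids, check_answer, optimiser, n_samples=4):
    optimiser.zero_grad()
    for _ in range(n_samples):
        # Sample a full response: a 'trajectory' of token-level actions.
        response_ids, log_probs = model.sample_with_log_probs(prompt_ids)
        # Verifiable reward: 1 if the final answer checks out, 0 otherwise.
        reward = float(check_answer(response_ids))
        # Policy gradient: push up the log-probability of rewarded responses.
        loss = -(reward * log_probs.sum()) / n_samples
        loss.backward()   # gradients accumulate across samples
    optimiser.step()
```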

The RL loop that has been bolted onto the standard supervised learning pipeline has also elevated the prominence of another important epiphenomenon of modern language models known as Chain of Thought (CoT). Consider again the computations behind the function in (i) above: one can think of them as a sequence of transformations applied iteratively to the input. These transformations are unfathomably high-dimensional, which makes them both very expressive and very difficult to interpret. However, in a 2022 paper, researchers showed that by prompting a model to show intermediate steps in its output (like asking a child to show their work), performance on reasoning tasks improved. This became known as CoT prompting, and the actual language trace of this 'thought' within the output of the model is known as the CoT. While it is difficult to make any precise claims about the formal correspondence between the internal computations of the model and the trace it shows, performance empirically improves, which suggests that the presence of CoT or CoT prompting correlates with computations that are more conducive to reasoning. A simpler, if imprecise, explanation for the effectiveness of CoT is that the additional words the model outputs as part of its CoT provide extra context that is in turn fed back into the model, effectively increasing the computational budget assigned to the task. As for RL's role: post-training provides one clear way to explicitly reward a model for producing CoT that leads to good answers.
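To make CoT prompting concrete, here is a minimal illustration (the prompts are invented for this example). The only difference between the two is the demonstration of intermediate steps, which gives the model extra tokens, and hence extra computation, before it commits to an answer.

```python
# Minimal illustration of CoT prompting; prompts are invented examples.
direct_prompt = (
    "Q: A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "A:"
)

# A CoT demonstration: the worked steps are included so the model
# imitates the 'show your work' format on subsequent questions.
cot_demonstration = (
    "Q: A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "A: Let's think step by step. Speed is distance divided by time. "
    "120 / 1.5 = 80. So the answer is 80 km/h."
)
```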

Understanding

Although a great many details of LLM training have been glossed over, at this point most of the material is in place to understand the foundational criticism of LLMs. In the early generations of applications built on LLMs, complaints focused mostly on the phenomenon of 'hallucination', in which these models effectively make stuff up, sometimes with high confidence. This problem has recently been shown to be inherent to the kind of auto-regressive modelling that LLMs rely on, and so cannot be avoided with any guarantees, but improvements have undoubtedly been made with recent model generations. Other growing pains included issues with basic counting and arithmetic, which are mostly easily explained by design choices around tokenization or training data. However, from the perspective of general intelligence, there is a more fundamental issue with LLMs that I want to address here: in the coarsest terms, the criticisms centre around LLMs not being able to 'truly understand' the world. But what exactly does this mean? And how could this be, if these models are able to solve olympiad-level maths problems and write entire codebases from scratch? 'Understand' is of course a very vague term. A more concrete concept is that of a model, or in modern ML parlance, a world model. While there may not be a ubiquitous formalization of world models, one can broadly think of a world model as a conception of how the state of something will develop over time in response to actions taken. Why is this considered important, and why is it equated so commonly with the idea of understanding? At the heart of it is the idea of planning.

Planning

Planning is a process in which some set of actions is identified and evaluated against an objective. In navigation this may mean plotting out a path through some obstacles to a goal; in gameplay it may involve simulating your and your opponent's next set of moves and establishing whether a desirable outcome is likely; and so on. In what would now be considered 'old-school' or symbolic AI, planning was one of the most important research areas. Many early results in the decades around the Dartmouth conference were in this vein, including algorithms like A* and STRIPS, as well as theoretical results around computational complexity and constraint satisfaction problems. Consider the following representative problem: you are taking trains through the United States; you then realise that, even for a hypothetical problem in a blog post, this is surely too fantastical, and consider instead that you are interrailing through Europe. You are currently sipping on a glass of Vranac in the town square of Podgorica, Montenegro, and would like to make your way to Katowice, Poland. It's a leisurely holiday, but you'd like to be efficient, so you want to find the shortest distance to travel. You look at the railway connection map below.

Podgorica Rail Problem
Graph of train connections in Europe. An example search problem is to plan an optimal route from Podgorica to Katowice.

A useful representation for the problem is a tree where the root node is your current position and child nodes correspond to stations reachable from the parent. The first four levels of the tree are shown below.

Podgorica Rail Problem - Tree
The above graph problem can be represented as a tree in which the root node is the origin and successive children nodes are those that can be reached from the parent.

There are various ways to use this tree structure to find a path from the start to the goal, much of which revolves around deciding, given the current node, which node to expand next. A common initial taxonomy distinguishes between uninformed (or blind) search and informed (or heuristic) search. Uninformed search methods (like BFS or DFS) use only the problem specification, have no notion of goal proximity, and frequently also disregard cost (e.g. vanilla BFS and DFS only care about the number of edges, not their lengths). In contrast, informed search methods use heuristics to guide search in more desirable directions. For instance, A* will expand nodes based on the cost incurred so far plus an estimate of the remaining cost. This of course requires a good estimate (known as the heuristic), and designing such estimates was a core focus of symbolic AI research for a substantial period in the second half of the twentieth century. An informed search in this spirit is sketched below.
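This sketch applies A* to a made-up fragment of the rail network; the connections, distances and heuristic values are invented for illustration, not taken from any real timetable.

```python
# Sketch of A*: expand the node minimising g (cost so far) + h (heuristic
# estimate of remaining cost). Graph and heuristic are invented stand-ins.
import heapq

def a_star(graph, h, start, goal):
    frontier = [(h[start], 0.0, start, [start])]   # (g + h, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        for neighbour, cost in graph[node]:
            g_new = g + cost
            if g_new < best_g.get(neighbour, float("inf")):
                best_g[neighbour] = g_new
                heapq.heappush(
                    frontier,
                    (g_new + h[neighbour], g_new, neighbour, path + [neighbour]))
    return None, float("inf")

# Toy fragment of the network; distances in km are invented.
graph = {
    "Podgorica": [("Belgrade", 450.0)],
    "Belgrade": [("Budapest", 380.0), ("Podgorica", 450.0)],
    "Budapest": [("Katowice", 330.0), ("Belgrade", 380.0)],
    "Katowice": [],
}
# Heuristic: optimistic straight-line-style estimates of remaining distance.
h = {"Podgorica": 900.0, "Belgrade": 600.0, "Budapest": 300.0, "Katowice": 0.0}

print(a_star(graph, h, "Podgorica", "Katowice"))
# (['Podgorica', 'Belgrade', 'Budapest', 'Katowice'], 1160.0)
```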

Although the need to design heuristics eventually gave way to more data driven connectionist approaches, symbolic AI continues to play a crucial role in modern AI systems. A well-known example of this is DeepMind's superhuman Go and Chess engines, which combine neural networks with Monte Carlo Tree Search (MCTS). While later versions have made substantive changes, the original setup can be recapitulated quite straightforwardly. There are three main components: two connectionist, and one symbolic—making it a strong poster child for 'neuro-symbolic AI'. On one side are policy networks, neural network models that are trained initially via imitation learning (a subclass of RL methods that is guided by human demonstrations) and subsequently via self-play to output an action for a given state, and value networks, also neural networks that instead output a 'value' of a state, in other words an assessment of how good the current state is. Complementing these is MCTS, which expands nodes in search based on an objective that combines information from both neural network components. At each successive node, both neural networks can provide further information to the search procedure.
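The spirit of that combination is visible in the node-selection rule itself. The sketch below shows a simplified version of the PUCT score used in AlphaGo-style MCTS: a value term (backed by the value network and simulation results) plus an exploration bonus scaled by the policy network's prior. The surrounding tree machinery is omitted, and the exact constants and bookkeeping differ across published versions.

```python
# Simplified PUCT node-selection rule from AlphaGo-style MCTS.
# The full search (expansion, simulation, backup) is omitted.
import math

def puct_score(value, prior, visits, parent_visits, c_puct=1.5):
    # Exploitation: mean value of this child from previous simulations.
    q = value
    # Exploration: high for moves the policy network favours (large prior)
    # that have so far been visited relatively rarely.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return q + u

def select_child(children):
    """children: list of dicts with 'value', 'prior', 'visits' keys."""
    parent_visits = sum(c["visits"] for c in children) + 1
    return max(children, key=lambda c: puct_score(
        c["value"], c["prior"], c["visits"], parent_visits))
```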

We've described above two examples of planning in AI: one toy example you might find in introductory textbooks, and another from close to the frontier of research. More generally, traditional formalisms of planning can be defined with the same MDP abstraction detailed above. In planning, the key assumption is that the transition function, which in essence is the world model, is known. In contrast, modern reinforcement learning does not make that assumption, such that the agent either needs to solve the task without a model (known as model-free reinforcement learning) or by learning that model (known as model-based reinforcement learning). The strong overlap between these formalisms for planning and RL has led to some sharply worded criticisms among, let's call them traditionalists, in other engineering disciplines that modern RL is 'just stochastic control without a model', which is technically correct but certainly underplays the difficulty that those additional constraints impose on the problem.

It can be argued from first principles that planning is an important component of intelligence, in that certain tasks require a form of planning to solve (although, as we will see later, this is hard to do rigorously). Another perspective, and one that is often used when defining or benchmarking general intelligence against animal and specifically human intelligence, is to point to swathes of evidence in the cognitive sciences that animals routinely engage in planning for a multitude of behaviours. This line of thinking dates back to at least 1948, when Edward Tolman introduced the idea of a cognitive map, an internal representation of an animal's environment which permits structured learning and flexible behaviour. His foundational experiments, which have been incredibly influential in neuroscience, demonstrated a phenomenon known as latent learning. Early demonstrations involved placing rats in an environment without rewards, before introducing rewards after acclimatisation. Upon introduction of the reward, rats are rapidly able to solve the task in a way that is inconsistent with pure reinforced habitual learning, and purportedly facilitated by the use of an internal model that was 'latently' acquired in the unrewarded phase. Similar findings have been documented in a wide range of settings across species and task paradigms. Another example of behavioural evidence for planning in neuroscience comes from a seminal human experiment on a two-stage decision task. In this experiment people had to make two successive binary choices between semantically meaningless Tibetan characters. Each first-stage choice led preferentially (with 70% probability) to one of the two second-stage choice pairs. Correct choices were rewarded with money, and the contingencies drifted slowly to encourage continued learning. If a participant made a first-stage choice that led, via the rare transition, to a rewarded second-stage choice, model-free and model-based learning make different predictions about subsequent behaviour. In the model-free case, this sequence of actions should be reinforced, and one would expect a bias towards the same choices again. However this is not what is observed; rather, the opposite first choice is made, because the subject has learned a model of the transition structure, which implies that this alternative is more likely to lead to reward.

Beyond behavioural studies, there is also direct neural evidence for planning signatures in brain activity. This comes primarily from two brain regions: the prefrontal cortex (PFC), which is enlarged in primates compared to other mammals and is generally implicated in executive functions; and the hippocampus, a region central to navigation and relational memory. Again, to show just one striking example from the hippocampus field, consider the arena shown top-down in the figure below.

Choco Wells Arena
Schematic of experiment in Pfeiffer & Foster 2013. A rat navigates back and forth between some remembered 'home' well and a random target well that changes in each trial.

Each dot in the square represents one of 36 wells that can be filled up with delicious cocoa milk. Experimenters placed rats into this environment and recorded from hippocampus as they navigated between HOME and RANDOM wells. For a given recording session, the HOME well was fixed, but in each 'trial' the RANDOM well was assigned anew. The goal for the rat was to leave the HOME well and explore until it found the relevant RANDOM well filled with milk; from there it had to return to the HOME well, which was filled with milk once the rat had reached the previous RANDOM well. High-performing rats are able to remember where the HOME well is, so in the second portion of the trial they can navigate directly back. The first half of the trial, on the other hand, requires exploratory behaviour. After the rat receives a given reward, it will pause as though contemplating its next move. In this time, recordings from hippocampus suggest exactly this kind of deliberation. Cells known as 'place cells' fire in a time-compressed sequence that mirrors their firing during the animal's trajectories through the environment. Before returning HOME, this simulated trajectory corresponds to the direct path back HOME, as though rehearsing the optimal solution, whereas at HOME, before exploration, the trajectories are more variegated, suggesting simulation of possible future paths. This experiment remains one of the strongest pieces of neural evidence for trajectory simulation and evaluation, mirroring the computations we would expect from planning.

More generally, there are various pieces of evidence across cognitive science disciplines through lesion studies, behavioural modelling and neural recordings that animal behaviour is sensitive to unexperienced contingencies, that hippocampus generates structured candidate futures, and that PFC evaluates and selects among these futures, all of which is consistent with an internal model rather than purely habitual learning.

We have established by now that planning has been historically important in the field of AI, and that it underlies animal intelligence. This brings us back to the modern AI landscape: are LLMs and their enhanced siblings, Large Reasoning Models (LRMs), capable of planning? This is a hotly debated and active area of research, which suffers from poor consensus on basic definitions, clear bias on the part of the protagonists on both sides of the debate, and at best murky mechanistic insight into what these models actually do. To make matters worse, many of the empirical tools we might use to probe the models behaviourally are muddied by what is known as training-set contamination, which is extremely difficult to control for given the way current models are trained.

Can LLMs Plan?

Let us first consider at a high level the process of computation happening in the LLM pipeline and make some concrete statements about what the system is and is not doing and how this could relate to planning. At risk of belabouring the point, LLMs are trained to predict the next word (or token). At inference (i.e. after the training phase when we query the model for an output) it successively chooses from some finite number of words what it thinks is best to continue the preceding sequence. On the face of it there is no explicit mechanism to perform operations like search in the solution space or optimisation over that search. However there are in fact some ways in which very good approximations of planning could be taking place. The first mechanism would involve performing planning 'in-context'. Considered an emergent property, in-context learning, which stands in contrast to in-weight learning, has been an intriguing aspect of LLM performance that has gained a lot of attention. For our purposes, the idea is that the actual tokens generated could serve as a working memory for planning computations; they could encode state information that is iteratively passed back through the network for the next token prediction, during which multi-step algorithms could be undertaken. The CoT, facilitated by RL post-training, is in some ways a minimal example of this kind of idea. The second possibility is that there is such an enormous amount of training data available to the model that its pattern recognition capabilities are so enhanced as to make planning at test time effectively obsolete. The idea is that planning computations have been 'amortised' during training and that search functions are compressed into the parameters; this distillation is further substantiated by post-training methods. In a given setting at test time where planning may be necessary the model has enough experience from training that it is able to provide a good continuation without searching afresh. This angle has in fact been formalised under a probabilistic lens both in neuroscience and computer science as 'planning as inference'.

It is tempting to think of the role of RL in post-training itself as planning. In many ways it does resemble it; however, this process of generating multiple answers and choosing (or having humans choose) the best option only happens during training, not during inference. Furthermore, even then, the current methods for this procedure are quite inefficient: they require doing the full sequence of computations to produce entire trajectories as candidates. While it is an 'informed search' insofar as it is guided by a pre-trained language model with a strong and hopefully informative prior, there is no bootstrapping in the sense of intermediate evaluation at each token generation. In the current paradigm there is not really an alternative to this, precisely because the proposals are made in the output space of tokens. Meaning in sentences is not temporally linear; it can often arrive only after a prolonged context. In RL or search-evaluation terms, the reward is non-monotonic. This property breaks assumptions underlying many stepwise planning algorithms; for those to work, planning needs to occur in a more abstract concept space. This is already a direction that has been taken in later iterations of the Alpha/Mu lines of research from DeepMind, where planning serves action selection at inference. It is more difficult to see how pure auto-regressive language modelling will overcome this limitation, but this abstraction approach is being foregrounded by alternative paradigms that we will come to later.

Despite the structural limitations of the current LLM paradigm, it is also true that by many measures LLMs excel at a range of very difficult tasks (as measured by human standards) that on the face of it require strong reasoning abilities. These include maths, coding, logic and other very structured problem spaces. This suggests that some combination of amortized inference and in-context planning can still be an effective recipe. Digging slightly deeper, however, there is converging evidence that when pushed in novel directions that deviate enough from the training data, models struggle to generalise these abilities. A prominent paper in this domain, 'The Illusion of Thinking', was published last year and garnered a lot of attention for its examination of fundamental reasoning ability in LLMs through a series of experiments on synthetic, procedurally-generated problems in which complexity could be controlled and the data-contamination problem could be mitigated. The authors compared the performance of LLMs and LRMs in different regimes of complexity, finding that while LRMs can solve problems of moderate difficulty, and outperform LLMs in this regard, there is a complexity threshold beyond which both collapse completely. The failure mode is attributed to the lack of explicit reasoning algorithms and consequent inconsistency when pushed beyond what the models can extrapolate from their training data. An example problem they studied was the famous Tower of Hanoi game. While the original game has three rings and three poles, it can logically be extended to size $N$ with $N$ rings (still with 3 poles) under the exact same instructions and constraints. Some specific results in this study were later disputed and semi-refuted; for instance, it was found that the solution description of the Tower of Hanoi problem at some complexity level exceeded the token budget of the model, meaning it was impossible to solve by construction. But overall the results and conclusions have been broadly accepted by the community.
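The exponential blow-up here is easy to see directly: the optimal solution to the $N$-ring puzzle takes $2^N - 1$ moves, so merely writing out the answer grows exponentially with $N$. A short sketch:

```python
# The N-ring Tower of Hanoi: the optimal solution has 2**N - 1 moves, so
# simply *printing* the answer grows exponentially with N -- which is how a
# solution can exceed a fixed token budget by construction.
def hanoi(n, src="A", aux="B", dst="C", moves=None):
    if moves is None:
        moves = []
    if n == 1:
        moves.append((src, dst))
    else:
        hanoi(n - 1, src, dst, aux, moves)   # move n-1 rings out of the way
        moves.append((src, dst))             # move the largest ring
        hanoi(n - 1, aux, src, dst, moves)   # move n-1 rings back on top
    return moves

for n in (3, 10, 20):
    print(n, len(hanoi(n)))   # 7, 1023, 1048575 moves
```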

Detractors will question how convincing the arguments made in papers like 'The Illusion of Thinking' really are, because they do not constitute anything close to a formal proof. Most papers in this domain, on either side of the argument, follow a similar blueprint: present a problem, probe whether their favourite LLM can solve it, and conclude that such models therefore can or cannot solve that class of problem in general. There are numerous issues with this approach. Firstly, it is extremely difficult to ascertain what is and isn't in the training data and how that might relate to the test problem and a positive result. Secondly, there remain various resource constraints in model training and inference, limiting the universality of any negative results with respect to the overall approach of LLMs. Studies in this vein can still make for very useful and thought-provoking papers, but they are unlikely to sway too many opinions. To understand one alternative approach, which granted is probably much more difficult to engage in, I will introduce three further organizational abstractions in the field of learning and computation. These are function classes, the computational complexity hierarchy and the Chomsky hierarchy, which are shown in the diagram below.

Computational Hierarchies

In the first diagram, the largest space shown is the function space $\mathcal{F}$; this contains every possible function, i.e. every way in which some input can be transformed to some output. Within that space are various sub-spaces; in estimation problems these are specifically called hypothesis spaces, which define the set of functions that you can achieve with a given class. For example, if my function takes the form $y = mx + c$, where $x$ is the input, $y$ is the output and $m$ and $c$ are free parameters, then I can represent any linear function. If $y = ae^{kx}$, where $a$ and $k$ are free parameters, then I can represent any exponential function. Individual points in this space represent specific functions (e.g. a specific choice of $m$ and $c$ in the linear case). Modern neural networks, which underlie the technologies we are discussing here, have many free parameters and can represent a huge number of functions as a result; in fact there are formal results showing that various neural network classes are approximately 'universal'. Graphically, we can picture this as the hypothesis space $\mathcal{H}$ roughly filling up $\mathcal{F}$ when $\mathcal{H}$ is a neural network.

The next two diagrams are classic taxonomies from the field of computational complexity. The first is a computational complexity hierarchy that shows, in successively larger rings, distinctly more 'complex' problem spaces. Each ring gives a specification for the time and space (memory) resources required to solve the problem. For example, PSPACE corresponds to problems needing polynomial memory as a function of the problem size. This is a very coarse version of this diagram: computer scientists continue to refine these categorizations, finding and merging subspaces for given model classes, such that these diagrams can be drawn with hundreds of rings in what has rather affectionately been termed the 'complexity zoo'. Importantly, these classes are defined with respect to a computational model, which is the domain of the final diagram, the Chomsky hierarchy. Here the focus is the structure of the memory that a model is endowed with; on one end, regular grammars correspond to finite automata, while recursively enumerable languages rely on Turing machines. There are numerous formal results mapping between the Chomsky hierarchy and different versions of the computational complexity diagram. In particular, there is a direct relationship between each ring in the Chomsky hierarchy and a Turing machine with a specific space restriction. Most modern complexity science focuses on the Turing machine as a model of computation, and thus the Chomsky hierarchy of interest has largely collapsed to a single ring.

It is commonly assumed that neural network models like transformers are Turing complete (meaning they are functionally equivalent to a Turing machine), which suits this collapse; these assumptions have been vindicated by recent theoretical analyses, although the assumptions on context length, precision etc. upon which these results rely are not satisfied in practice. This has implications for how we might think about problems like planning, where we separately have a range of results (primarily from earlier eras of AI research) on where different algorithms lie in the complexity hierarchy. For example, certain formulations of deterministic finite-horizon planning are PSPACE-complete. Whether or not current transformer models are truly Turing complete and, if not, the extent to which they are approximations thereof, will determine our understanding of many of these worst-case bounds in this tradition.

The discussions above lay out a path to finding more formal answers, from a computability perspective, to the question of whether LLMs can plan. However, there is yet another theoretical hitch in all this. Even if you could prove that a given language model architecture had the requisite representational and computational expressivity to solve a given class of complex problem that included planning, it would not actually be enough, and given what we know about theory in other domains of deep learning, it might be far from it. This is precisely because of the recipe I outlined earlier: neural networks are not hand-programmed to the solution (if only it were that easy); they are trained. During the process of training, which depends on myriad interacting variables including the data, optimizer and architecture, the function that your network represents takes a path from some random state to some specific state. But there is no guarantee that the function you are looking for can actually be found in this process, even if you know it theoretically exists. By analogy, imagine the bakery that serves your favourite cake lists the exact ingredients on the menu but does not provide the instructional recipe. To make matters worse, you must start your attempts at replicating the cake with a random recipe that you can only modify incrementally in each successive attempt. You know you have the building blocks of the cake, but the prospects of getting to the solution are slim. This space of problems sits at the intersection of numerous notions of 'hardness': representational, statistical and optimizational. There are numerous questions that need to be answered:

  • In what complexity class is the planning problem?

  • Can an instance of your LLM represent the relevant functions for planning: value function, belief state, etc.?

  • Can an instance of your LLM compute the necessary operations with those functions: branching, etc.?

  • Is it feasible to find this instance of your LLM with your training procedure?

There are multiple specific results for the first type of question, but they are highly caveated, and most 'real-world' problems are relaxed versions of these, up to the precision that we would require to deem the problems solved or solvable. For the second and third questions, most large models are thought to satisfy both in principle. But together these 'relaxations' are more than technicalities, because ultimately these models are resource constrained (finite context length, finite precision etc.).

The attack surface for questions on LLM capabilities, even specifically vis-a-vis planning, is very large—both theoretically and empirically. Scientifically they can all be interesting, but my view is that some will be more pertinent than others in informing our collective strategies regarding the path to general artificial intelligence.

Contenders & Pretenders

At this point most people probably are in agreement that LLMs can approximate some sort of planning through mechanisms akin to amortised inference, and can do so very effectively for a restricted class of problems. It is also clear that this kind of approach to planning is neither particularly efficient nor anything like the solution that evolution has found in natural intelligence. Despite this, there is a belief that simply continuing to scale the current paradigm will continue to bring about progress in increasingly more general problem settings including those involving more advanced planning requirements. Any inefficiency in the approach will either be ignored on account of "well who cares how many lakes we drain in the process", or worked around with clever tricks at the margins of the paradigm. However, this is not the only game in town; calls from opponents of the current regime to invest in alternative ideas are growing, and they are being made from titans of the field no less.

Alternatives are represented by a few partially overlapping schools of thought. There is the experience-driven camp, led by titans like Rich Sutton and David Silver, who believe that all of intelligence is moulded through many interactions and experiences with the world. This philosophy correlates with the primacy of Reinforcement Learning and a process of learning shaped purely by a scalar reward signal; evolution or sufficient learning resources will take care of the rest. These ideas can be related to some lines of work on evolutionary algorithms from the likes of Kenneth Stanley and Jeff Clune. In research areas rooted more directly in robotics and physical artificial intelligence, there are efforts to train foundation models in an 'embodied' way, such that the agents are endowed more directly with a sense of physics. Another approach is built around the idea of diffusion. These techniques are already ubiquitous in image and audio models because they work more naturally in continuous spaces, but efforts in language modelling with diffusion are ongoing. Many of the immediate benefits of diffusion are computational, but there are also properties of these models that overcome some of the limitations of auto-regressive language models for planning. For instance, because diffusion iteratively refines its estimates in a non-causal way (in terms of sentence order), it could be better suited to bootstrapping. Finally, there is the direction of world models. Efforts in various guises are underway with the Genie and Dreamer models at DeepMind, as well as the odd startup here and there. Arguably the mathematical formalism for world models, and a concrete vision for a path towards general intelligence built around this formalism, has been pushed forward more than anyone else by Yann LeCun. In his position paper 'A Path Towards Autonomous Machine Intelligence', he outlines a vision for autonomous intelligence that leverages multiple learning paradigms (including vision and control modules), but that above all sits on a learned world model trained via self-supervised learning. This world model, contrary to some of the examples mentioned before, is designed with planning in mind.

The philosophy of learning underpinning LeCun's vision of world models is called JEPA: Joint-Embedding Predictive Architecture. While there has been a spate of proposed implementations of JEPA across vision, robotics and RL, the underlying idea is always the same, and it crucially shifts the nature of the prediction problem away from the current mainstream across machine learning, namely making and evaluating predictions in the output space (as depicted in the recipe at the top of this article, e.g. in the space of words), to doing so in a latent space. Definitionally, a latent space is one that is not observed; in neural network terms this effectively corresponds to a projection of inputs to some space that is neither the input space nor the output space. Moreover, and conceptually crucial for JEPA, a latent space should compress the overall information content in a structured way. To illustrate this idea, consider the following JEPA-styled architecture for vision, named DINO:

DINO Schematic
Schematic of DINO architecture.

DINO follows a teacher-student setup in which there are two neural networks, one of which provides the effective label (the teacher) and one of which is being trained (the student). In this case the teacher and student receive as input different corruptions of the same underlying image. They both encode these images into some latent space, and the student is trained to produce an encoding similar to the teacher's. Crucially, there is no reconstruction here: the loss is computed in the latent space. Over time the network will learn an informative encoder for the dataset of images, which can be used for various downstream tasks. Learning a vision encoder in this way is neat, but the same philosophy can also be leveraged to learn world models. There have been numerous attempts at this too, but in the same lineage of work as DINO is DINO-WM, which takes a similar approach to learning environment dynamics in a range of simulated robotics tasks. The authors use a later variant of the vision model described above (DINOv2) to encode a series of observations into a sequence of latent states. They then use a vision transformer (ViT), an image-processing model with architectural properties similar to the sequence models used in LLMs, to predict the next latent state from this sequence and the actions taken in between. At test time, they can use this predictive world model to do planning. And because the predictions take place in a compressed latent space, the computations are far more efficient, and relatively straightforward algorithms for control suffice. A minimal sketch of the DINO-style training scheme is shown below.
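This sketch comes with loud simplifying assumptions: a tiny linear encoder instead of a ViT, additive noise as the 'corruption', and a plain cosine loss in latent space. Real DINO uses temperature-sharpened softmax outputs with centring, but the structural points (loss in latent space, no reconstruction, EMA teacher) are the same.

```python
# Sketch of DINO-style teacher-student training: two views of the same
# image, loss computed between *latent* encodings, no pixel reconstruction.
# The tiny encoder, noise 'corruption' and cosine loss are simplifications.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False        # no gradients flow into the teacher

def augment(x):
    return x + 0.1 * torch.randn_like(x)   # stand-in image corruption

def dino_step(images, optimiser, ema=0.996):
    z_student = student(augment(images))
    with torch.no_grad():
        z_teacher = teacher(augment(images))    # a different view
    # The loss lives in latent space: pull the student's encoding
    # towards the teacher's encoding of the other view.
    loss = -F.cosine_similarity(z_student, z_teacher, dim=-1).mean()
    optimiser.zero_grad(); loss.backward(); optimiser.step()
    # The teacher is an exponential moving average of the student.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
    return loss.item()

images = torch.randn(16, 3, 32, 32)                       # stand-in batch
optimiser = torch.optim.Adam(student.parameters(), lr=1e-4)
dino_step(images, optimiser)
```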

To understand why this shift is considered fundamental to ambitions of general planning agents, consider the example of video prediction. In contrast to language, video prediction remains very difficult, in part because it is much higher dimensional. Next-token prediction in language corresponds to choosing one option out of around one hundred thousand (by modern LLM vocabulary sizes). This is a large number to be sure, but consider a video in standard definition (moderate by today's video standards): each frame contains over three hundred thousand pixels, each of which can take a range of continuous values depending on the representation choice. So rather than making one token prediction, you are making three hundred thousand pixel predictions, and each is over an effectively continuous space, versus a discrete vocabulary of around one hundred thousand. This is intuitively intractable, and indeed this difficulty has been extensively formalised mathematically as the curse of dimensionality. Computationally, the only hope is to compress this information and make predictions in a constrained space. This is the core motivation behind JEPA and related ideas. When animals, including humans, move around the world, they are able to make accurate predictions about how the world will evolve insofar as it matters for behaving in the world. If I tip over a glass I cannot predict precisely which area of my table will get wet, but I know that the water will spill out of the glass.
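As a back-of-envelope comparison (assuming a 640 × 480 standard-definition frame and a vocabulary of around $10^5$ tokens):

$640 \times 480 = 307{,}200 \text{ continuous pixel predictions per frame} \quad \text{vs.} \quad 1 \text{ draw from } {\sim}10^{5} \text{ discrete tokens per step.}$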

Conclusion

The past ten years have been an incredibly exciting time in the field of AI. Ironically, however, depending on your vantage point (i.e. unless you are working on LLMs specifically), the last few years have probably been more interesting at the intersection of commerce and AI than in AI research itself, because investment and funding pressures have shrunk the diversity of ideas and funnelled attention overwhelmingly to a single paradigm. There is a feeling that the tide will slowly turn again in the next few years and the field can return to an exploratory mode with many new and exciting research directions. World models are sure to be at the forefront of this exploration.

For my part, I am interested in related questions at the intersection of natural (i.e. neuroscience) and artificial intelligence that may yet inspire and inform the development of better machine learning of the kind discussed here. These include:

  • What properties of neural representations enable good planning?

  • What aspects of biological learning (learning rules, modularity, consolidation etc.) enable the formation of these representations? How can we train artificial models to attain these properties?

  • How do planning and the use of model-based learning interact with and complement more model-free behaviour?

  • What are the benefits and limitations of planning as inference vs. search-based planning in both natural and artificial systems? How do these trade-offs differ across model types, task structure, prediction space etc.?

Is the LLM + scaling approach sufficient? Time will tell; and if current rates of investment and concentration of talent around the problem are anything to go by, we may find out very soon. If not, we will have to wait for other areas such as world models with latent state predictions to gain traction and mature. Stock markets and LinkedIn might brand you a heretic for betting against LLMs, but I feel strongly that more explicit world modelling will occupy a place in future AI stacks. Of course, this is a false dichotomy, fueled by the demands of debate in the social media age. In truth the road to generally capable artificial intelligence is a long and varied one, likely with a few more twists and turns to come, and almost certainly one that will require multiple approaches to traverse.