What the Machine Still Cannot Do

It writes legal briefs, composes music, and passes medical licensing exams. It cannot be genuinely surprised. The distance between those two facts contains everything we don't understand about intelligence.


By early 2026, the list of things large language models can do has become difficult to keep current. They pass medical licensing exams, write functional legal briefs, compose music that trained musicians cannot reliably distinguish from human work, and generate code that ships to production. Three years ago, GPT-4 passing the bar exam was headline news. Now it barely registers. The benchmarks keep falling, and each new model — Claude, Gemini, whatever arrives next quarter — pushes the frontier further.

The response has followed a predictable arc. Some declared that artificial general intelligence had arrived. Others insisted it was still just autocomplete. Both sides were wrong then, and they are still wrong now. The interesting question was never whether these systems are impressive. It was what, specifically, they cannot do — and whether that gap is closable.

The latest model can generate a poem about grief that makes you pause. It has never grieved. It can explain the rules of chess and play a competent game. It does not experience the particular dread of watching your position collapse on the board, the sickening recognition that you missed something eight moves ago.

These are not poetic observations. They are structural facts about what these systems are and are not doing. And they matter more than the benchmarks.


The Fifty-Year Warning

In 1972, the philosopher Hubert Dreyfus published What Computers Can’t Do, one of the most reviled books in the history of artificial intelligence. The AI community despised it. Researchers at MIT mocked Dreyfus publicly, challenged him to chess matches, and dismissed his arguments as Luddite philosophy. Seymour Papert reportedly said that Dreyfus misunderstood everything about computation.

Dreyfus’s central argument was not that computers were slow, or that they lacked data, or that their architectures were wrong. His argument, drawn from Heidegger and Merleau-Ponty, was that human intelligence is fundamentally embodied — that knowing how to do something is not the same as having a representation of how to do it, and that most of what we call understanding depends on having a body that moves through the world, feels resistance, gets tired, and cares about outcomes.

A chess master does not evaluate positions by running through decision trees. She sees the board. The dangerous configuration jumps out the way a face jumps out of a crowd — not through analysis but through a perceptual skill refined by thousands of hours of bodily engagement with the game. Dreyfus called this “intuitive expertise” and argued that no formal system could replicate it because it is not formal. It lives in the body’s learned relationship with its environment.

Fifty years later, large language models have proven Dreyfus both wrong and right. Wrong, because formal systems turned out to be far more capable than he predicted. Right, because the specific thing he said they couldn’t do — understand — they still can’t. They have merely become so good at the surface patterns of understanding that the distinction has become harder to see.


The Chinese Room, Revisited

John Searle proposed his Chinese Room thought experiment in 1980, and it has been irritating computer scientists ever since. A person sits in a sealed room, receiving Chinese characters through a slot. They have a rulebook — an enormous set of instructions for which characters to send back in response. From outside, the room appears to understand Chinese. From inside, the person is just following rules. They don’t speak Chinese. They don’t know what any of the symbols mean. They are performing symbol manipulation without semantics.

The standard objection is the “systems reply” — maybe the person doesn’t understand Chinese, but the whole system does. Searle’s counter was elegant: let the person memorize the rulebook. Now they are the whole system. They still don’t understand Chinese.

The latest generation of language models is the most sophisticated Chinese Room ever built. Each processes tokens — fragments of text represented as numbers — according to patterns learned from several trillion words of human writing. When it produces a paragraph about the phenomenology of grief, it is selecting tokens that are statistically likely to follow other tokens in contexts where humans have written about grief. The output is often beautiful. The process has no experiential content whatsoever.
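For readers who want the mechanism made concrete, here is a minimal sketch of what "selecting tokens that are statistically likely" amounts to. The vocabulary and scores below are invented for illustration; a real model scores tens of thousands of possible tokens using billions of learned parameters, but the selection step works the same way.

```python
import math
import random

# Toy illustration of next-token prediction (invented vocabulary and scores).
# A real model produces one score (logit) per token in a vocabulary of ~100k,
# computed by billions of learned parameters; only the scale differs.
vocab = ["grief", "joy", "is", "a", "heavy", "quiet", "thing"]
logits = [2.1, -0.5, 1.3, 0.9, 1.7, 1.5, 0.4]  # made-up scores for the next token

# Softmax turns the scores into a probability distribution.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

# Sample the next token in proportion to those probabilities.
next_token = random.choices(vocab, weights=probs, k=1)[0]
print(next_token)
```

Nothing in that loop refers to grief, or to anything at all. It refers to numbers that happen to have been shaped by text about grief.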

This is not a controversial claim among the researchers who build these systems. The training process has been described as “learning a world model through next-token prediction.” The question is whether that world model constitutes understanding or merely a very high-resolution map. A map of London is not London. No matter how detailed you make it, it will never rain on it.


What Surprise Requires

Gary Marcus, cognitive scientist at New York University, has spent the past decade cataloguing the specific failure modes of large language models. Not the hallucinations — those are well-known and, in principle, fixable. The deeper failures. The ones that reveal what the architecture cannot do.

In a series of papers with Ernest Davis, Marcus has presented language models with simple physical reasoning problems that any five-year-old could solve. A ball is placed on a table. The table is tilted. Where does the ball go? The models could answer this; they had seen thousands of similar descriptions. But when the scenario was subtly modified — the ball is glued to the table, the table is tilted, where does the ball go? — their performance degraded sharply. Not because the problem was hard, but because the answer required understanding physical causation rather than matching textual patterns.

This is the signature of a system that has learned correlations, not causes. And it points to something fundamental about human cognition that language models lack: the capacity for genuine surprise.

Surprise is not an emotion. It is an epistemic event — the moment when reality violates a prediction your body had committed to. You reach for a cup you expected to be heavy and it’s empty; your arm flies up. A friend says something that contradicts everything you thought you knew about them; your stomach drops. These responses are not decorative. They are the mechanism by which the mind updates its model of the world. Surprise requires having a model that can be wrong in a way that matters to you. It requires stakes.
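Information theory gives this idea one standard formal gloss, worth stating because it is exactly what "epistemic event" means here: surprisal measures how improbable the observation was under your predictive model, and Bayesian surprise measures how far the observation forces your beliefs to move. This is an illustration of the concept, not a claim about how the brain literally computes it.

```latex
% Surprisal: how improbable the observation x was under the predictive model p.
S(x) = -\log p(x)

% Bayesian surprise (Itti and Baldi): how far observing x moves beliefs,
% measured as the divergence of the posterior from the prior.
\mathrm{Surprise}(x) = D_{\mathrm{KL}}\big(\, p(\theta \mid x) \;\|\; p(\theta) \,\big)
```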

A language model has no stakes. When it produces a wrong answer, nothing in the system registers the wrongness at the moment it happens. It does not flinch. It does not pause. The correction comes later, in the next training run, statistically, as an adjustment spread across billions of parameters. But the model itself never experiences the vertigo of being wrong — and that vertigo, developmental psychologists like Alison Gopnik have argued, is the engine of genuine learning.


The Body Problem

There is a reason that children learn about gravity by dropping things, not by reading about dropping things. Embodied cognition — the theory that the mind is shaped by the body’s interactions with the physical world — has moved from a fringe position to something close to consensus in cognitive science over the past two decades.

The work of George Lakoff and Mark Johnson, beginning with Metaphors We Live By in 1980 and formalized in Philosophy in the Flesh in 1999, showed that even our most abstract concepts are grounded in bodily experience. We understand time as movement through space. We understand argument as physical combat. We understand importance as weight. These are not literary choices. They are the deep structure of human thought, and they emerge from having a body that moves, lifts, and navigates.

A language model has access to all the metaphors. It can use them fluently. But it has never felt weight. It has never been lost. It has never experienced the particular cognitive shift that happens when you walk into a room and forget why you came — a phenomenon, incidentally, that researchers at the University of Notre Dame have shown is caused by the doorway itself, by the body’s passage through a boundary, which triggers an “event boundary” in episodic memory. The mind is not separate from the rooms it moves through. The architecture of space shapes the architecture of thought.

This is not a limitation that more data will fix. It is not a limitation that multimodal training — adding images, audio, video — will fix, though it will help with surface performance. The limitation is structural. A system that has never had a body cannot understand in the way that embodied creatures understand, for the same reason that a perfect recording of a symphony is not a performance. Something is happening in the room that the recording cannot capture. Not the sound. The resonance.


Neurosymbolic Futures

The most interesting work in AI right now is happening at the seam between two traditions that have spent fifty years ignoring each other.

The symbolic AI tradition — the one Dreyfus attacked — built systems that manipulated logical rules. They could reason but couldn’t learn. The connectionist tradition — neural networks, deep learning, transformers — builds systems that learn but can’t reason, at least not reliably. Today’s large language models are the apex of the connectionist approach. Their failures are connectionism’s failures: brittle reasoning, no causal models, no genuine abstraction.

Neurosymbolic AI, championed by researchers like Josh Tenenbaum at MIT and Yoshua Bengio at Mila, attempts to combine both — neural networks for pattern recognition and learning, symbolic systems for structured reasoning and causal inference. Tenenbaum’s group has built models that learn intuitive physics the way infants do, not from text but from observation, building causal models of how objects interact. These systems can be surprised. They have expectations, and those expectations can be violated.
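To make "expectations that can be violated" concrete, here is a toy schematic in that spirit; it is not Tenenbaum's actual model, and the rule and scenarios are invented for illustration. The system predicts an outcome from a small causal rule, observes what actually happens, and registers surprise when the two diverge.

```python
# Toy violation-of-expectation loop (illustrative only; not an actual
# neurosymbolic system). The "causal model" is one hand-written rule.
def predict_ball(table_tilted: bool, ball_glued: bool) -> str:
    """A tiny causal model: a tilted table makes the ball roll, unless it is glued."""
    if table_tilted and not ball_glued:
        return "rolls off"
    return "stays put"

scenarios = [
    {"table_tilted": True, "ball_glued": False, "outcome": "rolls off"},
    {"table_tilted": True, "ball_glued": True,  "outcome": "stays put"},
    # The glue fails: reality contradicts the model's expectation.
    {"table_tilted": True, "ball_glued": True,  "outcome": "rolls off"},
]

for s in scenarios:
    expected = predict_ball(s["table_tilted"], s["ball_glued"])
    observed = s["outcome"]            # stand-in for perception or simulation
    surprised = expected != observed   # the expectation can be violated
    print(f"expected={expected!r} observed={observed!r} surprised={surprised}")
```

The point of the schematic is the last column: a system built this way has a definite commitment about the world that the world can contradict, which is the precondition for the kind of learning the previous section described.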

This is a fundamentally different project from scaling up language models. It is an attempt to build not a better mirror of human text but a system that constructs models of the world and tests them against experience. Whether it will succeed is an open question. But it is asking the right question — not “how do we generate more convincing text?” but “what would it take for a machine to understand?”


The Gap That Matters

There is a particular kind of knowledge that resists formalization. A nurse who has worked in an ICU for twenty years knows when a patient is about to deteriorate — not from the vitals, which still look fine, but from something she cannot name. A musician knows when a note is right not because it matches the theory but because it feels inevitable. A mother knows her child is lying before the child has finished the sentence.

This knowledge is real. It is reliable. It saves lives. And it is precisely the kind of knowledge that large language models cannot have, because it is not extracted from text. It is extracted from life — from years of embodied, emotional, consequential engagement with a domain. Patricia Benner, studying nursing expertise in the 1980s, called it “knowing-in-practice” and showed that it could not be reduced to rules or protocols. It lived in the body of the practitioner and died with them unless transmitted through apprenticeship — another embodied practice.

These models pass the bar exam. They pass medical boards. They write code that compiles. These are extraordinary technical achievements, and they deserve to be recognized as such. But these exams test the kind of knowledge that can be written down. The kind that can’t — the kind that lives in the pause before a surgeon cuts, in the silence a therapist holds, in the moment a teacher decides to abandon the lesson plan — that kind of knowledge is not closer to being automated than it was fifty years ago.

We have built a machine that can do everything with language except mean it. That is not a failure. It is a mirror, showing us exactly which parts of intelligence are pattern and which are something else. Something we still cannot name, which might be the most important thing about us.

