How Machines Think

The most expensive blind spot in capital allocation right now is a single unexamined assumption: that we know what thinking is. We don't. We never have. And the question is now load-bearing for the largest reallocation of capital in modern history.

This is not a philosophical complaint. It is a structural one. Every model of return from artificial intelligence rests, sooner or later, on a definition of cognition that no one in science has been able to defend. Building investment theses on top of that is not bold. It is sloppy.

What We Pretend to Know

Walk into any investment committee discussion of AI and the chain is identical. Models will get more capable. Capability will translate into productivity. Productivity will translate into returns. Every link in that chain assumes thinking is a single, stackable, measurable substance — and that machines are converging on it from below.

Six serious schools of thought disagree about what consciousness is. Mysterianism. Deflationism. Panpsychism. Illusionism. Integrated Information Theory. Global Workspace Theory. They do not merely disagree on the answer. They disagree on what they are measuring. After half a century of focused work by some of the most capable minds in science, there is no consensus on what thinking is, where it lives, or whether the question is even well-formed.

This matters because the absence of a definition is not noise that washes out at scale. It is the load-bearing column. If we cannot say what thinking is, we cannot measure the distance from current models to "thinking." Every roadmap that promises human-level AI in N years is, structurally, a roadmap whose endpoint is undefined.

The convention — that capability is a one-dimensional ladder with a clear top — is a recommendation, not a rule. It is also wrong.

A Brief History of Forgetting

Three-point-eight billion years ago, a bubble divided. Two billion years later, one bubble ate another and kept it as a guest — that is where your mitochondria came from. Six hundred million years ago, animals appeared. Five hundred million years ago, the front of an animal grew a brain. Three hundred thousand years ago, our species emerged with the most expensive brain on the planet. Five thousand years ago, in Mesopotamia, someone scratched marks into clay to track beer, and memory escaped biology for the first time.

That sequence — bubble to brain to writing — took 3.8 billion years. The next sequence — writing to algorithm to silicon to neural network — took five thousand. The acceleration is not surprising once you look at it directly. It is the only path it could have taken. Each new substrate moves at the speed of the substrate beneath it. Biology is slow. Symbols are faster. Electricity is faster still. Light is the floor.

The mistake almost everyone makes when looking at this arc is treating the most recent step — large models — as if it were the climax. It isn't. It is the next renovation in a five-hundred-million-year remodel.

Thinking Was Always Plural

The story we inherit is that thinking happens in a brain, in a single inner voice, in a sequence of clean conscious thoughts. None of that is reliably true.

Octopuses keep two-thirds of their neurons in their arms; each arm processes information with substantial independence. Slime molds — which have no neurons at all — solve mazes and, given food at the right locations, reproduce the topology of the Tokyo rail network. Bacteria detect chemical gradients and move toward sugar; the behavior is simple enough to dismiss as chemistry, and yet functionally indistinguishable from a decision. The line between cognition and physics has never been clean.

The variation within our own species is no less strange. Some people have no inner monologue at all — a condition called anendophasia. They think, reason, decide. They simply do not narrate. Some people have no mental imagery whatsoever: aphantasia. They cannot picture an apple but can describe one fluently and use the concept without friction. There is no single architecture of thought even inside human heads.

The investment relevance is not subtle. The question "do machines think?" assumes the answer is a clean yes or no on a dimension we share. There is no such dimension. Thinking is a bag of loosely correlated phenomena, distributed across substrates, only weakly tied to what introspection reports. Every framework that treats it as a single quantity is selling itself a story.

The Extended Mind Was Always the Answer

The first AI was a clay tablet. When a Sumerian merchant pressed marks into wet clay around 3200 BC to track barley and beer, he did something the species had never done before: he moved a piece of his memory out of his head and into matter. From that moment on, every cognitive invention sat on the same trick. The abacus moved arithmetic into beads. Cuneiform became script became print became search. Each layer extended cognition further into the world.

Andy Clark and David Chalmers gave this a name in 1998 — the extended mind — but the practice is more than five thousand years old. Your phone is part of your thinking. Your calendar is part of your memory. Your spreadsheet is part of your reasoning. We have been quietly outsourcing cognition for fifty centuries. The thing now being called artificial intelligence is not a break in the arc. It is the next stretch of it.

This reframing is the entire game. The right question is not "when will machines think for themselves?" The right question is "which tools compound the human-machine cognitive system most effectively?" These are different questions. They imply radically different capital allocation.

What the Misframe Costs

The misframe is AGI as a destination. It produces investment behavior that looks like a race for raw model capability — bigger models, more parameters, larger compute budgets — on the implicit assumption that capability is one-dimensional and has a clear summit. This is a category error. There is no summit. There is a network of cognitive functions distributed across humans and machines, and the value lives in the system, not the node.

Capital is being allocated as if the prize were the brain. It isn't. The prize is the connective tissue — the data flows, the interfaces, the coordination layers, the trust mechanisms that let humans and machines act together at scale. This is consistent with how every prior cognitive technology compounded. Cuneiform did not win because clay was special. It won because it became the substrate everything else was built on top of. The same logic will determine which AI bets compound and which ones evaporate.

The narrative pull right now — that the model is the moat — is precisely the kind of force that produces systemic mispricing. Most investment research is theater built around it. The companies whose stock prices are most sensitive to "next-generation model" announcements are usually the ones whose moats are weakest. The companies whose moats are real are quieter, and their reports rarely make the headlines.

What Compounds in the Cognitive Era

If thinking is plural, distributed, substrate-agnostic, and primarily extended through tools, the durable asset class in this era is not models. Models commoditize. Compute commoditizes. The durable asset is whatever makes the human-machine cognitive system more capable than its competitors.

Three places that compound, in roughly increasing depth:

  • Data. Specifically, the kind of behavioral and contextual data that improves a system's ability to participate in a particular human's thinking. Internet-scale pretraining is the floor. The moat is proprietary data tied to specific workflows, judgments, and decisions — data nobody else can synthesize.
  • Interface. The surface where human cognition and machine cognition meet. The companies that compound through the next decade will not be the ones with the smartest models in isolation. They will be the ones whose interfaces let a human and a machine think together with the lowest friction.
  • Trust. The institutional, legal, and behavioral mechanisms that allow critical decisions to be partially delegated to a machine. Trust is slow to build, but it is the deepest moat once established. It is the reason regulated industries — finance, medicine, law — will look like the slowest adopters and end up the largest beneficiaries.

What is conspicuously not on this list is raw model capability. Capability matters, but it is the input most rapidly commoditized by the input below it (compute) and the input below that (energy). Capability is the consumable. Data, interface, and trust are the compounders.

This is also where the forces I write about elsewhere earn their keep. Capability follows reversion gravity — once a frontier becomes feasible at scale, the field pulls toward parity faster than the leader expects. Compounding lives in the substrates, not the spikes. Adaptation tempo determines who captures the surplus released by each capability frontier; the firms that move within months while the rest of the industry takes quarters will look, in retrospect, like they got lucky. They didn't. They were operating on the right tempo.

Field Guide

Inside the Forward Pass

Pull the cover off and watch a machine "read" a sentence. The model below is GPT-2, the small 124-million-parameter version. It is small by modern standards. It is also enough to make every move that the trillion-parameter models make. The architecture itself comes from a 2017 paper called Attention Is All You Need; that paper now sits north of two hundred thousand citations, which is one way of saying every modern model is a descendant of one document. Pick a phrase — "The answer is" will do. The machine's job, given those three words, is to produce a fourth.

Tokens

The phrase is split into tokens, not words. Tokenization is its own compression scheme: common substrings get a single token; rarer ones get split into two or three. Everything the model reads must be expressed in a fixed vocabulary of 50,257 such tokens — that is GPT-2's vocabulary.

The vocabulary itself is learned, not authored. The most common scheme — byte-pair encoding — starts with raw characters (raw bytes, in GPT-2's case), counts the most frequent adjacent pairs, and merges them into new tokens. Then it does it again. After tens of thousands of merges, the most common chunks of the training corpus have been promoted to single tokens; rare words remain composed of multiple subword pieces.
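Roughly, in code. The sketch below is a toy version of the merge loop, not GPT-2's actual tokenizer: it starts from characters, repeatedly finds the most frequent adjacent pair, and promotes that pair to a new token.

```python
# Toy byte-pair encoding: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def bpe_merges(text: str, num_merges: int):
    seq = list(text)                        # start from raw characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # merge every occurrence of the winning pair into one token
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

tokens, merges = bpe_merges("the answer is the answer is the answer", 6)
print(merges)   # frequent character pairs, promoted to single tokens
print(tokens)   # the text, now expressed in the merged vocabulary
```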

Embeddings

Each token id is replaced by a 768-dimensional vector. The model also adds a position embedding to encode where the token sits in the sequence. The output is a 3 × 768 matrix — three rows, one per token, each row a high-dimensional fingerprint.
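In code, the step is two table lookups and an addition. A minimal PyTorch sketch with GPT-2 small's sizes (the 1,024-position context length is GPT-2's); the token ids here are placeholders, since real ids come from the tokenizer.

```python
import torch
import torch.nn as nn

vocab_size, n_positions, d_model = 50257, 1024, 768
wte = nn.Embedding(vocab_size, d_model)       # token embedding table (learned)
wpe = nn.Embedding(n_positions, d_model)      # position embedding table (learned)

token_ids = torch.tensor([[11, 222, 3333]])   # placeholder ids for three tokens
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)   # [[0, 1, 2]]

x = wte(token_ids) + wpe(positions)           # one 768-D vector per token
print(x.shape)                                # torch.Size([1, 3, 768])
```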

This is where meaning starts. Two tokens with similar surrounding usage end up with vectors that are geometrically close. The model has never been told that. It learned the geometry from text.

This geometry isn't decoration. The classic illustration: take the vector for king, subtract man, add woman, and you land near queen. The arithmetic is approximate, not exact, and works less cleanly in modern models than in early word-vector experiments — but it captures something real. The model has, in the course of fitting language, organized concepts so that gendered pairs, plural-singular relations, capitals-of-countries, and verb tenses all sit along consistent directions in the 768-dimensional space.
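You can run the arithmetic yourself on classic static word vectors. A sketch using gensim and its downloadable GloVe vectors; the library and the dataset name are my choices for illustration, not anything GPT-2 uses.

```python
import gensim.downloader as api

# Downloads roughly 130 MB of pretrained GloVe vectors on first run.
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman: ask for the nearest vectors to that point.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the top result; the match is close, not exact.
```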

One more wrinkle. In a static word vector, bank has one fixed embedding — the same one whether it appears in river bank or investment bank. In a transformer, the embedding is only the starting point. By the time the vector emerges from twelve layers of attention and MLP, it has been rewritten to reflect its actual usage. Bank in river bank and bank in investment bank end up far apart, even though they began at the same point. Meaning is not stored in the embedding table. It is assembled in the stack.
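The difference is easy to see empirically. A sketch with the Hugging Face transformers library: pull the final-layer vector for "bank" in three sentences and compare them. The specific sentences are mine; the effect should hold for most sense-contrasting pairs.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

def final_vector(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).last_hidden_state   # [1, seq_len, 768]
    return hidden[0, -1]                          # vector for the last token ("bank")

river  = final_vector("She sat down on the river bank")
money  = final_vector("He deposited the cash at the investment bank")
river2 = final_vector("They walked along the river bank")

cos = torch.nn.functional.cosine_similarity
print(cos(river, money, dim=0))    # expected lower: same word, different senses
print(cos(river, river2, dim=0))   # expected higher: same word, same sense
```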

Position

Embeddings tell the model what each token is. They do not tell it where each token sits. Without help, a transformer treats "the dog bit the man" and "the man bit the dog" identically — both are unordered collections of the same five tokens.

The fix is a positional embedding: a vector added to each token's embedding that depends only on its position in the sequence. The naive options fail in interesting ways. Adding raw integers (0, 1, 2, 3) creates a scale mismatch — the position values quickly dominate the semantic ones. Using binary representations creates discontinuities — adjacent positions have very different bit patterns.

The 2017 paper proposed a sinusoidal pattern. Each dimension of the position vector is a sine or cosine wave, with each successive dimension oscillating at a slower frequency. Low dimensions encode fine-grained position; high dimensions encode coarse position. Together they give every position a unique, smooth, bounded fingerprint.
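The formula is compact enough to write out. A numpy sketch of the 2017 encoding: each even dimension is a sine, each odd dimension the matching cosine, with frequencies falling as the dimension index rises.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model=768):
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # lower dims oscillate faster
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

pe = sinusoidal_positions(16)
print(pe.shape)   # (16, 768); every row is a unique, smooth, bounded fingerprint
```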

One detail worth getting right. GPT-2 doesn't actually use the sinusoidal pattern. It learns its own position vectors directly, treating position as just another thing the network can fit. The pattern described above is the original 2017 design; GPT-2's learned positions look different in detail but obey the same design pressure: position values must be smooth, bounded, and unique. The mechanism is what matters; the parameterization can be hand-designed or learned.

Attention

Here is the move that makes transformers transformers. Each token gets to look at every prior token in the sequence and pull information from the ones it cares about.

Mechanically: each token produces three new vectors from its embedding — a query (what am I looking for?), a key (what am I?), and a value (what do I have to offer?). The model takes the query of every token and dot-products it against the key of every token it is allowed to see, itself included. The result is a matrix of scores. Higher score, more attention.

The scores get passed through a softmax — squeezed into a probability distribution over the visible tokens. Each token's output is then a weighted average of those tokens' values, with the weights given by the softmax. Tokens that mattered get more weight; tokens that didn't get less.

There is one twist: causal masking. A token at position t can only attend to tokens at positions ≤ t. The future is masked out. That is the rule that makes the model auto-regressive: it can only condition on what has already been said.
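Put together, one attention head is a few lines of linear algebra. A numpy sketch with made-up weights: queries and keys produce scores, the causal mask hides the future, softmax turns scores into weights, and the output is a weighted average of values.

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv               # (T, d_head) each
    scores = q @ k.T / np.sqrt(k.shape[-1])        # (T, T) similarity scores
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                         # hide the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over visible tokens
    return weights @ v                             # weighted average of values

rng = np.random.default_rng(0)
T, d_model, d_head = 3, 768, 64
x = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) * 0.02 for _ in range(3))
out = causal_attention(x, Wq, Wk, Wv)
print(out.shape)   # (3, 64): one context-mixed vector per token
```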

One refinement worth naming. What I have described is a single attention pattern. GPT-2 actually runs twelve attentions in parallel inside each block — twelve independent sets of Q, K, and V matrices, each producing its own score matrix, each free to attend to different things. The 768-D embedding is split into twelve 64-D slices; each slice is handled by one head; the twelve outputs are then concatenated and projected back to 768-D. This is multi-head attention. The point is that one attention pattern is a bottleneck. Twelve patterns let the model track several relationships at once: one head can learn to follow syntactic dependencies, another to track which pronoun refers to which entity, another to align tense, another to bind subject and verb. Nobody hand-codes those roles. The training process discovers them. Twelve layers times twelve heads is one hundred and forty-four attention patterns over the same prefix, and the twelve heads within each layer run in parallel — which is, mechanically, part of what makes the architecture efficient to train at scale.
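The split itself is nothing more than a reshape. A sketch of the mechanics; in the real model the split is applied to the Q, K, and V projections rather than to the raw embeddings, and each slice then runs the attention above independently.

```python
import numpy as np

T, d_model, n_heads = 3, 768, 12
d_head = d_model // n_heads                                  # 64 dimensions per head
x = np.random.default_rng(1).normal(size=(T, d_model))

heads = x.reshape(T, n_heads, d_head).transpose(1, 0, 2)     # (12, 3, 64): twelve slices
# ... each slice would get its own Q, K, V and its own attention pattern ...
merged = heads.transpose(1, 0, 2).reshape(T, d_model)        # concatenate back to (3, 768)
print(np.allclose(merged, x))                                # True: split and merge are exact inverses
```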

MLP

After attention, each token's vector passes through a small two-layer neural network — the multi-layer perceptron. It expands the 768-D vector to 3072-D, applies a non-linearity, then contracts back to 768-D. This is where most of the raw computation happens; it is a token-by-token transformation, no cross-talk.

If attention asks which other tokens matter, the MLP decides what to do with what they said.
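In code, the whole sublayer is three lines. A PyTorch sketch with GPT-2 small's sizes; the non-linearity is GELU, as in GPT-2.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(768, 3072),   # expand: 768 -> 3072
    nn.GELU(),              # the non-linearity
    nn.Linear(3072, 768),   # contract: 3072 -> 768
)

x = torch.randn(1, 3, 768)   # three token vectors
print(mlp(x).shape)          # torch.Size([1, 3, 768]); applied per token, no cross-talk
```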

Residual and Norm

Two safety devices wrap every step. Residual connections add the input back to the output of each sublayer; this keeps information from being lost as the signal travels deep. Layer normalization rescales the activations so they don't drift to extreme values. Without these, the network would not train. With them, it is stable enough to stack twelve layers tall.

Repeat Twelve Times

Attention, MLP, residual, and norm together make one transformer block. GPT-2 small has twelve such blocks, stacked. Each block reads the output of the previous block, attends across positions again, mixes again, and passes its result up.

By the end of the stack, each token's 768-D vector has been rewritten twelve times, accumulating context from the entire prefix.
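Assembled, the block and the stack fit in a short sketch. This uses PyTorch's built-in multi-head attention as a stand-in for GPT-2's own implementation; the ordering follows GPT-2 (layer norm before each sublayer, residual add after), and the loop at the bottom is the twelve-block stack.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        T = x.shape[1]
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # hide the future
        h = self.ln1(x)                                   # norm before the sublayer
        a, _ = self.attn(h, h, h, attn_mask=causal)       # attention across positions
        x = x + a                                         # residual: keep what was already there
        x = x + self.mlp(self.ln2(x))                     # norm, MLP, residual again
        return x

blocks = nn.ModuleList(Block() for _ in range(12))        # GPT-2 small: twelve blocks
x = torch.randn(1, 3, 768)
for block in blocks:                                      # each block rewrites every token
    x = block(x)
print(x.shape)                                            # torch.Size([1, 3, 768])
```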

Project to Vocabulary

The final 768-D vector for the last token gets projected back into vocabulary space. The output is a 50,257-dimensional vector — one number per possible next token. Softmax converts those numbers to probabilities. The token with the highest probability is the model's best guess. Sampling chooses from the distribution, often with adjustments for temperature and top-k.

For our prefix, the distribution can be inspected directly.
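A sketch with the Hugging Face transformers library, which bundles the trained GPT-2 small weights; it runs the full forward pass described above and prints the five likeliest next tokens.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The answer is", return_tensors="pt")
with torch.no_grad():
    logits = model(**ids).logits              # shape [1, 3, 50257]
probs = logits[0, -1].softmax(dim=-1)         # distribution over the next token

top = probs.topk(5)
for p, i in zip(top.values, top.indices):
    print(repr(tok.decode([int(i)])), round(p.item(), 3))   # the five likeliest continuations
```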

Append, Repeat

The sampled token is appended to the input. The whole forward pass runs again — except now, with KV caching, every prior token's keys and values are reused from the previous pass and only the new token does fresh computation. This is what makes generation fast. Without the cache, every new token would recompute the entire prefix from scratch.
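The same library exposes the cache directly, so the loop is easy to see. A sketch of greedy generation with past_key_values: the first pass processes the whole prefix, every later pass feeds in only the newly chosen token.

```python
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The answer is", return_tensors="pt").input_ids
generated, past = ids, None

with torch.no_grad():
    for _ in range(10):
        out = model(ids, past_key_values=past, use_cache=True)
        past = out.past_key_values                            # cached keys and values for the prefix
        next_id = out.logits[0, -1].argmax().reshape(1, 1)    # greedy pick, for simplicity
        generated = torch.cat([generated, next_id], dim=1)
        ids = next_id                                         # only the new token runs fresh compute

print(tok.decode(generated[0]))
```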

Generation continues, one token at a time, until the model emits an end-of-sequence token or the user stops it.

Notice what is not in any of those steps. There is no representation of intent. There is no module for understanding. There is no place where the model decides what it believes or consults a model of the world independently of the text it was trained on. The whole machine is a sequence of matrix multiplications and one carefully chosen non-linearity, repeated.

What the model does extraordinarily well is encode the statistical regularities of the corpus it was trained on — including the regularities that look, from outside, like reasoning. What it cannot do is anything that requires a function not implementable as the composition of those operations.

One more wrinkle worth carrying. The capabilities that look most like reasoning — multi-step arithmetic, multilingual translation, code generation — do not appear gradually as model size increases. They appear suddenly, at specific scales. A model with one billion parameters cannot do a thing; a model with ten billion can. The property is called emergence, and it is the quiet reason capital allocation in this domain is structurally hard. Capabilities are step functions, not slopes, and the step locations are not predictable in advance. Anyone modelling AI capability gains as a smooth curve is modelling the wrong shape.

This is not a complaint. The machine does not need to think to be useful. It needs to participate.

The Structural Claim

We do not know what thinking is. We have never known. The 3.8-billion-year arc from bubble to large language model is a sequence of substrates, each of which performed something we recognize as cognition without ever satisfying our definition of it. Every previous moment in this arc has produced a class of investors who got the question wrong — who looked for the brain and missed the system, who chased capability and missed compounding, who waited for the answer to "is this thinking?" while the value was being captured by people who never asked.

This is that moment again. The companies that compound for the next several decades will not be the ones racing to define machine thinking. They will be the ones building the substrates on which thinking — whatever it turns out to be — happens at scale. Capital that asks "is it thinking?" will be late. Capital that asks "what is it letting humans do that they could not do alone?" will be early.

Machines do not think. They participate. The mistake was thinking that thinking was ever a solo act.
