Your phone sends a short text. Somewhere in a data center, an AI spins up and, in a blink, makes not one choice but billions of tiny ones—just to decide its next word. Yet it never “sees” the whole sentence. How does something so blind to the big picture sound so fluent?
You’re watching the reply appear on your screen, one fragment at a time, but under the hood something much smaller than “a word” is doing the real work. Modern models don’t actually think in words at all—they think in tokens. A token might be a whole word, half a word, punctuation, or even a single character, depending on how often it shows up in real text. Common pieces like “ing”, “pre”, or “tion” get their own entries, while rare names might be split into several chunks. Each of these chunks is turned into numbers, shuffled through layers of computation, and turned back into text. This matters because it sets a hard budget: every question and answer must fit inside a fixed token window. Go past that, and earlier parts of the conversation start to fall off the edge, changing what the model can “remember” and how it responds.
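The splitting idea is easy to see in miniature. Below is a toy tokenizer that does greedy longest-match against a tiny, made-up vocabulary; real tokenizers (such as BPE) learn their vocabularies from huge text corpora, but the effect is similar: frequent fragments become single tokens, everything else falls back to smaller pieces.

```python
# Toy vocabulary: common fragments get their own entries; anything
# unmatched falls back to single characters. Entirely invented for
# illustration -- real vocabularies hold tens of thousands of pieces.
VOCAB = {"token", "iz", "ation", "un", "believ", "able"}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation with single-char fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry starting at position i.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB or j == i + 1:  # single char always allowed
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("untokenization"))  # -> ['un', 'token', 'iz', 'ation']
print(tokenize("unbelievable"))    # -> ['un', 'believ', 'able']
```

Notice that the model never sees "unbelievable" as one unit; it sees three chunks, each of which counts against the token window.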
Behind the scenes, each token lives as a long list of numbers called a vector—a kind of GPS coordinate in a huge abstract space. Nearby coordinates tend to mean related ideas: “doctor” might sit closer to “nurse” than to “banana.” During training, the model repeatedly adjusts these coordinates so that contexts predicting similar continuations pull tokens into meaningful clusters. This is where “understanding” emerges: not from rules, but from patterns of proximity. When you ask a question, your prompt reshapes this space on the fly, nudging which regions become more likely for the next step.
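"Nearby" here has a precise meaning: vectors pointing in similar directions score high on cosine similarity. The sketch below uses invented 3-dimensional embeddings (real models use hundreds or thousands of dimensions, learned from data) just to show how proximity is measured.

```python
import math

# Hypothetical embeddings, hand-picked so that "doctor" and "nurse"
# point in similar directions while "banana" points elsewhere.
EMBED = {
    "doctor": [0.90, 0.80, 0.10],
    "nurse":  [0.85, 0.75, 0.20],
    "banana": [0.10, 0.20, 0.90],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine(EMBED["doctor"], EMBED["nurse"]))   # high (close neighbors)
print(cosine(EMBED["doctor"], EMBED["banana"]))  # much lower
```

Training never sets these numbers directly; they drift into place as the model is rewarded for predicting continuations correctly.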
Once your prompt has been nudged into that high‑dimensional landscape, the model has to do something brutally concrete: assign a probability to every possible next token it could output.
Here’s what actually happens at each step.
First, your existing tokens are slammed through many Transformer layers. Each layer lets tokens “look at” one another via attention, so the model can weigh which parts of the prompt matter most right now. The result is a dense internal summary of “given everything so far, what seems likely next?”
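That "looking at one another" is scaled dot-product attention. Here is a minimal single-query sketch in plain Python (real implementations are batched matrix operations with learned projections; the vectors here are toy values):

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention.
    The query scores every key; softmax turns scores into weights;
    the output is a weighted blend of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return output, weights

# The query aligns with the first key, so the first value dominates.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention([1.0, 0.0], keys, values)
print(w)  # first weight is the largest
```

Stack dozens of these layers, each with many attention heads, and you get the "dense internal summary" described above.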
That summary is then fed into a giant final matrix often called the output layer. Mathematically, it’s just a table of numbers connecting the model’s internal state to every token in its vocabulary. Multiplying by this matrix produces a raw score for each candidate token—tens of thousands at once.
Those raw scores are turned into probabilities using a softmax function: higher scores become higher probabilities, but every option still gets some nonzero chance. Tweaking temperature changes how sharply the model prefers top choices: low temperature hugs the favorites; higher temperature spreads probability mass further down the list, allowing more surprising continuations.
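The softmax-with-temperature step fits in a few lines. The logits below are invented scores for three candidate tokens; the point is how temperature reshapes the same scores into sharper or flatter probabilities.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Turn raw scores into probabilities.
    Lower temperature sharpens the distribution toward the top score;
    higher temperature flattens it, giving underdogs more chance."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]                 # made-up scores for 3 tokens
print(softmax(logits, temperature=0.5))  # top choice dominates
print(softmax(logits, temperature=2.0))  # probability spreads out
```

Note that even the lowest-scoring token keeps a nonzero probability; temperature only changes how much.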
Now comes sampling. The system doesn’t have to take the single most likely token. It can:

- pick strictly the top one (greedy decoding),
- sample according to the full distribution,
- or restrict itself to a “nucleus” of tokens whose probabilities add up to, say, 90% (top‑p sampling).
That choice controls style as much as content. Deterministic decoding tends to sound safe and repetitive; more stochastic strategies create variation, but also risk nonsense.
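Two of those strategies, greedy and top-p, can be sketched directly. The probabilities below are invented; the logic is the standard nucleus idea: keep only the smallest set of top tokens whose mass reaches p, then sample within it.

```python
import random

def greedy(probs: list[float]) -> int:
    """Greedy decoding: always the single most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_p_sample(probs: list[float], p: float = 0.9, rng=random) -> int:
    """Nucleus (top-p) sampling: truncate to the smallest set of
    tokens whose probabilities sum to at least p, then sample from
    that set in proportion to their original probabilities."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break
    r = rng.random() * total          # sample within the kept mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]

probs = [0.5, 0.3, 0.15, 0.05]        # toy distribution over 4 tokens
print(greedy(probs))                   # always index 0
print(top_p_sample(probs, p=0.9))      # varies, but never index 3
```

With p=0.9 the last token (probability 0.05) falls outside the nucleus, so it can never be chosen, no matter how many times you sample.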
Crucially, this process repeats. After one token is chosen, it’s appended to the context, everything is recomputed, and a fresh distribution over next tokens is produced. A short reply may require only a hundred such cycles; a long essay might require thousands, each step re‑evaluating context, recalculating scores, and rolling the probabilistic dice again.
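The repeat-until-done loop is the easiest part to show. The "model" below is a drastically simplified stand-in, an invented bigram table that conditions only on the last token, where a real Transformer re-reads the entire window, but the generation loop has the same shape: predict, sample, append, repeat.

```python
import random

# Invented bigram table: for each token, a distribution over next tokens.
# "<s>" starts a sequence; "</s>" ends it. Purely illustrative.
NEXT = {
    "<s>": [("the", 0.6), ("a", 0.4)],
    "the": [("cat", 0.5), ("dog", 0.5)],
    "a":   [("cat", 0.5), ("dog", 0.5)],
    "cat": [("sat", 0.7), ("</s>", 0.3)],
    "dog": [("sat", 0.7), ("</s>", 0.3)],
    "sat": [("</s>", 1.0)],
}

def generate(max_steps: int = 10, rng=random) -> str:
    """The core autoregressive loop: sample a next token from the
    current distribution, append it to the context, and repeat until
    an end token appears or the step budget runs out."""
    context = ["<s>"]
    for _ in range(max_steps):
        tokens, weights = zip(*NEXT[context[-1]])
        nxt = rng.choices(tokens, weights=weights)[0]
        if nxt == "</s>":
            break
        context.append(nxt)
    return " ".join(context[1:])

random.seed(0)
print(generate())  # e.g. "the cat sat" -- different seeds, different text
```

Swap the lookup table for billions of learned parameters and `max_steps` for a token budget, and this loop is, structurally, what streams your answer.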
When you ask for a recipe, the model doesn’t secretly pull a stored “lasagna post” from a database; it rebuilds a brand‑new version, token by token, each time. That’s why the same prompt can yield a tight bullet list once, and a story‑like answer the next, especially if temperature or sampling settings change behind the scenes.
You can see this most clearly in places where style matters more than facts. Ask three times for a startup pitch: each run leans on similar themes and jargon, but the phrasing, rhythm, even the hook can shift because different high‑scoring tokens get picked.
Like a chef tasting after every ingredient and adjusting seasoning on the fly, the model constantly re‑evaluates where it’s heading, one tiny decision at a time, until your final answer appears.
Future systems may plan larger chunks of text at once, mixing retrieval of facts with on‑the‑fly phrasing, so a model could both “read” a long medical record and draft a nuanced summary in one pass. That changes who does the slow work of sifting documents: you, or the model. It also raises new questions: if responses emerge from millions of tiny numerical nudges, where do we draw the line between a tool you steer and a collaborator you must actively supervise?
So when you watch a reply stream out, you’re really seeing thousands of microscopic bets resolving in real time. The twist is that small tweaks—like a longer prompt, a different temperature, or a few extra examples—can bend that cascade in useful ways. Your challenge this week: change one setting at a time and watch how the “voice” shifts.

