
Under the Hood: How LLMs Actually Work

A plain-English explanation of how transformers, tokens, context windows, and hallucinations work — and what that means for how you use AI.

beginner 15 min read Updated Mar 2026

This section is optional. You can skip it and go straight to picking a platform.

If you’re the kind of person who likes to understand how things work, this will give you a mental model that makes everything else click. If you just want to get to the practical stuff, that’s fine too.


How Transformers Work, in Plain English {#transformers}

Before 2017, AI language models worked like someone reading a book one word at a time, left to right. By the time they got to the end of a long sentence, they’d mostly forgotten the beginning.

Then came Transformers, and nothing was ever the same.

The Breakthrough: Self-Attention

The key innovation is called self-attention. Here’s what it does: every word in a sentence looks at every other word and figures out how relevant they are to each other. For an interactive visualization of how this works, see the Transformer Explainer project from Poloclub.

Consider this sentence: “The trophy didn’t fit in the suitcase because it was too big.”

What does “it” refer to? The trophy or the suitcase?

A self-attention mechanism lets the model see that “big” is more closely connected to “trophy” than to “suitcase” in this context, so “it” probably refers to the trophy. The model learns these patterns by seeing millions of similar examples during training.
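The mechanics can be sketched in a few lines. This is a toy illustration with random vectors, not a real model: in a trained transformer the Q, K, and V matrices are learned, and the embeddings carry meaning. The shape of the computation is the real thing, though: every word scores every other word, and each word's new representation is a weighted mix of the whole sentence.

```python
# A toy sketch of scaled dot-product self-attention. The vectors here are
# random, purely to show the shape of the computation -- a real model uses
# learned embeddings and learned weight matrices.
import numpy as np

rng = np.random.default_rng(0)

words = ["The", "trophy", "didn't", "fit", "in", "the", "suitcase",
         "because", "it", "was", "too", "big"]
d = 8                                      # tiny embedding size, for illustration
X = rng.normal(size=(len(words), d))       # one vector per word

# Queries, keys, and values come from (here: random) weight matrices.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)              # how relevant is word j to word i?
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row softmax

output = weights @ V                       # each word: a weighted mix of all words

# Each row of `weights` sums to 1: every word distributes its attention
# across the entire sentence, regardless of distance.
print(weights[words.index("it")].round(2))
```

In a trained model, the row for "it" would put most of its weight on "trophy". Here the weights are meaningless, but the wiring is identical: nothing stops token 9 from attending directly to token 2.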

Why This Was a Game-Changer

Before Transformers, models processed words sequentially. They struggled with long-range dependencies. By the time the model reached the end of a paragraph, it had lost track of the beginning.

Transformers process all words simultaneously, and every word can attend to every other word regardless of distance. This means:

  1. Parallel processing: Instead of word 2 waiting for word 1, the model processes everything at once. This made it possible to train on vastly more data.

  2. Long-range understanding: A word can directly connect to another word 500 tokens away. No more “out of sight, out of mind.”

  3. Context awareness: The model builds a rich representation of how each word relates to every other word in the input.

This architecture is the foundation of GPT, Claude, Gemini, and every other modern language model. When people say “large language model,” they mean “a Transformer that’s been trained on a lot of data.”


What Tokens Are and Why They Matter {#tokens}

You’ve probably heard that AI models work with “tokens” rather than words. Here’s what that actually means and why it matters.

Subword Tokenization: The Middle Path

Models don’t read text character-by-character, and they don’t read word-by-word either. They use something in between called subword tokenization. The most common approach is Byte Pair Encoding (BPE).

BPE works by starting with individual characters and gradually merging the most frequently occurring pairs. After enough merges on enough training data, you end up with a vocabulary where:

  • Common words are single tokens: “the”, “and”, “cat”
  • Less common words get broken into parts: “strawberry” might become [“straw”, “berry”] or [“st”, “raw”, “berry”] depending on the vocabulary
  • Very rare words get split into chunks: “unhappiness” might be [“un”, “happi”, “ness”]

This is a compromise between two extremes. Character-level models would need absurdly long sequences and wouldn’t capture meaning. Word-level models would need impossibly large vocabularies and couldn’t handle words they hadn’t seen before. Subword tokenization solves both problems. For a deeper dive into how tokenization works in practice, see OpenAI’s tokenizer.
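The merge loop itself is simple enough to sketch. This is a greatly simplified toy trainer over a five-word "corpus"; real tokenizers run thousands of merges over gigabytes of text, but the principle is the same: the most frequent adjacent pair becomes a new symbol.

```python
# A toy Byte Pair Encoding trainer: repeatedly merge the most frequent
# adjacent symbol pair across a tiny corpus. Greatly simplified.
from collections import Counter

def train_bpe(corpus: list[str], num_merges: int) -> list[str]:
    words = [list(w) for w in corpus]          # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent adjacent pair
        merges.append(a + b)
        for symbols in words:                  # apply the merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

print(train_bpe(["low", "lower", "lowest", "newest", "widest"], 4))
```

Frequent character runs like "lo" get merged first, and after enough rounds whole common words become single tokens, which is exactly why "the" is one token while "strawberry" may be several.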

Why Token Count Isn’t Word Count

This matters practically because AI tools charge per token and have token limits. But token count doesn’t map cleanly to word count.

  • English: roughly 0.75 words per token on average
  • Dense technical writing: might be 0.6 words per token (more complex vocabulary gets split more)
  • Code: much more variable; whitespace, punctuation, and identifiers often get their own tokens, so code tends to use more tokens per word than ordinary prose
  • Languages with different writing systems: Chinese is around 2-3 characters per token, while some Romance languages can be more token-efficient than English

The word “strawberry” is a classic example. Depending on the model’s tokenization, it might be one token, two tokens, or three. This isn’t trivia. It affects cost and context limits. You can see how any text gets tokenized using OpenAI’s interactive tokenizer tool.

Practical Implications

  1. Cost: If you’re paying per token, the same document might cost more or less depending on how the model tokenizes it.

  2. Context limits: A 200,000 token context window might hold roughly 150,000 words of typical English text, but closer to 120,000 words of dense technical content.

  3. Language differences: The same document will cost different amounts to process in different languages.

  4. Prompt efficiency: Writing “don’t” vs “do not” can change token count. “Hello, how are you?” is about 6 tokens. “Hi, how’s it going?” might be 7 or 8.
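For rough planning, the ratios above are enough for a back-of-envelope estimate. This helper is just that: a heuristic using the approximate words-per-token figures, not real tokenizer output. For exact counts, use a platform's own tokenizer tool.

```python
# A back-of-envelope token estimator using rough words-per-token ratios.
# These are heuristics, not real tokenizer output.
RATIOS = {
    "english": 0.75,    # ~0.75 words per token for typical English prose
    "technical": 0.6,   # dense technical vocabulary splits into more pieces
}

def estimate_tokens(word_count: int, kind: str = "english") -> int:
    return round(word_count / RATIOS[kind])

# Will a 150,000-word manuscript fit in a 200,000-token context window?
print(estimate_tokens(150_000))                 # ~200,000: right at the limit
print(estimate_tokens(120_000, "technical"))    # dense text hits it sooner
```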

You don’t need to obsess over this, but it explains why you sometimes hit context limits unexpectedly, and why the same prompt costs different amounts across different platforms. For more on managing costs across different platforms, see Cost Management.


Context Windows: The Model’s Working Memory {#context-windows}

Every AI model has a context window. This is the amount of information it can hold in its “head” at one time. Think of it as working memory.

How Context Windows Work

When you send a message to an AI, you’re not just sending that message. You’re sending the entire conversation history, plus any documents you’ve attached, plus any instructions you’ve given. All of this gets tokenized, and all of those tokens count against the context window.

Once you exceed the context window, the model literally cannot see the earliest parts of the conversation. They’ve fallen off the edge. Different platforms handle this differently. Some might keep the most recent messages and discard the oldest. Others might try to summarize. But the hard limit is real.
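The "keep the most recent, discard the oldest" strategy can be sketched in a few lines. The message texts and the crude word-based token counter below are stand-ins for illustration; real systems count with the model's actual tokenizer and often do something smarter, like summarizing before discarding.

```python
# A sketch of one common overflow strategy: drop the oldest messages until
# the conversation fits in the window. token_count is a crude stand-in.

def token_count(message: str) -> int:
    return round(len(message.split()) / 0.75)   # ~0.75 words per token

def trim_to_window(messages: list[str], window: int) -> list[str]:
    kept = list(messages)
    while kept and sum(token_count(m) for m in kept) > window:
        kept.pop(0)                 # the oldest message falls off the edge
    return kept

history = [
    "I'm planning a trip to Japan in April.",
    "Great! Cherry blossom season peaks in early April.",
    "What should I pack?",
]
print(trim_to_window(history, window=20))   # the first message is gone
```

Notice what the trimmed model can no longer see: the trip to Japan. This is exactly why long conversations sometimes "forget" what you said at the start.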

Current Context Window Sizes (February 2026)

Context windows have grown enormously. Here’s where things stand:

  • Gemini 3.1 Pro: 1 million tokens standard. That’s roughly 750,000 words, or enough to hold 10-15 full-length academic papers in one go.

  • Claude: 200,000 tokens standard for Sonnet 4.6. Opus 4.6 has a 1 million token context window in beta for enterprise and high-tier users. That’s about 750,000 words, or entire codebases.

  • GPT-5.2: Specific context window sizes vary by model variant, but generally range from 128,000 tokens for everyday use up to 1 million tokens for specialized applications.

To put this in perspective: a typical book is about 100,000 words. A 1 million token context window can hold 7-8 books in memory at once.

Why Context Windows Matter

  1. Document processing: Large context windows mean you can drop an entire PDF, legal contract, or research paper into the conversation and ask questions about it without chunking it manually.

  2. Conversation memory: More context means the model remembers more of your earlier conversation. This matters for long, detailed discussions.

  3. Code analysis: Developers can feed entire codebases to the model and ask it to make changes that are consistent across the whole project. See Agentic AI or Building Apps Without Coding for more on code-focused AI workflows.

  4. Multi-step tasks: Agents working through complex tasks need to remember instructions, intermediate results, and the overall goal. Large context windows make this more reliable.

The Plateau

Context windows grew aggressively through 2024 and 2025, but we’re seeing a plateau around 1 million tokens. There are diminishing returns to making them larger, and there are technical challenges. The model gets slower as context grows, and the “needle in a haystack” problem, where the model struggles to find specific information buried in a huge context window, doesn’t fully go away.

For most users, 200,000 tokens is plenty. If you’re routinely exceeding that, you probably know who you are.


How LLMs Generate Text {#text-generation}

Here’s something most people don’t realize: language models don’t plan ahead. They don’t draft an outline and then fill it in. They generate text one token at a time, based on probability.

One Token at a Time

The process looks like this:

  1. You give the model a prompt: “Once upon a”

  2. The model calculates the probability of every possible next token in its vocabulary. Given the training it has seen, “time” is highly probable. “dark” is also probable. “banana” is extremely improbable.

  3. It samples from that probability distribution. If the temperature is low, it almost always picks the most likely option. If the temperature is higher, it might pick something less likely but more interesting.

  4. Now the prompt is “Once upon a time”. The model repeats the process for the next token.

  5. This continues until the model decides the response is complete, or until it hits a token limit.

The model is always just predicting the next token based on everything that came before. It has no concept of the whole. It doesn’t know where it’s going. It’s just predicting what comes next, one step at a time, over and over.
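The loop above can be shown as code. The "model" here is a hand-written table of made-up next-token probabilities, and it only looks at the last token; a real LLM computes the distribution with a transformer over the entire context. But the generation loop itself, predict, sample, append, repeat, is faithful.

```python
# A toy version of the predict-sample-append loop. The probability table is
# hypothetical and conditions only on the last token; a real model conditions
# on everything in the context window.
import random

random.seed(0)

NEXT = {  # made-up next-token probabilities, for illustration only
    "Once":   {"upon": 0.9, "more": 0.1},
    "upon":   {"a": 1.0},
    "a":      {"time": 0.8, "dark": 0.19, "banana": 0.01},
    "time":   {"<end>": 1.0},
    "dark":   {"<end>": 1.0},
    "more":   {"<end>": 1.0},
    "banana": {"<end>": 1.0},
}

def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = NEXT.get(tokens[-1], {"<end>": 1.0})
        choices, probs = zip(*dist.items())
        nxt = random.choices(choices, weights=probs)[0]  # sample next token
        if nxt == "<end>":
            break
        tokens.append(nxt)          # append and repeat -- no planning ahead
    return tokens

print(" ".join(generate(["Once", "upon", "a"])))
```

Run it a few times without the seed and you'll usually get "time", occasionally "dark", and almost never "banana". That is sampling from a distribution: nothing more deliberate is happening.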

Temperature: Controlling Creativity vs. Reliability

Temperature is a parameter that controls how “random” the model’s token selection is. For more details on how temperature affects outputs in practice, see prompt engineering techniques.

  • Temperature 0.0: The model always picks the single most likely token. This is maximum determinism. Good for code generation, factual responses, anything where you want consistency.

  • Temperature 0.7: The model introduces some randomness. It might pick the third or fourth most likely token sometimes. This makes the output more varied and interesting. Good for creative writing, brainstorming, anything where you want exploration.

  • Temperature 1.0+: The model picks less likely tokens more frequently. Output becomes more diverse but also more likely to drift. Good for highly creative tasks where weirdness is a feature, not a bug.

For platform-specific documentation on temperature and other sampling parameters, see Claude’s documentation or OpenAI’s API reference.

Here’s the key: the model isn’t deciding “I should be creative.” It’s just randomly sampling from a probability distribution. All the intelligence is in that distribution, which the model learned during training. Temperature just determines how tightly you stick to the most probable path.
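Mechanically, temperature is a single division: the raw scores (logits) are divided by the temperature before being turned into probabilities. The logit values below are invented for illustration, but the math is the standard softmax-with-temperature used across platforms.

```python
# Temperature reshapes the next-token distribution: divide the logits by the
# temperature, then softmax. Lower temperature sharpens, higher flattens.
# (Temperature 0 is the limit case: just take the argmax.)
import math

def softmax_with_temperature(logits: dict[str, float], temp: float) -> dict[str, float]:
    scaled = {tok: l / temp for tok, l in logits.items()}
    m = max(scaled.values())                    # subtract max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"time": 5.0, "dark": 3.0, "banana": -2.0}   # hypothetical raw scores

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

At temperature 0.2 nearly all the probability piles onto "time"; at 1.5 "dark" becomes a live option and even "banana" gets a sliver. Same model, same logits; only the sampling changes.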

The Implications

This one-token-at-a-time generation explains some things about how models behave:

  1. No lookahead: The model doesn’t know that the token it’s about to pick will force it into a corner three tokens from now. It just knows what’s probable right now.

  2. Repetition loops: If the model picks a token that makes the previous token highly probable again, it can get stuck in loops. “And then and then and then.”

  3. Loss of coherence: Long generated text can drift because the model loses track of the overall structure. It’s always just looking at what came immediately before.

  4. Why instructions matter: Clear instructions at the start of the prompt shape the probability distribution for every subsequent token. The model “remembers” your instructions because they’re in the context window.

Everything about how these models work is filtered through this simple loop: predict the next token, append it, repeat.


Why Hallucinations Happen {#hallucinations}

AI models confidently make things up. This is called hallucination, and it’s not a bug. It’s a direct consequence of how the technology works. For practical strategies on detecting and handling hallucinations, see Managing AI Output Quality.

Not a Database, a Pattern Matcher

When you ask a model a factual question like “What year did the first iPhone come out?”, the model is not looking up the answer in a database. It’s generating text based on patterns it saw during training.

During training, the model saw documents that mentioned the iPhone and its release year. It learned associations between “iPhone”, “first”, “release”, “2007”, and other related concepts. When you ask it about the iPhone release, it generates text that fits the pattern.

Most of the time, this works fine. The pattern matches reality. But sometimes the model stitches together patterns that don’t correspond to actual facts. It might say the iPhone came out in 2006, because it saw 2006 in a lot of similar contexts during training. The pattern feels right, even though it’s wrong. For more on hallucination causes and research, see Lakera AI’s resources on AI hallucinations or IBM’s research on LLM reliability.

No Built-In Fact-Checking

Models have no mechanism to verify whether what they’re generating is true. They’re just continuing patterns. The training process optimizes for text that looks like the training data, not for factual accuracy.

If the training data contained misinformation, the model learned those patterns too. If you ask about something that wasn’t in the training data at all, the model will still generate something that sounds plausible. It can’t say “I don’t know” unless it specifically learned to do that during training.

The Incentive to Guess

In 2025, OpenAI published research showing that models are often incentivized to guess rather than admit uncertainty. During training, models are rewarded for producing plausible completions. Saying “I don’t know” doesn’t look like the training data, which rarely contains such disclaimers. So the model learns that confident guessing is more likely to be rewarded.

This is a fundamental issue with how these models are trained. They’re rewarded for fluency and plausibility, not for accuracy or humility.

Types of Hallucinations

Hallucinations show up in several forms:

  1. Factual fabrication: The model states facts that are simply wrong. Dates, names, events that never happened. Often, these are things that sound plausible in context.

  2. Logical inconsistency: The model contradicts itself. In one paragraph it says X. In the next paragraph it says not-X.

  3. Context confusion: The model mixes up different parts of what you’ve told it. You say “My cat is named Max” and later “My cat is named Spot” and it happily uses both names interchangeably.

  4. False confidence: The model sounds completely certain while being completely wrong. This is the most dangerous kind of hallucination because it’s hard to spot.

Why This Matters

You cannot treat these models as oracles. They’re not. They’re pattern-matching engines that sometimes produce accurate information and sometimes produce convincing nonsense. The only way to know the difference is to verify.

For low-stakes tasks, hallucination is annoying but not catastrophic. For high-stakes tasks, it’s potentially dangerous. Medical advice, legal guidance, financial decisions, anything where being wrong has real consequences, you need to verify independently. For research and fact-checking workflows, see AI for Research.

This isn’t going away. It’s intrinsic to how these models work. The frontier models have gotten better at reducing hallucinations, but they haven’t eliminated them, and they probably can’t. For practical verification strategies, see Managing AI Output Quality.


Putting It All Together {#summary}

If you take one thing from this section, it should be this: everything connects back to tokens.

  • Transformers process all tokens in parallel, letting every token attend to every other token.

  • Context windows are token limits. More tokens equals more working memory.

  • Generation happens one token at a time, with each token chosen based on probability.

  • Tokenization determines cost, context efficiency, and sometimes even performance across languages.

The other big picture: probability, not certainty. These models don’t know things. They assign probabilities to tokens based on training data. This is why they’re powerful and why they hallucinate. Both come from the same source.

Understanding this helps set realistic expectations. These tools are incredibly capable within their design envelope. They’re not going to become reliable fact-checkers or oracles. They’re not supposed to be.

The practical skill is knowing where they’re reliable and where they aren’t. Use them for drafting, brainstorming, summarizing, explaining, coding. Verify anything factual. Double-check anything important. Treat them as a collaborator who’s fast and knowledgeable but sometimes confidently wrong.

That’s the mental model. Now let’s get into the platforms and the practical skills.