DEEP DIVE

How LLMs Work

Tokens, training, context windows, hallucinations. How language models actually process your questions and generate answers. No PhD required.


What is a language model?

[Infographic: how an LLM processes a question through tokenization, pattern matching, and next-token prediction]

A Large Language Model (LLM) is software that has learned the statistical patterns of language by reading billions of pages of text. It does not understand your question the way a person does. It predicts what text is most likely to come next, based on everything it has seen before. Think of it as the world's most sophisticated autocomplete. When you type a message on your phone and it suggests the next word, that is the same principle, scaled up by a factor of several billion.

The 'Large' in LLM refers to the number of parameters: the internal dials the model uses to make predictions. GPT-4 has an estimated 1.8 trillion parameters. Claude, Gemini, and Llama have hundreds of billions each. More parameters generally means the model can capture more nuanced patterns, but it also means more compute, more cost, and more energy.

KEY INSIGHT

An LLM does not look up answers in a database. It generates them word by word based on probability. This is both its strength (it can be creative, flexible, conversational) and its weakness (it can be confidently wrong).

Tokens: how AI reads text

LLMs do not read words the way you do. They split text into tokens: chunks that can be whole words, parts of words, or even single characters. Common words like 'the' or 'and' are one token. Longer or rarer words get split into pieces. The word 'understanding' becomes two or three tokens depending on the model.

EXAMPLE: HOW TEXT BECOMES TOKENS

The sentence "AI is transforming logistics" splits into:

AI | is | transform | ing | log | istics

'AI' and 'is' stay whole words; 'transforming' and 'logistics' each split into sub-word pieces. 4 words became 6 tokens.

On average, one English word equals roughly 1.3 tokens. A page of text (about 750 words) is around 1,000 tokens. This matters because everything in the AI world is priced and measured in tokens: how much text you can send, how long a reply you get, and what it costs. When a model says it has a '200K context window,' that means roughly 150,000 words.

WHY IT MATTERS

Non-English languages are typically less token-efficient. A Dutch sentence often costs 20-40% more tokens than the English equivalent, because tokenizers were primarily trained on English text. This means higher costs and faster context exhaustion when working in Dutch.
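
You can see this for yourself with OpenAI's open-source tiktoken tokenizer. A minimal sketch (exact splits and counts vary by tokenizer, and the Dutch sentence is a rough equivalent, so treat the numbers as illustrative):

```python
# Counting tokens with tiktoken (pip install tiktoken).
# Splits and counts vary by tokenizer; these numbers are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4-era tokenizer

english = "AI is transforming logistics"
dutch = "AI transformeert de logistieke sector"  # rough Dutch equivalent

for text in (english, dutch):
    tokens = enc.encode(text)
    print(f"{len(text.split())} words -> {len(tokens)} tokens: "
          f"{[enc.decode([t]) for t in tokens]}")
```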

Training: from raw text to useful assistant

Building an LLM happens in two major phases. Understanding these explains a lot of the behavior you see when you use one.

THE TWO PHASES OF LLM TRAINING

PHASE 1 -- Pre-training: read the internet. Trillions of tokens from books, websites, code, scientific papers. Learn the patterns of language.

PHASE 2 -- Fine-tuning: learn to be helpful. Human trainers write example conversations and rate responses. The model learns what good answers look like.

Pre-training is the expensive part: it takes months on thousands of specialized GPUs and costs tens of millions of dollars. This is where the model absorbs its 'knowledge.' But a pre-trained model is not yet useful. It can finish sentences, but it does not know how to have a conversation or follow instructions.
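
The objective behind pre-training is simple to state: given the tokens so far, assign high probability to the token that actually comes next, and pay a penalty otherwise. A toy sketch of that loss, with made-up numbers:

```python
# The pre-training objective in miniature: cross-entropy on the next token.
# The probabilities below are made up for illustration.
import numpy as np

# Model's guesses for the token after "The cat sat on the"
probs = {"mat": 0.60, "floor": 0.25, "moon": 0.01}
actual_next = "mat"

loss = -np.log(probs[actual_next])   # low when the model guessed well
print(round(loss, 3))                # 0.511; guessing "moon" would cost ~4.6
```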

Fine-tuning teaches the model to be an assistant. Techniques like RLHF (Reinforcement Learning from Human Feedback) train the model to prefer helpful, harmless, and honest responses. This is why ChatGPT can chat with you, while the raw GPT-4 base model would just continue whatever text you started.

KNOWLEDGE CUTOFF

Because training data is collected at a fixed point in time, every model has a knowledge cutoff date. Events after that date are unknown to the model unless it can search the web. This is why AI sometimes gives outdated answers, and why web-connected AI assistants are more reliable for current information.

Context windows: AI's short-term memory

The context window is the total amount of text an LLM can hold in its 'working memory' during a single conversation. Everything you send and everything it replies with counts toward this limit. Once the window is full, the model starts losing the beginning of the conversation.
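
A sketch of that rolling-window behavior, with a stand-in token counter (real systems measure with the model's own tokenizer, and many products summarize rather than simply drop old turns):

```python
# Keep only the most recent messages that fit inside the token budget.
# `count_tokens` is a stand-in; real systems use the model's tokenizer.
def fit_to_window(messages, count_tokens, budget):
    kept, used = [], 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > budget:
            break                        # this message and everything older drops
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = ["turn 1: long setup", "turn 2: details", "turn 3: latest question"]
print(fit_to_window(history, count_tokens=lambda m: len(m.split()), budget=7))
# -> the oldest turn falls out first once the budget is exceeded
```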

CONTEXT WINDOW SIZES (AS OF EARLY 2026)
Gemini 2.5 Pro / Claude Sonnet 4.6 -- 1,000,000 tokens (~750,000 words)
GPT-5.4 -- 272,000 tokens (~200,000 words)
Claude Opus 4.6 -- 200,000 tokens (~150,000 words)
GPT-4o -- 128,000 tokens (~96,000 words)

Here is the catch: advertised context sizes are not the full story. Research consistently shows that model performance degrades as the context fills up. A model with a 200K token window typically becomes unreliable around 130K tokens. Think of it like a desk: it might be 2 meters long, but the part you can actually work on is smaller.

PRACTICAL TIP

For long conversations, start fresh sessions instead of extending one thread forever. The model performs best when the context is not saturated. If you paste a long document, keep your instructions concise to leave room for the model to think.

How it generates text: next-token prediction

When you send a prompt, the model does not read your message and then compose a reply the way a person would. It processes your entire input at once, then generates the response one token at a time. For each token, it calculates probability scores for every possible next token in its vocabulary (typically 50,000 to 200,000 options) and picks one.

GENERATION: ONE TOKEN AT A TIME
1. Your prompt -- all tokens processed in parallel
2. Calculate -- score a probability for every possible next token
3. Pick one -- select the next token based on scores + temperature
4. Repeat -- add the token to the output and go back to step 2

This is why AI responses appear in the chat word by word, as if the model were typing: it really is generating each piece in sequence. It is also why the model cannot 'go back and fix' something mid-sentence without special prompting. Each token is committed the moment it is generated.
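
A toy version of that loop, with a stand-in model function that returns one score per vocabulary entry (a real LLM computes those scores with a neural network):

```python
# A toy next-token generation loop. `model` is a stand-in that returns one
# score (logit) per vocabulary entry for the tokens so far.
import numpy as np

END_OF_TEXT = 0  # hypothetical stop-token id for this toy vocabulary

def generate(model, prompt_tokens, max_new_tokens=20, temperature=0.7):
    tokens = list(prompt_tokens)                      # step 1: whole prompt at once
    for _ in range(max_new_tokens):
        logits = np.asarray(model(tokens)) / temperature  # step 2: score every token
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                          # softmax -> probabilities
        next_token = int(np.random.choice(len(probs), p=probs))  # step 3: pick one
        tokens.append(next_token)                     # step 4: committed; repeat
        if next_token == END_OF_TEXT:
            break
    return tokens

# Stand-in "model": random scores over a 10-token vocabulary, just to run the loop.
print(generate(lambda toks: np.random.randn(10), prompt_tokens=[5, 3]))
```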

Temperature: creativity vs. accuracy

Temperature is a setting that controls how the model picks from its probability scores. Low temperature (close to 0) means it almost always picks the highest-probability token, making output predictable and factual. High temperature (close to 1 or above) means it explores lower-probability options, making output more creative and varied, but also more prone to errors.

TEMPERATURE SCALE
0.0 -- Deterministic
0.3 -- Factual
0.7 -- Balanced
1.0+ -- Creative

Most AI chatbots default to 0.7 -- a balance between coherence and flexibility
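
You can see the effect by applying different temperatures to the same scores. A sketch with four made-up token logits:

```python
# How temperature reshapes the same four (made-up) token scores.
import numpy as np

def softmax(logits, temperature):
    scaled = np.array(logits) / temperature
    exp = np.exp(scaled - scaled.max())   # subtract max for numerical stability
    return exp / exp.sum()

logits = [4.0, 3.0, 2.0, 1.0]
for t in (0.1, 0.7, 1.5):
    print(t, softmax(logits, t).round(3))
# 0.1 puts nearly all probability on the top token (near-deterministic);
# 1.5 spreads it across the alternatives (creative, riskier).
```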

Hallucinations: when AI makes things up

Because LLMs generate text based on probability rather than looking up facts, they sometimes produce information that sounds plausible but is simply false. The AI community calls this 'hallucination.' It is arguably the biggest practical challenge of using LLMs today.

Hallucination rates vary widely across models. The best-performing models in early 2026 achieve rates below 1% on standardized benchmarks, while less capable models hallucinate in up to 30% of responses. In specialized domains like medicine or law, even well-performing models can fabricate references in a significant portion of their outputs.

Fabricated facts -- stating statistics, dates, or claims that do not exist. The model is pattern-matching, not fact-checking.

Fake sources -- inventing book titles, URLs, research papers, or author names that look real but do not exist.

Confident errors -- presenting wrong information with the same confident tone as correct information. No hesitation, no caveats.

HOW TO SPOT THEM

Be suspicious of very specific claims (exact percentages, specific dates, named sources) in domains where you cannot verify. Ask the model for its sources. Cross-check critical facts with a web search. If the model says 'a study found that...' without naming the study, assume it may be fabricated until you verify it yourself.

What this means for how you prompt

Understanding how LLMs work under the hood directly improves how you use them. Here are the practical takeaways.

Be specific, not vague. Since the model predicts the next most likely token, vague prompts lead to generic responses. The more context and specifics you provide, the more the probability distribution shifts toward useful output. 'Write me a marketing email' produces average text. 'Write a marketing email for a logistics company launching a route optimization tool, targeting operations managers, tone: professional but not stiff' produces something you can actually use.

Think about token budget. Every word you send eats into the context window. Do not paste your entire document if you only have a question about one paragraph. Be concise in your instructions, but thorough in your context. This is not a contradiction: it means cutting filler while keeping the information the model needs to do its job.

Verify, do not trust. Knowing that LLMs generate text probabilistically, not factually, should permanently change your relationship with AI output. Use it as a first draft, a starting point, a thinking partner. Never use it as a source of truth without checking. The people who get the most value from AI are the ones who treat it as a capable but fallible collaborator.

Structure helps. LLMs respond well to structure in prompts because structured text has clear patterns. Use numbered lists for multi-step instructions. Use headers to separate sections. Tell the model what format you want the output in. The clearer the pattern you establish, the better the model continues it.
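
To make that concrete, here is the marketing-email prompt from above expressed as a reusable template. The field names are one reasonable choice, not a required format:

```python
# A structured prompt template; the fields are one reasonable choice,
# not a required format. Clear structure gives the model a clear pattern.
def build_prompt(task, audience, context, tone, output_format):
    return (
        f"Task: {task}\n"
        f"Audience: {audience}\n"
        f"Context: {context}\n"
        f"Tone: {tone}\n"
        f"Output format: {output_format}\n"
    )

print(build_prompt(
    task="Write a marketing email",
    audience="Operations managers",
    context="A logistics company launching a route optimization tool",
    tone="Professional but not stiff",
    output_format="Subject line, three short paragraphs, one call to action",
))
```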