What is an LLM? How do Large Language Models work?

Every time you ask ChatGPT to fix your code, get Gemini to summarise a meeting, or let Claude draft an email, you're talking to an LLM. But most explanations of what an LLM actually is either drown you in math or stay so vague they're useless.

I've spent the last two years building with these tools, watching them fail in production, and figuring out why. This article is the plain English breakdown I wished existed when I started. By the end, you'll know exactly what an LLM is, how it works step by step, why it sometimes makes things up, and which ones are worth your time in 2026.

No PhD needed.

What Is an LLM?

💡 Quick Answer: An LLM (Large Language Model) is an AI system trained on massive amounts of text to understand and generate human language.

It works by predicting what word comes next, billions of times across trillions of sentences, until it gets remarkably good at sounding like it understands you.

ChatGPT, Claude, Gemini, Grok, and DeepSeek are all LLMs. The "large" refers to both the training data (trillions of words) and the model size (billions of internal parameters).

Think of it this way. Imagine you read every book, article, forum post, and website ever written. After a few years, you'd get very good at predicting what comes after "The capital of France is ___". You wouldn't necessarily understand Paris. But you'd know the answer.

That's roughly what an LLM does, at a scale no human could match.

The term breaks down simply:

Large, trained on trillions of words, with billions of internal settings (parameters)
Language, it works with text: reading it, understanding patterns, generating responses
Model, it's a mathematical system, not a database or search index

The technical name for the architecture underneath almost every LLM today is the Transformer, introduced in a 2017 Google paper called "Attention Is All You Need." Every major model since then, from GPT to Claude to LLaMA, runs on a version of this same architecture.

LLM Data Flow - Large Language Model Data Processing

How Does an LLM Actually Work?

💡 Quick Answer: An LLM works in two phases.

First, training: it reads trillions of words and adjusts billions of internal numbers (weights) to get better at predicting the next word.

Second, inference: when you send a message, it converts your text into numbers, passes them through dozens of layers of calculations, and generates a response one word at a time.

Each word it generates depends on everything that came before it in your conversation.

Here's the step-by-step of what happens when you type something and hit enter:

Step 1: Your text becomes tokens The model doesn't read words, it reads chunks called tokens. "ChatGPT" might be one token. "uncharacteristically" might be four. More on this in the next section.

Step 2: Tokens become numbers (embeddings) Each token gets converted to a long list of numbers that encodes its meaning. Words with similar meanings end up as similar numbers. "king" and "queen" end up closer together than "king" and "refrigerator."

Step 3: Attention happens The model figures out which words should influence which other words. In the sentence "The cat sat because it was tired", it needs to figure out that "it" means the cat, not the mat. Self-attention is the mechanism that resolves this.

Step 4: Layers of processing Your input passes through dozens (sometimes hundreds) of layers, each one refining the model's understanding of meaning and context.

Step 5: The model predicts the next token After all that processing, the model outputs a probability score for every token in its vocabulary. "Paris" might get 87%. "London" might get 6%. It samples from these probabilities to pick the next word.

Step 6: Repeat The chosen word gets added to the conversation, and the whole process runs again for the next word. This is why responses stream out word by word, because they're literally being built one token at a time.

LLM Working Diagram - Large Language Model Working Diagram with Explantion

What Are Tokens and Why Do They Matter?

💡 Quick Answer: Tokens are the basic units an LLM processes, not words, but chunks of text. A token is roughly 3-4 characters on average in English. For Example "running" is 1 token. "uncharacteristically" is 4 tokens.

This matters because LLMs have a token limit (called a context window), which is the total amount of text they can hold in memory at once.

Claude's context window is 200,000 tokens. GPT-4o's is 128,000. Go over the limit and the model starts forgetting the beginning of your conversation.

Most people don't think about tokens until something breaks. Here's when it actually matters:

Context window limits. When you paste a 50-page PDF into an LLM and it starts giving weird answers halfway through, it's often because you've hit the context limit and the model is no longer seeing the beginning of the document.

Non-English languages use more tokens. English is efficient to tokenize. Hindi, Arabic, and Japanese often take 2-3x more tokens for the same information. This means the same conversation costs more and fits less into the context window.

Code and numbers tokenize oddly. The number "10000" might be 3-4 tokens. This is one reason LLMs are unreliable at arithmetic, they're not processing digits the way a calculator would.

A rough rule: 1,000 tokens is about 750 English words.

llm claude chat bot hit limit error diagram gif

How Is an LLM Trained?

💡 Quick Answer: LLM training happens in three stages. Pretraining: the model reads trillions of words and learns to predict the next token (this costs millions of dollars and takes months).

Fine-tuning: it's trained on instruction-response examples to learn how to be helpful.

Alignment (RLHF): human raters compare responses and the model learns to produce answers people actually prefer. GPT, Claude, and Gemini all go through these three stages before you ever talk to them.

Think of training an LLM like training a very talented but socially inexperienced person for a job.

Phase 1: Pretraining (Learning everything) The model reads a significant chunk of the internet, Common Crawl, Wikipedia, GitHub, books, research papers, and tries to predict the next word at every single step. When it's wrong, it adjusts its internal numbers slightly. Do this across 15 trillion tokens and you get a model that has absorbed an enormous amount of human knowledge.

This phase costs tens of millions of dollars for frontier models. GPT-4 reportedly cost over $100 million to train.

Phase 2: Supervised Fine-Tuning (Learning to be useful) The raw pretrained model is weird. Ask it a question and it might generate three more questions instead of answering. So trainers feed it thousands of instruction-response pairs:

Instruction: "Explain recursion simply."
Response: "Recursion is when a function calls itself..."

After enough of these, the model learns the pattern: question in, helpful answer out.

Phase 3: RLHF, Reinforcement Learning from Human Feedback (Learning what humans prefer) This is what separates a helpful assistant from a technically-capable but annoying model. Human raters look at pairs of responses and choose the better one. A reward model learns from these choices. Then the LLM is trained to produce outputs the reward model scores highly.

This is why Claude and ChatGPT decline certain requests, hedge their answers, and try to be balanced, those are all behaviours reinforced during RLHF.

What Makes an LLM Different from a Search Engine?

💡 Quick Answer: A search engine finds and ranks existing pages. An LLM generates new text from scratch based on patterns in its training data. Google returns links to pages that already exist. ChatGPT writes a response that has never existed before.

The tradeoff: search engines can show you the source and let you verify it. LLMs generate confident-sounding text that may or may not be accurate, and they don't always tell you which.

This is probably the most important thing to understand about using LLMs effectively.

	Search Engine	LLM
What it does	Finds existing pages	Generates new text
Source of information	Live web index	Training data (frozen)
Can you verify it?	Yes, click the link	Not directly
Up-to-date information	Yes	Only with web search tools
Best for	Finding a specific thing	Explaining, writing, reasoning
Failure mode	Returns irrelevant results	Generates confident wrong answers

I learned this the hard way. I once asked an LLM to find me a specific npm package and it gave me a package name that didn't exist, with a confident description of what it does. The package sounded real. It wasn't. That's called Hallucination, and it's important enough to get its own section.

The mental model that helps: Google is a librarian. An LLM is a very well-read friend. The librarian finds you the actual book. The friend tells you what they remember from reading it. Usually accurate. Sometimes not.

Every Major LLM in 2026: ChatGPT, Claude, Gemini, and More

💡 Quick Answer The major LLMs in 2026 are: ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), Grok (xAI), DeepSeek (DeepSeek AI), LLaMA (Meta), Mistral, and Cursor (coding-focused). Each has different strengths. ChatGPT is the most widely used. Claude is strong at long documents and nuanced reasoning. Gemini has real-time Google integration. DeepSeek and LLaMA are open-weight models anyone can run locally. Cursor is built specifically for coding workflows.

Here's the honest breakdown of who's who:

ChatGPT (OpenAI)

The one that started the mainstream wave in late 2022. GPT-4o is the current flagship, handles text, images, voice, and files. Best all-rounder for everyday use. Has the largest user base and the most third-party integrations.

Best for: General tasks, writing, coding help, broad audience

Claude (Anthropic)

Built by ex-OpenAI researchers with a focus on safety and nuanced reasoning. Genuinely excellent at long documents. Its 200,000 token context window means it can read an entire book and answer questions about it. Claude tends to be more careful with its answers than GPT.

Best for: Long-form writing, document analysis, coding, nuanced reasoning

Gemini (Google DeepMind)

Google's answer, with native integration into Search, Docs, Gmail, and Maps. Gemini's advantage is real-time web access baked in. Strong on multimodal tasks. Show it an image, a video, or a file and it handles it well.

Best for: Research with live web data, Google Workspace users, multimodal tasks

Grok (xAI)

Elon Musk's model, built into X (Twitter). Has real-time access to X's data stream, which makes it genuinely useful for following current events and public conversations. Less mature than GPT and Claude but improving quickly.

Best for: Real-time news, social media analysis, X users

DeepSeek

Chinese open-weight model that made headlines in early 2025 for matching frontier performance at a fraction of the training cost. DeepSeek-R1 is particularly strong at reasoning and math. Open weights means you can run it locally or fine-tune it yourself.

Best for: Math, coding, reasoning tasks, developers who want to run models locally

LLaMA (Meta)

Meta's open-source model family. LLaMA 3 is powerful enough to run on a good laptop. The main draw is privacy and control, meaning your data never leaves your machine.

Best for: Privacy-conscious use, developers, fine-tuning for specific domains

Mistral

French company building efficient, open-weight models. Mistral 7B punches way above its weight for its size. Good if you need fast, cheap inference.

Best for: Lightweight deployment, API cost-sensitive use cases

Cursor

Not a general-purpose LLM but worth including here. Cursor is a code editor built on top of Claude and GPT that understands your entire codebase as context. It's changed how a lot of developers work, including me.

Best for: Developers who want AI assistance while actually writing code

Why Do LLMs Hallucinate?

💡 Quick Answer: LLMs hallucinate because they generate statistically plausible text, not verified facts. The model doesn't have a "fact-checking" step, it just predicts what sounds right based on patterns in training data. When the training signal for a specific fact is weak (obscure names, specific dates, niche statistics), the model can still generate a confident wrong answer because confidence is a learned behaviour, not a signal of accuracy.

This is the thing that bites everyone eventually.

I've had an LLM cite a research paper that didn't exist. Give me a function from a library that was never in the docs. Describe an event that never happened. All with complete confidence.

Here's why it happens:

The model was trained to generate text that sounds like the kind of text it saw during training. Academic papers, Wikipedia, news articles, all written in confident, declarative sentences. So the model learned to write in confident, declarative sentences. It never learned to signal "I'm not sure about this specific fact."

Hallucination is worst when:

You ask about specific dates, statistics, or numbers
You ask for citations or sources (it will invent them)
The topic is niche or post-training-cutoff
You ask about a specific person who isn't very famous

Hallucination is less common when:

The fact appears millions of times in training data
The answer requires reasoning from context you've provided
You're asking it to explain a concept rather than recall a specific fact

The practical fix: for anything where accuracy matters, use an LLM with web search turned on, or provide the source material yourself in the context window. Don't ask it to retrieve facts. Ask it to reason about facts you give it.

What Can You Actually Use an LLM For?

💡 Quick Answer: LLMs are genuinely useful for writing and editing, summarising long documents, explaining code and debugging, drafting emails and messages, answering questions about topics it knows well, translating text, and brainstorming. They are unreliable for anything requiring precise factual recall, real-time information without web tools, exact arithmetic, or tasks where being wrong has serious consequences without a human review step.

The honest version, from actually using these tools in real projects:

High-confidence use cases:

Explaining a confusing error message or piece of code
Summarising a long PDF or document you provide
Writing a first draft of something you'll edit anyway
Translating text between languages
Rubber duck debugging, explaining your problem out loud (to the model) often surfaces the fix
Asking "what's the best way to approach X" problems

Medium-confidence, always verify:

Factual questions about things in its training data
Technical recommendations for tools, libraries, approaches
Research starting points (use as a lead, then verify)

Low-confidence, use with caution:

Anything requiring numbers, dates, citations, or specific names
Legal or medical advice (not a substitute for professionals)
Anything you can't independently verify

LLM Limitations You Need to Know

💡 Quick Answer:

The four biggest LLM limitations are: knowledge cutoffs (they don't know what happened after their training ended), no memory between sessions by default (each conversation starts fresh), context window limits (there's a cap on how much text they can process at once), and hallucination (they generate confident wrong answers without signalling uncertainty). Understanding these four limitations will save you a lot of frustration.

Knowledge cutoff. Every LLM was trained on data up to a certain date. Ask it about something that happened after that date and it either says it doesn't know or, worse, makes something up. Most frontier models now have web search to work around this, but check if it's turned on.

No persistent memory. Start a new conversation with Claude or ChatGPT and it has no idea who you are or what you discussed last time. Every session is fresh unless the product explicitly builds in a memory layer.

Context window limits. If your conversation goes on long enough, old parts start getting dropped from the model's "attention." This is why chatbots sometimes seem to "forget" what you said an hour ago.

It can't actually run code by default. When an LLM generates code, it's generating text that looks like code. It can't tell if the code runs unless it has a code execution tool attached (like ChatGPT's Code Interpreter).

It doesn't learn from your corrections. If you say "that's wrong, the answer is X", it will agree with you in that conversation, but the underlying model is unchanged. It will make the same mistake again next time.

FAQ

What does LLM stand for?

LLM stands for Large Language Model. "Large" refers to the scale of training data (trillions of words) and model size (billions of parameters). "Language Model" means it's a statistical model of language, trained to predict and generate text. In everyday language, it's the AI system behind tools like ChatGPT, Claude, and Gemini.

Is ChatGPT an LLM?

ChatGPT is a product built on top of an LLM. The underlying model is GPT-4o (OpenAI's large language model). ChatGPT is the interface. The chat window, memory features, and integrations are all wrapped around that model. The same GPT-4o model also powers Microsoft Copilot and many other products.

What is the difference between an LLM and AI?

AI (Artificial Intelligence) is the broad field covering any machine that simulates intelligent behaviour. An LLM is one specific type of AI, focused on language, trained on text, and built on the Transformer architecture. Other types of AI include image recognition systems, recommendation engines, and robotics. When people say "AI" in 2026, they usually mean LLMs specifically, but technically AI is a much wider category.

Can an LLM think?

Not in the way humans think. An LLM generates responses by predicting statistically likely continuations of text. It has no consciousness, no goals, and no internal experience. It can produce outputs that look like reasoning, and in some ways its internal processes resemble aspects of reasoning, but it doesn't "think" in any meaningful sense. Whether that distinction matters for practical purposes is a genuinely interesting philosophical debate, but for using these tools, the answer is: it behaves like it thinks, without actually thinking.

Which LLM is best in 2026?

There's no single best LLM. It depends on your use case. For general everyday use, ChatGPT (GPT-4o) remains the most capable and widely integrated. For long documents and nuanced reasoning, Claude is consistently strong. For real-time information and Google integration, Gemini. For coding specifically, Cursor (which runs on Claude and GPT). For running a model locally without sending data to anyone, LLaMA 3 or Mistral.

What is the difference between an LLM and a chatbot?

A chatbot is a general term for any software that simulates conversation, including simple rule-based systems that just match keywords to preset responses. An LLM is a specific kind of AI that generates language from learned patterns. Modern AI chatbots like Claude and ChatGPT are powered by LLMs, but not all chatbots are LLM-powered. The older chatbots on customer support websites usually aren't.

Do LLMs store my conversations?

It depends on the product and your settings. Most major LLM providers (OpenAI, Anthropic, Google) store conversation data by default and use it to improve their systems, unless you opt out. Claude.ai, ChatGPT, and Gemini all have privacy settings where you can disable training on your data. If you're sharing sensitive information, always check the privacy settings of whatever product you're using.

One Last Thing

The most useful mental shift I've made with LLMs: stop thinking of them as search engines and start thinking of them as a very well-read collaborator who sometimes misremembers things.

Use them to think out loud. Use them to draft things you'll edit. Use them to understand concepts faster than you could from scratch. But for anything where accuracy actually matters, verify it. Every time.

The tools are getting better fast. But understanding how they actually work, which is what you now do, means you'll know when to trust them and when to double-check.

What to read next: If you want to go deeper on any specific part of how LLMs work: the Transformer architecture, why RLHF matters, or how to actually use RAG in a real project, the detailed breakdowns are all on JargonIsEasy.

💡

Author Note: I've built production features using Claude, GPT-4, and Gemini APIs. The one thing nobody tells you upfront: the gap between "works in a demo" and "works reliably in production" is almost entirely about prompt design and knowing your model's failure modes. The hallucination problem is real, but it's manageable once you know what triggers it.