How LLMs Actually Think

Tokens, Embeddings, and the Foundations of AI Engineering

This is Post 1 of the AI Engineer Roadmap, an 8-part series where I break down AI engineering concepts with real-world analogies and practical code. If you haven't read the intro, start here.


If you've ever typed a prompt into ChatGPT or Claude and thought "okay but what is actually happening right now?", this post is for you.

Before we build anything in this series (and we will, we're building an AI Code Review Bot from scratch), we need to understand the building blocks. What are tokens? What does "embedding" actually mean? Why does the model sometimes make things up with total confidence?

These aren't academic questions. If you're going to build AI-powered features into real products, you need this foundation. Everything in the next 7 posts, from RAG to agents to production systems, depends on understanding what's in this one.

How each section works: Every concept below follows a pattern. First, I explain the idea in plain terms with a real-world analogy. Then a Bot angle shows how it applies to the Code Review Bot we're building across this series. Code examples follow where relevant. And finally, an EL11 block explains the same thing the way you'd explain it to an eleven-year-old. Feel free to skip around based on what works for you.

Let's get into it.


1. Tokens: How LLMs Read

Here's something that tripped me up early on. LLMs don't read words. They read tokens.

When you type I love building apps, the model doesn't see four words. It breaks that sentence into tokens, small chunks that might be words, parts of words, or even single characters. The sentence above might become something like: I, love, building, apps. Four tokens, roughly mapping to words.

But it gets weird fast. Take a function name like getUserById. To us, that's one thing, a function name. To the tokenizer, it's something like: get, User, By, Id. Four tokens for one identifier.

Now try the same logic in Python: get_user_by_id. Because of the underscores, it tokenizes differently: get, _user, _by, _id. Same concept, different token count.
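To make the camelCase vs snake_case difference concrete, here's a toy sketch. Fair warning: this is a hand-written regex, not a real tokenizer. Real tokenizers use learned subword vocabularies, and splitIdentifier is a name I made up for illustration. It only mimics the splits described above.

```javascript
// Toy illustration only: real tokenizers learn their pieces from data
// (for example with byte-pair encoding) instead of using fixed rules.
function splitIdentifier(identifier) {
  // First alternative: optional underscore, optional capital letter,
  // then a run of lowercase letters or digits ("get", "User", "_user").
  // Second alternative: a run of capitals not followed by a lowercase
  // letter, so acronyms like "HTTP" stay together.
  const pieces = identifier.match(/_?[A-Z]?[a-z0-9]+|_?[A-Z]+(?![a-z])/g);
  return pieces ?? [];
}

console.log(splitIdentifier("getUserById"));    // ["get", "User", "By", "Id"]
console.log(splitIdentifier("get_user_by_id")); // ["get", "_user", "_by", "_id"]
```

Both identifiers come out as four pieces here, but the pieces themselves differ, and with a real tokenizer the counts can differ too.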

And it gets even weirder with non-English text. A simple Tamil sentence might use 3–4x more tokens than the English equivalent, because the tokenizer was trained primarily on English text. This isn't a fun trivia fact. It directly affects cost (you pay per token) and performance (more tokens means slower responses).

The LEGO analogy: Think of tokens like LEGO bricks. Language gets broken down into small, standardised pieces. English words are like common 2x4 bricks, they're everywhere in the set, so they map neatly to single tokens. Uncommon words or non-English text are like those weird angled pieces. The tokenizer needs to combine multiple smaller bricks to represent them.

Bot angle: When the bot reads your code, it's consuming tokens. A 500-line file might be 2,000 to 3,000 tokens. A full codebase of 50,000 lines could be 150,000 or more. This matters because every model has a limit on how many tokens it can handle at once (more on that in Section 4). It also matters for cost, since every token in and out costs money.

javascript
// What we see:
const getUserById = async (userId) => {
  const user = await db.users.findOne({ _id: userId });
  return user;
};

// What the LLM sees (approximately):
// ["const", " get", "User", "By", "Id", " =", " async", " (",
//  "userId", ")", " =>", " {", "\n", "  const", " user", " =",
//  " await", " db", ".users", ".find", "One", "({", " _id", ":",
//  " userId", " })", ";", "\n", "  return", " user", ";", "\n", "};"]
// That's ~33 tokens for 4 lines of code.
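Since every token in and out costs money, it helps to turn the Bot angle's numbers into a rough budget. This is a back-of-the-envelope sketch, not a billing tool. The tokens-per-line ratio comes from the 500-line estimate above, and the prices are hypothetical placeholders I picked for the example, so swap in your provider's real numbers.

```javascript
// Rough budgeting sketch. A 500-line file at 2,000 to 3,000 tokens works out
// to about 4 to 6 tokens per line, so 5 is used as the default here.
function estimateTokens(lineCount, tokensPerLine = 5) {
  return Math.round(lineCount * tokensPerLine);
}

// Prices are expressed in USD per million tokens.
function estimateCostUsd(inputTokens, outputTokens, pricing) {
  const inputCost = (inputTokens / 1_000_000) * pricing.inputPerMillionUsd;
  const outputCost = (outputTokens / 1_000_000) * pricing.outputPerMillionUsd;
  return inputCost + outputCost;
}

// HYPOTHETICAL prices, not any provider's actual pricing.
const hypotheticalPricing = { inputPerMillionUsd: 3, outputPerMillionUsd: 15 };

const inputTokens = estimateTokens(500);
const costPerReview = estimateCostUsd(inputTokens, 1024, hypotheticalPricing);
console.log(`~${inputTokens} input tokens, ~$${costPerReview.toFixed(4)} per review`);
```

Multiply that by the number of pull requests your team opens in a month and the per-token cost stops being abstract.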

Try it yourself: Head to tiktokenizer.vercel.app, paste some code, then paste a paragraph of English, then try some text in another language. Watch the token counts. It's genuinely eye-opening.

EL11: Imagine you're sending a text message, but your phone can only understand a fixed dictionary of word-pieces. Common English words like "the" or "hello" are already in the dictionary as one piece. But a long or unusual word like "unbelievable" might need to be split into "un", "believ", "able". Three pieces. Every piece is a token. The more pieces your message needs, the longer it takes to send and the more it costs. That's basically what happens when an LLM reads your text.


2. Embeddings: Numbers That Capture Meaning

This one took me a while to really get, but once it clicked, everything else in AI engineering started making more sense.

An embedding is a list of numbers that represents the meaning of something. Not the text itself. The meaning.

When the model processes the token getUserById, it doesn't just store the letters. It converts it into a long list of numbers, something like [0.12, -0.84, 0.33, 0.67, ...] (in reality, this list is hundreds or thousands of numbers long). This list of numbers is the embedding.

Here's why this matters. Things that are similar in meaning end up with similar numbers. getUserById and fetchUserById do the same thing, they retrieve a user. Their embeddings would be very close to each other in this number space. But getUserById and banana? Those embeddings would be far apart.

The Spotify analogy: Think about how Spotify recommends music. You never told Spotify "I like songs with 120 BPM, minor key, electronic drums, and melancholic vocals." But it figured it out. How? Spotify converts every song into a list of numbers based on its characteristics: tempo, energy, danceability, acousticness, and dozens of other features. Songs that are close together in that number space sound similar. Songs far apart sound different. When you like a song, Spotify finds other songs nearby in that space.

Embeddings do the exact same thing with text and code. Instead of tempo and energy, the numbers capture things like "is this about data retrieval?" or "does this involve error handling?" or "is this authentication-related?" The model learns these dimensions during training. We don't define them manually.

Bot angle: This is the foundation of how the bot will understand code semantically. In the next post, we'll use embeddings to build a search system where the bot can find relevant code in your codebase. Not by matching keywords, but by understanding meaning. Someone asks "how does authentication work?" and the bot finds the right files even if none of them contain the word "authentication."

javascript
// Conceptually, this is what embeddings enable:

// These two are semantically similar, so embeddings are close together
const embedding1 = await getEmbedding("getUserById");    // [0.12, -0.84, 0.33, ...]
const embedding2 = await getEmbedding("fetchUserById");  // [0.11, -0.82, 0.35, ...]
// Distance between them: very small (≈ 0.03)

// This one is semantically different, so the embedding is far away
const embedding3 = await getEmbedding("banana");         // [-0.71, 0.45, -0.22, ...]
// Distance from embedding1: very large (≈ 1.47)

// This is how semantic search works. Find code that's "close" to the question.
const queryEmbedding = await getEmbedding("How do we fetch user data?");
const similarCode = await findClosestEmbeddings(queryEmbedding, codebaseEmbeddings);
// Returns: getUserById, fetchUserById, loadUserProfile, etc.
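The snippet above leans on placeholder helpers, so here's a small runnable sketch of the "closeness" part. It uses cosine similarity, which is one common way to compare embeddings (higher means more similar). The 4-number vectors are made up for illustration. Real embeddings come from an embedding model and are far longer, and toyEmbeddings and rankBySimilarity are names I invented for this example.

```javascript
// Tiny made-up vectors. Real embeddings have hundreds or thousands of numbers.
const toyEmbeddings = {
  getUserById: [0.12, -0.84, 0.33, 0.67],
  fetchUserById: [0.11, -0.82, 0.35, 0.65],
  banana: [-0.71, 0.45, -0.22, 0.1],
};

// Cosine similarity: 1 means same direction, 0 unrelated, -1 opposite.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored vector against the query and sort best match first.
function rankBySimilarity(queryVector, embeddings) {
  return Object.entries(embeddings)
    .map(([name, vector]) => ({ name, score: cosineSimilarity(queryVector, vector) }))
    .sort((x, y) => y.score - x.score);
}

const ranked = rankBySimilarity(toyEmbeddings.getUserById, toyEmbeddings);
console.log(ranked.map((r) => `${r.name}: ${r.score.toFixed(3)}`));
```

With these toy numbers, fetchUserById ranks right behind getUserById itself and banana lands at the bottom, which is the behaviour semantic search relies on.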

EL11: Imagine every book in your school library has a secret code, a long row of numbers that describes what the book is about. A mystery novel might be [0.9, 0.1, 0.8, 0.2]. Another mystery novel would have similar numbers. A science textbook would have completely different numbers. Now if someone says "I want something like Harry Potter," you don't need to read every book. You just find the ones whose secret code is closest to Harry Potter's code. That's what embeddings do. They turn words, sentences, and code into number-codes so a computer can find things that are similar in meaning.


3. Transformers and Attention: Why Context Matters

You've probably heard the word "transformer" thrown around. It's the "T" in GPT (Generative Pre-trained Transformer). But what does it actually mean?

I'll be honest, I spent way too long trying to understand the full architecture before realising I didn't need to. For building AI-powered products, you need the intuition, not the math. Here's what clicked for me.

Before transformers, language models processed text one word at a time, in order. They'd read "The cat sat on the ___" and predict "mat", but they were bad at understanding relationships between words that were far apart in a sentence.

Transformers introduced something called attention, the ability to look at the entire input at once and figure out which parts are most relevant to each other.

The PR review analogy: Think about how you review a pull request. You don't read every line with equal focus. You skim the import statements, probably fine. You slow down at the business logic, that's where bugs live. You really focus on database queries and authentication code, that's where security issues hide. You're paying different amounts of attention to different parts based on how important they are.

Transformers do something remarkably similar. When processing the sentence "The server crashed because the database connection pool was exhausted," the attention mechanism learns that "crashed" is strongly related to "pool was exhausted", even though they're far apart in the sentence. It connects the cause and effect.

For code, this is especially powerful. When the model reads a function, attention helps it understand that a variable defined on line 3 is being used on line 47, that an error thrown in one module is caught in another, that a function's return type matters for how it's called elsewhere.
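You don't need the math to build products, but if you're curious what "paying different amounts of attention" looks like in numbers, here's a toy sketch of scaled dot-product attention weights for a single token. The vectors are made up by me. In a real transformer they are learned during training, and there are many attention heads and layers on top of this.

```javascript
// Turn raw scores into weights that are positive and sum to 1.
function softmax(scores) {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Score one query vector against every key vector (dot product scaled by
// the square root of the vector length), then normalise with softmax.
function attentionWeights(query, keys) {
  const scale = Math.sqrt(query.length);
  const scores = keys.map(
    (key) => key.reduce((sum, k, i) => sum + k * query[i], 0) / scale
  );
  return softmax(scores);
}

// Made-up 2-number vectors, one per word.
const words = ["server", "crashed", "because", "pool", "exhausted"];
const keys = [[0.2, 0.1], [1.0, 0.9], [0.0, 0.1], [0.8, 1.0], [0.9, 1.1]];
const queryForCrashed = [2.0, 2.0];

const weights = attentionWeights(queryForCrashed, keys);
words.forEach((word, i) => console.log(word.padEnd(10), weights[i].toFixed(2)));
```

With these toy vectors, "pool" and "exhausted" get far more weight than "because", even though they sit further away in the sentence.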

Bot angle: Attention is what allows the bot to understand code in context, not just line by line. It can spot that a variable is used but never initialised, or that a function's error handling doesn't match the errors it could actually throw. Without attention, the model would just be doing pattern matching on individual lines.

EL11: Imagine you're reading a mystery novel and trying to figure out who the villain is. You don't treat every sentence equally. You pay extra attention to the suspicious clues, the weird alibi, the character who showed up at the wrong time. Your brain connects dots across different pages. That's what "attention" does for AI. When it reads a sentence or a piece of code, it figures out which parts are connected to which other parts, even if they're far away from each other. It's like having a highlighter that automatically marks the important relationships.


4. Context Windows: The Memory Limit

Here's a concept that seems simple but has massive practical implications.

Every LLM has a context window, the maximum number of tokens it can see at one time. Think of it as the model's working memory. At the time of writing, Claude's context window is about 200,000 tokens (roughly 150,000 words). GPT-4 varies by version but ranges from 8K to 128K tokens.

Sounds like a lot, right? It is, until you start building real things.

The meeting analogy: Imagine you're in a long meeting. For the first 30 minutes, you remember everything: who said what, the decisions made, the action items. After an hour, things start getting fuzzy. After two hours, someone references "what we agreed on at the beginning" and you're quietly hoping someone took notes. That's a context window. The model can hold a lot in its head, but there's a hard ceiling.

Bot angle: This is where things get real. Let's say your codebase is 50,000 lines of code. That's roughly 150,000 to 200,000 tokens. You might be thinking "great, Claude can handle 200K tokens, just dump it all in." But there are three problems with that.

First, you need room for the prompt itself, plus the model's response. If the context window is full of your code, there's no space left for the actual review.

Second, more tokens means more cost. Sending 200K tokens for every review gets expensive fast.

Third, and this is the important one, models don't treat all parts of a long context equally. Research has shown that information in the middle of a very long context tends to get less attention than information at the beginning or end. So even if it technically fits, cramming everything in doesn't guarantee good results.

This is exactly why we'll need RAG in the next post. Instead of dumping the entire codebase into the prompt, we'll use embeddings to find just the relevant files and feed only those to the model. Smart retrieval beats brute force.

javascript
import fs from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const claude = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// The brute force approach (don't do this)
const entireCodebase = fs.readFileSync('all-50000-lines.js', 'utf-8');
const bruteForceResponse = await claude.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{ role: "user", content: `Review this code:\n${entireCodebase}` }]
});
// Problems: expensive, slow, might exceed context window,
// model loses focus on middle sections

// The smarter approach (we'll build this properly in the next post).
// getFilesFromPR, findRelatedCode, and pullRequest are placeholders for now.
const changedFiles = getFilesFromPR(pullRequest);
const relevantContext = await findRelatedCode(changedFiles); // Uses embeddings!
const focusedResponse = await claude.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [{
    role: "user",
    content: `Review these changes:\n${changedFiles}\n\nRelevant context:\n${relevantContext}`
  }]
});
// Cheaper, faster, more focused review
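Here's one way the "be smart about what we send" idea could look before any API call happens: reserve room for the prompt and the reply, then pack the most relevant files into whatever is left. This is a sketch under assumptions. The relevance scores and token counts would come from an embedding search and a tokenizer, and selectFilesForContext plus the field names are hypothetical, not part of any SDK.

```javascript
// Greedily pack the most relevant files into the remaining token budget.
function selectFilesForContext(files, { contextWindow, promptTokens, maxOutputTokens }) {
  // Reserve room for the instructions and for the model's response.
  let remaining = contextWindow - promptTokens - maxOutputTokens;
  const selected = [];
  // Copy before sorting so the caller's array is left untouched.
  const byRelevance = [...files].sort((a, b) => b.relevance - a.relevance);
  for (const file of byRelevance) {
    if (file.tokens <= remaining) {
      selected.push(file.path);
      remaining -= file.tokens;
    }
  }
  return { selected, remaining };
}

// Made-up candidates with assumed token counts and relevance scores.
const candidateFiles = [
  { path: "utils/format.js", tokens: 6000, relevance: 0.22 },
  { path: "auth/session.js", tokens: 3000, relevance: 0.91 },
  { path: "db/users.js", tokens: 4500, relevance: 0.84 },
];

const packed = selectFilesForContext(candidateFiles, {
  contextWindow: 10000,
  promptTokens: 500,
  maxOutputTokens: 1024,
});
console.log(packed); // selected: auth/session.js and db/users.js, remaining: 976
```

The low-relevance file gets dropped because it no longer fits, which is exactly the trade we want: fewer tokens, better focus.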

EL11: Imagine you have a desk, and you can only spread out 20 pages at a time. If you're working on a school project and you need information from a 200-page book, you can't put all 200 pages on your desk at once. You have to pick the 15 to 20 most useful pages, spread those out, and work with them. If you pick the right pages, you can still do a great job. If you pick random pages, your project will be a mess. An LLM's context window is like that desk. It can only "see" a limited amount at once, so choosing what goes in matters a lot.


5. Why LLMs Hallucinate

This is probably the most important concept for anyone building AI-powered products.

LLMs don't "know" things the way humans do. They predict the most likely next token based on patterns they learned during training. When you ask "What is 2+2?", the model isn't doing math. It's recognising that in its training data, the sequence "2+2=" is overwhelmingly followed by "4". So it outputs "4". Correct answer, but for different reasons than you might think.

Now here's where it gets dangerous. Ask the model about a function called validateOrderFluxCapacitor(), something that obviously doesn't exist. The model may well explain with full confidence what it does, describe its parameters, maybe even suggest improvements. Why? Because the pattern "when asked about a function, explain what it does" is so strong in its training data that it overrides the fact that this function is completely made up.
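To see the "plausible pattern beats truth" effect in miniature, here's a toy next-word predictor built from simple word-pair counts. It is nothing like a real LLM internally (those are neural networks over tokens), and the tiny corpus and function names are made up by me, but the failure mode is the same in spirit: it completes the pattern whether or not the function exists.

```javascript
// Count, for every word, which words followed it in the training text.
function train(corpus) {
  const counts = new Map(); // word -> Map(nextWord -> count)
  const words = corpus.split(/\s+/).filter(Boolean);
  for (let i = 0; i < words.length - 1; i++) {
    const followers = counts.get(words[i]) ?? new Map();
    followers.set(words[i + 1], (followers.get(words[i + 1]) ?? 0) + 1);
    counts.set(words[i], followers);
  }
  return counts;
}

// Pick the follower seen most often (the first one seen wins a tie).
function mostLikelyNext(counts, word) {
  const followers = counts.get(word);
  if (!followers) return null;
  let best = null;
  let bestCount = 0;
  for (const [candidate, count] of followers) {
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}

// Extend the prompt one word at a time, looking only at the last word.
function complete(counts, prompt, maxNewWords) {
  const words = prompt.split(/\s+/).filter(Boolean);
  for (let i = 0; i < maxNewWords; i++) {
    const next = mostLikelyNext(counts, words[words.length - 1]);
    if (next === null) break;
    words.push(next);
  }
  return words.join(" ");
}

const corpus =
  "getUser returns a user . getOrder returns an order . saveUser returns a promise .";
const model = train(corpus);

// The model has never seen this function, but it answers anyway.
console.log(complete(model, "validateOrderFluxCapacitor returns", 2));
// prints "validateOrderFluxCapacitor returns a user"
```

Nothing in the toy model checks whether validateOrderFluxCapacitor is real. It only knows what usually comes after "returns".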

The confident friend analogy: We all have that one friend who never says "I don't know." Ask them anything, obscure history, niche science, cooking tips, and they'll give you an answer with complete confidence. Sometimes they're right. Sometimes they're making it up. But they never tell you which one it is. LLMs are that friend. They generate the most plausible-sounding response, whether or not it's accurate.

Bot angle: This is the core problem we need to solve. If the bot hallucinates during a code review, confidently pointing out a "bug" that doesn't exist or suggesting a fix that breaks something else, developers will stop trusting it within a week. This is why RAG (Post 2) is essential. By grounding the bot in your actual codebase and coding standards, we dramatically reduce hallucination. The bot stops guessing and starts referencing real code.

EL11: Imagine you have a really smart parrot that has listened to millions of conversations. If you ask it a question, it doesn't actually understand the answer. It just repeats what it's heard people say after similar questions. Most of the time, it sounds right because it's heard the right answer a thousand times. But sometimes, especially for unusual questions, it just says whatever sounds most natural, even if it's completely wrong. And here's the tricky part: it says wrong things in the exact same confident voice as right things. That's what hallucination is. The AI isn't lying. It's doing its best pattern-matching, but sometimes the pattern leads to nonsense.


6. Temperature: The Creativity Dial

Last concept for this post, and it's a fun one.

When an LLM generates a response, it doesn't just pick the single most likely next token every time. It calculates a probability distribution: "token A has a 60% chance, token B has 25%, token C has 10%..." and then picks from that distribution.

Temperature controls how it picks.

At temperature 0, it always picks the most likely token. Run the same prompt again and you get the same output, or very nearly the same (hosted APIs can still show small variations between runs). Predictable, consistent, but sometimes boring or repetitive.

At temperature 1, it samples more freely from the distribution. Less likely tokens have a real chance of being picked. The output is more creative and varied, but also more unpredictable.

At temperature 2 (if the model supports it), it's basically rolling dice. Outputs can get weird, incoherent, or surprisingly creative.
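Here's a small sketch of what the dial does to the probabilities. It assumes the usual approach of dividing the model's raw scores (often called logits) by the temperature before a softmax. The three scores are made up, and how any particular API implements this internally is not something this example claims to show.

```javascript
// Convert raw scores into probabilities, reshaped by temperature.
function softmaxWithTemperature(logits, temperature) {
  if (temperature === 0) {
    // Treat 0 as "always pick the top token" to avoid dividing by zero.
    const top = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === top ? 1 : 0));
  }
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Pick an index according to the probabilities. `random` can be swapped out
// so the behaviour is testable; by default it uses Math.random.
function sampleIndex(probabilities, random = Math.random) {
  const r = random();
  let cumulative = 0;
  for (let i = 0; i < probabilities.length; i++) {
    cumulative += probabilities[i];
    if (r < cumulative) return i;
  }
  return probabilities.length - 1; // guard against rounding error
}

const logits = [2.0, 1.1, 0.2]; // hypothetical scores for tokens A, B, C
for (const t of [0, 0.5, 1, 2]) {
  const probs = softmaxWithTemperature(logits, t).map((p) => p.toFixed(2));
  console.log(`temperature ${t}:`, probs.join(", "));
}
```

Low temperatures pile almost all the probability onto token A, while higher ones flatten the distribution so B and C get picked more often.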

The radio analogy: Think of temperature like a radio dial. Turn it all the way to the left (temperature 0) and you get the news. Factual, predictable, no surprises. Turn it to the middle (temperature 0.5 to 0.7) and you get a decent talk show. Mostly makes sense, occasionally says something unexpected. Turn it all the way to the right (temperature 2) and you get late-night experimental radio. Might be brilliant, might be complete chaos.

Bot angle: We'll want low temperature (0 to 0.3) for code reviews. When the bot says "this function has a potential null pointer exception on line 47," we want that to be consistent and reliable, not creative. But if we add a brainstorming feature later, something like "suggest three alternative approaches to this architecture," we might bump the temperature up to get more diverse suggestions.

javascript
// Low temperature: deterministic, consistent code review
const review = await claude.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  temperature: 0,
  messages: [{
    role: "user",
    content: `Review this function for bugs and security issues:\n${codeSnippet}`
  }]
});
// Run it 10 times and you should get the same (or near-identical) review. Good for code review.

// Higher temperature: varied, creative suggestions
const ideas = await claude.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  temperature: 0.8,
  messages: [{
    role: "user",
    content: `Suggest 3 alternative architectures for this module:\n${codeSnippet}`
  }]
});
// Run it 10 times, different suggestions each time. Good for brainstorming.

Try it yourself: Take the same prompt and run it with temperature 0, then 0.5, then 1.0. Watch how the output shifts from robotic consistency to creative variation. It's one of those things that's way more intuitive once you see it in action.

EL11: Imagine you're picking what to have for lunch. At temperature 0, you always pick your absolute favourite: pizza. Every single day. Boring but reliable. At temperature 0.5, you usually pick pizza but sometimes try pasta or a burger. At temperature 1, you might try sushi, a weird salad, or something you've never heard of. And at temperature 2, you might end up eating cereal for lunch just because it sounded interesting in the moment. Temperature is how much the AI is willing to go off-script and try less obvious answers.


Bringing It All Together

Let's take stock of where we are. After this post, here's what we understand about how the Code Review Bot will work at a foundational level:

Tokens → The bot reads code as tokens. A 500-line file is roughly 2,000 to 3,000 tokens. This affects speed, cost, and what fits in the context window.

Embeddings → The bot can understand that getUserById and fetchUserById are similar concepts. This will power semantic code search in the next post.

Attention → The bot can understand relationships across code. A variable defined on line 3 is used on line 47. An error in one file relates to handling in another.

Context windows → We can't dump the entire codebase in. We need to be smart about what we send. This sets up RAG in the next post.

Hallucination → Without grounding, the bot will confidently make things up. We need retrieval (Post 2) and eventually fine-tuning (Post 4) to reduce this.

Temperature → For reviews, we keep it low and deterministic. For suggestions, we can dial it up.

Right now, our bot can "read" code, but it knows nothing about your project, your coding standards, or your team's patterns. It's reviewing code in a vacuum.

Next post, we fix that.

In the next post, we'll build a RAG pipeline that teaches the bot your codebase. We'll use embeddings to create a searchable index of your code, so when the bot reviews a PR, it retrieves relevant context first: your coding standards, related modules, past review patterns. Then it generates a review grounded in real knowledge, not guesses.

The difference between a generic AI code reviewer and one that actually understands your project? That's RAG.


Next up: Post 2: RAG - Teaching Your Code Review Bot to Actually Know Your Codebase
