RAG - Giving LLMs Your Knowledge

This is Post 2 of the AI Engineer Roadmap, an 8-part series where I break down AI engineering concepts with real-world analogies and practical code. If you haven't read Post 1, start there. This one builds directly on it.

At the end of the last post, I said: "Right now our bot can read code but knows nothing about your project. Next post, we fix that."

This is that post.

The problem we left with is real. An LLM trained on the public internet knows a lot about programming in general. It knows what a REST API is, it can spot a missing null check, it understands async/await. But it has no idea that your team never uses moment.js because you migrated to date-fns six months ago. It doesn't know your auth middleware always expects a specific header. It doesn't know that UserService.findById is the only safe way to query users in your codebase.

Without that context, the bot is guessing. And as we covered in Post 1, it guesses confidently.

RAG, retrieval-augmented generation, is how we fix this. Instead of relying on what the model was trained on, we fetch relevant information first, then ask the model to reason over what we fetched. The model doesn't need to "know" your codebase. It just needs to read it.

How each section works: Same pattern as last time. Plain explanation with an analogy, then a Bot angle on how it applies to our Code Review Bot, code where relevant, and an EL11 block for the eleven-year-old explanation.

1. Why RAG Exists

Think about the difference between a closed-book exam and an open-book exam. Same student, same intelligence, same reasoning ability. But on the closed-book exam, you're limited to what you memorised. On the open-book exam, you can look things up. For specific factual questions, the open-book student wins almost every time.

A base LLM is a closed-book exam. It knows what it learned during training and nothing more. RAG gives it a book to open.

The pattern is: before asking the model to generate anything, you retrieve relevant information from your own knowledge base and include it in the prompt. The model then generates a response grounded in that retrieved information, not just its training data.

Bot angle: Before generating a code review, the bot searches your indexed codebase for context related to the PR. If the PR changes a function in userService.js, the bot retrieves other files that interact with that module, any coding standards you've documented, and past patterns in similar files. Then it generates a review grounded in your actual codebase, not some generic version of what it thinks good code looks like.

EL11: Imagine you have a test tomorrow about your school's specific cafeteria menu, the exact prices, the days each dish is served. If you study in advance but then take the test with no notes, you'll probably get some things wrong. But if you're allowed to bring the actual menu to the test, you'll get everything right. RAG is giving the AI the menu. Instead of guessing what your specific codebase looks like, it looks it up first.

2. But What About Large Context Windows?

This is a real debate in the AI engineering world right now, and I think it's worth addressing directly because the "RAG is unnecessary" argument sounds convincing until you think it through.

The argument goes: Claude has a 200K token context window. GPT-4o supports 128K. Why bother with a retrieval system when you can just dump your entire codebase into the prompt?

For small, self-contained tasks, that argument has some merit. Formatting a single file? Just send the file. Explaining a specific function? Include the function and its dependencies. You don't need RAG for everything.

But code review at any real scale breaks this approach in a few ways.

Cost. Sending 150,000 tokens in every request is expensive. If your team opens 20 PRs a day and each review costs $0.50 in input tokens, that's $10 a day, or $3,650 a year, before you pay for a single output token. RAG sends only what's relevant, maybe 5,000 to 10,000 tokens per review instead of 150,000.

Needle in a haystack. Models don't treat all parts of a long context equally. Research on long-context models (often called the "lost in the middle" effect) has repeatedly found that information buried in the middle of a very long prompt is recalled less reliably than information at the beginning or end. The model can technically "see" everything, but it doesn't attend to all of it with equal weight. Your critical coding standard, documented somewhere in a 200K token dump, can easily be overlooked.

Hallucination doesn't go away. A bigger context window doesn't stop the model from misreading or misinterpreting what you sent. It just means there's more for it to get wrong.

Re-feeding is wasteful. Your codebase doesn't change with every PR. If you dump the entire thing each time, you're paying to send the same unchanged files over and over. RAG indexes once and retrieves on demand.
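To make the cost point above concrete, here is a rough sketch using the same numbers. The per-token price is an assumption: it is simply what $0.50 for 150,000 input tokens implies, not a quoted rate from any provider, and yearlyInputCost is a made-up helper name.

```javascript
// Price implied by the example above: $0.50 for 150,000 input tokens.
// This is an assumption for illustration, not a published rate.
const PRICE_PER_TOKEN = 0.5 / 150_000;

// Hypothetical helper: yearly input-token cost for a given prompt size.
function yearlyInputCost(tokensPerReview, reviewsPerDay = 20, daysPerYear = 365) {
  return tokensPerReview * PRICE_PER_TOKEN * reviewsPerDay * daysPerYear;
}

const fullDump = yearlyInputCost(150_000); // whole codebase in every prompt
const withRag = yearlyInputCost(10_000);   // upper end of the RAG estimate

console.log(fullDump.toFixed(2)); // 3650.00
console.log(withRag.toFixed(2));  // 243.33
```

Even at the top of the 5,000 to 10,000 token range, the input bill drops by roughly a factor of fifteen under these assumptions.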

Bot angle: Tools like Cursor use RAG under the hood. When you ask it to explain or change something, it's not sending your entire project to the model. It's searching your indexed codebase for relevant files and including only those. Claude Code takes a different approach: it acts more like an agent that reads files on demand using tools, which we'll get to in Post 3. Different approaches, same underlying motivation: smart selection beats brute force.

EL11: Imagine your teacher asks you a question about the Roman Empire. You have a 10,000-page encyclopedia. Even if you're allowed to use it, you wouldn't read all 10,000 pages before answering. You'd go to the index, find the relevant sections, and read just those. RAG is the index. The context window is the encyclopedia. Having a bigger encyclopedia doesn't mean you should read all of it every time.

3. Chunking: The Step Everyone Skips

Most RAG tutorials jump straight to embeddings and vector search. But chunking is where most RAG systems actually fail, and it's worth slowing down here.

You can't embed an entire file as one unit. Files are too large and too varied. A 500-line service file contains dozens of distinct ideas: individual functions, error handling patterns, data transformation logic. If you embed the whole file as one vector, retrieval becomes imprecise. You might retrieve a file because one function matches, but the retrieved chunk includes 450 lines of unrelated code.

So you split files into smaller pieces first. Each piece gets its own embedding. Retrieval finds the most relevant piece, not the most relevant file.

The question is: how do you split?

Line-based chunking is the naive approach. Cut every N lines. Fast to implement, bad for code. A function split across two chunks loses meaning at the boundary. The model retrieves half a function, misses the other half, and the review suffers.

Function-boundary chunking is better. Detect where functions and classes start, cut there. Each chunk is a complete logical unit. Retrieval precision improves significantly because you're searching for functions, not arbitrary line ranges.

Semantic chunking goes further. Use a model to decide where to cut based on meaning shifts in the code. Most accurate, significantly more expensive, usually overkill unless your retrieval quality is already high and you're optimising from there.

One trick worth knowing: overlap. When you cut at boundaries, repeat the last 10 to 20 lines of one chunk at the start of the next. This prevents context loss at the seam. A function call at the bottom of chunk 3 stays connected to the function definition at the top of chunk 4.
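Overlap can be sketched in a few lines. This is a minimal illustration, not the bot's actual code: addOverlap is a hypothetical helper, and it assumes the chunks are plain strings in file order.

```javascript
// Hypothetical helper: prepend the tail of the previous chunk to each chunk.
// `chunks` is an array of strings in file order; `overlapLines` is how many
// trailing lines of the previous chunk to repeat.
function addOverlap(chunks, overlapLines = 10) {
  return chunks.map((chunk, i) => {
    if (i === 0 || overlapLines <= 0) return chunk;
    const previousTail = chunks[i - 1].split('\n').slice(-overlapLines);
    return [...previousTail, chunk].join('\n');
  });
}

const overlapped = addOverlap(['a\nb\nc', 'd\ne', 'f'], 2);
// overlapped[1] is 'b\nc\nd\ne': the last two lines of chunk 0, then chunk 1
```

If you store line numbers with each chunk, remember that the overlapped text starts earlier than the chunk's own first line.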

Bot angle: For the code review bot, we chunk by function and class boundaries. Each chunk is one logical unit of code. When a PR changes getUserById, we retrieve the chunk containing that function, not the entire file it lives in.

javascript

// Line-based chunking (naive — don't do this for code)
function chunkByLines(content, chunkSize = 50) {
  const lines = content.split('\n');
  const chunks = [];
  for (let i = 0; i < lines.length; i += chunkSize) {
    chunks.push(lines.slice(i, i + chunkSize).join('\n'));
  }
  return chunks;
}
// Problem: a function split across chunk boundaries loses meaning

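// Function-boundary chunking (much better for code)
// Note: this regex only catches top-level `function` and `class` declarations.
// Arrow functions assigned to variables and class methods need a real parser.
function chunkByFunctions(filePath, content) {
  const lines = content.split('\n');
  const chunks = [];
  const boundaryPattern =
    /^(export\s+(default\s+)?)?((async\s+)?function\b|class\s+\w+)/;
  const boundaries = [];

  lines.forEach((line, i) => {
    if (boundaryPattern.test(line)) boundaries.push(i);
  });

  // Keep whatever precedes the first declaration (imports, constants),
  // and fall back to one whole-file chunk when nothing matches.
  if (boundaries[0] !== 0) boundaries.unshift(0);

  for (let i = 0; i < boundaries.length; i++) {
    const start = boundaries[i];
    const end = boundaries[i + 1] ?? lines.length;
    chunks.push({
      filePath,
      content: lines.slice(start, end).join('\n'),
      startLine: start + 1, // 1-indexed, inclusive
      endLine: end,         // 1-indexed, inclusive
    });
  }
  return chunks;
}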

EL11: Imagine you're cutting a book into pieces to put in different filing cabinets. You could cut every 50 pages regardless of where chapters end. But then some chapters would be split across two cabinets, which makes them confusing to read. Better to cut at chapter boundaries, so every piece is a complete idea. Chunking code at function boundaries is the same thing. Each piece you store is a complete thought.

4. Embeddings as the Retrieval Mechanism

We covered embeddings in Post 1: numbers that capture meaning, similar things close together in vector space. Here's where we put that to work.

At index time, you embed every chunk. Each chunk becomes a list of numbers representing its meaning. You store that list alongside the chunk content.

At retrieval time, you embed the query (the PR diff or code snippet you're reviewing). Then you find the stored embeddings closest to the query embedding. "Closest" means semantically similar, not keyword-matching.

This is why you can search for "how do we handle database errors" and retrieve a chunk built around if (err instanceof MongoError), even though the query and the code share almost no keywords. The meaning is close in embedding space.
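"Close" needs a concrete measure. A common choice is cosine similarity, which compares the direction of two vectors and ignores their length. The post doesn't say which metric the bot uses, so treat the metric as an assumption; the toy three-number vectors below stand in for real 1024-number embeddings.

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors, invented for illustration.
const query = [1, 0, 1];
const errorHandler = [2, 0, 2]; // same direction as the query
const unrelated = [0, 1, 0];    // orthogonal to the query

const closeScore = cosineSimilarity(query, errorHandler); // approximately 1
const farScore = cosineSimilarity(query, unrelated);      // exactly 0
```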

One thing that isn't obvious: not all embedding models are equal for code. General-purpose models were trained mostly on natural language, so their sense of "similar" is calibrated for prose. Code has different patterns: identifier naming conventions, structural similarities between functions, type relationships. Models like voyage-code-3, trained specifically on code, capture these patterns better and tend to retrieve more relevant chunks.

Bot angle: When a PR comes in, we embed the PR diff and run a similarity search against the indexed codebase. The retrieved chunks are the most contextually relevant pieces of your project: related functions, similar patterns, files that interact with the changed code. The model reviews the diff with that context in hand.

javascript

import { VoyageAIClient } from 'voyageai';

const voyage = new VoyageAIClient({ apiKey: process.env.VOYAGE_API_KEY });

// Indexing: embed chunks with "document" mode
async function embedChunks(chunks) {
  const response = await voyage.embed({
    input: chunks.map(c => c.content),
    model: 'voyage-code-3',
    inputType: 'document', // optimised for storage
  });
  return chunks.map((chunk, i) => ({
    ...chunk,
    embedding: response.data[i].embedding, // 1024 numbers
  }));
}

// Retrieval: embed the query with "query" mode
async function embedQuery(prDiff) {
  const response = await voyage.embed({
    input: [prDiff],
    model: 'voyage-code-3',
    inputType: 'query', // optimised for similarity search
  });
  return response.data[0].embedding;
}

// The document vs query distinction matters.
// The model calibrates differently depending on which
// direction the similarity comparison is going.

EL11: Remember from last post: every piece of code gets converted into a long list of numbers based on what it means. Now imagine you have thousands of these number-lists stored in a database, one for each piece of code in your project. When you want to find code related to a PR change, you convert the change into a number-list too, then find the stored number-lists that are most similar. You're not matching words, you're matching meaning.

5. Vector Databases

A regular database finds things by exact match or range: give me all users where role = 'admin', give me orders placed after January 1st. The data is organised by IDs, indexes on specific fields, sorted lists.

A vector database finds things by similarity. Given this query vector, find me the N stored vectors closest to it. The data is organised so that similar vectors are near each other in the index structure.

Under the hood, vector databases use approximate nearest neighbor (ANN) algorithms to make this fast. They trade a small amount of accuracy for speed, so a search across millions of vectors takes milliseconds instead of comparing your query against every stored vector one by one. You don't need to understand the internals to use one.
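For contrast, here is what the "one by one" approach looks like: an exact, brute-force search that scores every stored vector. It is a sketch for intuition only. The chunk ids and two-number embeddings are invented, and it assumes no embedding is all zeros.

```javascript
// Exact (brute-force) nearest-neighbour search: score every stored vector.
// The work grows linearly with the number of stored vectors, which is
// exactly what an ANN index is designed to avoid.
function bruteForceTopK(queryVector, stored, k = 5) {
  const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v) => Math.sqrt(dot(v, v));
  return stored
    .map((item) => ({
      id: item.id,
      score: dot(queryVector, item.embedding) / (norm(queryVector) * norm(item.embedding)),
    }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

// Hypothetical chunks with toy 2-dimensional embeddings.
const stored = [
  { id: 'auth.js#verifyToken', embedding: [0.9, 0.1] },
  { id: 'date.js#formatDate', embedding: [0.1, 0.9] },
  { id: 'auth.js#refreshToken', embedding: [0.8, 0.3] },
];

const top2 = bruteForceTopK([1, 0], stored, 2);
// top2 ids, in order: 'auth.js#verifyToken', 'auth.js#refreshToken'
```

This is fine for a few thousand chunks. Past that, the linear scan is the bottleneck a vector index removes.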

There are dedicated vector databases like Pinecone, Weaviate, and Qdrant. There are also extensions to existing databases, such as pgvector for Postgres and Atlas Vector Search for MongoDB.

Bot angle: We use MongoDB Atlas Vector Search because it's the same database we're already running. No new infrastructure. You create a vector search index on the chunks collection, point it at the embedding field, and Atlas handles the similarity search.

javascript

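import { MongoClient } from 'mongodb';

// Assumes MONGODB_URI is set. The database name is a placeholder.
const mongo = new MongoClient(process.env.MONGODB_URI);
const collection = mongo.db('code_review_bot').collection('chunks');

// Retrieve top 5 most relevant chunks for a PR diff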
async function retrieveRelevantChunks(prDiff, topK = 5) {
  const queryEmbedding = await embedQuery(prDiff);

  const results = await collection.aggregate([
    {
      $vectorSearch: {
        index: 'embedding_index',
        path: 'embedding',
        queryVector: queryEmbedding,
        numCandidates: topK * 10, // search wider, return narrower
        limit: topK,
      },
    },
    {
      $project: {
        _id: 0,
        filePath: 1,
        content: 1,
        startLine: 1,
        endLine: 1,
        score: { $meta: 'vectorSearchScore' },
      },
    },
  ]).toArray();

  return results;
}
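The query above assumes an index named embedding_index already exists on the embedding field. A minimal sketch of that index definition might look like the following. The 1024 dimensions match voyage-code-3's default output, cosine similarity is an assumed choice, and buildVectorIndex is a hypothetical helper.

```javascript
// Hypothetical helper that builds the Atlas Vector Search index definition.
// The index name and path match the retrieval query above; the similarity
// function is an assumption.
function buildVectorIndex(numDimensions = 1024) {
  return {
    name: 'embedding_index',
    type: 'vectorSearch',
    definition: {
      fields: [
        { type: 'vector', path: 'embedding', numDimensions, similarity: 'cosine' },
      ],
    },
  };
}

const indexSpec = buildVectorIndex();
// With a connected collection from the official `mongodb` driver:
// await collection.createSearchIndex(indexSpec);
```

The dimension count has to match the embedding model. If you switch models, you rebuild the index and re-embed every chunk.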

EL11: Imagine a library where the books aren't organised by title or author, but by topic, and books about similar things are physically shelved next to each other. Fantasy novels are near other fantasy novels. Science books cluster with other science books. If you walk in holding a page from a book and want to find similar books, you don't search the whole library. You find where that page would live on the shelves and look at what's nearby. A vector database is a library organised by meaning.

6. The Full RAG Pipeline

Let's put the pieces together. The pipeline has two phases.

Indexing (runs once, re-runs when the codebase changes):

1. Walk the codebase, collect all code files
2. Chunk each file by function and class boundaries
3. Embed each chunk using voyage-code-3
4. Store chunks and their embeddings in MongoDB Atlas
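The indexing steps above can be sketched as one small function. To keep it independent of any particular library, the chunker, embedder, and store are passed in as arguments. It assumes the files have already been collected into an array, and the batch size of 64 is an arbitrary assumption (embedding APIs cap how many inputs one request may carry, so check your provider's limit).

```javascript
// Sketch of the indexing phase. `chunk`, `embed`, and `store` are injected:
//   files: [{ filePath, content }]
//   chunk(filePath, content) -> [{ filePath, content, startLine, endLine }]
//   embed(chunks)            -> Promise of the same chunks plus `embedding`
//   store(embeddedChunks)    -> Promise that resolves once they are saved
async function indexCodebase(files, { chunk, embed, store, batchSize = 64 }) {
  const allChunks = files.flatMap((f) => chunk(f.filePath, f.content));
  let indexed = 0;
  // Embed and store in batches rather than in one giant request.
  for (let i = 0; i < allChunks.length; i += batchSize) {
    const batch = allChunks.slice(i, i + batchSize);
    const embedded = await embed(batch);
    await store(embedded);
    indexed += embedded.length;
  }
  return indexed; // number of chunks written
}

// Possible wiring with the functions from this post:
// await indexCodebase(files, {
//   chunk: chunkByFunctions,
//   embed: embedChunks,
//   store: (docs) => collection.insertMany(docs),
// });
```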

Review (runs on every PR):

1. Receive the PR diff
2. Embed the diff using voyage-code-3 in query mode
3. Search Atlas for the top 5 most semantically similar chunks
4. Build a prompt: system instructions + retrieved context + diff
5. Send to Claude, receive a grounded review

javascript

import Anthropic from '@anthropic-ai/sdk';

const claude = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

async function reviewPullRequest(prDiff) {
  // Step 1: retrieve relevant context
  const relevantChunks = await retrieveRelevantChunks(prDiff, 5);

  // Step 2: format context for the prompt
  const contextBlock = relevantChunks
    .map(chunk =>
      `// ${chunk.filePath} (lines ${chunk.startLine}-${chunk.endLine})\n${chunk.content}`
    )
    .join('\n\n---\n\n');

  // Step 3: generate a grounded review
  const response = await claude.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: `You are a code reviewer with access to the relevant codebase context.
Review the diff. Reference specific lines. Ground your feedback in the context provided.
Flag bugs, pattern inconsistencies, and maintainability issues. Skip compliments.`,
    messages: [{
      role: 'user',
      content: `Relevant codebase context:\n\n${contextBlock}\n\nPR diff to review:\n\n${prDiff}`,
    }],
  });

  return response.content[0].text;
}

EL11: The pipeline is like a research assistant helping you write a report. You give them the new information (the PR diff). They go to the library (the vector database), find the most relevant books (the similar chunks), and bring them back. Then you sit down with the new information and the relevant books, and write the report (the code review). Without the library trip, the report is based only on what you already knew. With it, it's grounded in real, specific knowledge.

7. When NOT to Use RAG

Most RAG tutorials don't include this section. I think that's a mistake, because RAG is one of those things that gets applied to problems it doesn't fit.

Don't use RAG when the context fits in the prompt. If you're reviewing a single file under 10,000 tokens, just include it. Retrieval adds latency and complexity. If you don't need it, skip it.

Don't use RAG when you need exhaustive coverage. RAG retrieves a sample of relevant chunks. It might miss the one chunk that matters most. For tasks where missing any relevant piece is a real problem, retrieval alone isn't reliable enough.

Don't use RAG when your data changes faster than you can re-index. If your codebase is constantly changing and your index is stale, retrieval returns outdated context. Stale vectors are often worse than no vectors, because the model confidently uses wrong information.

Don't confuse RAG with fine-tuning. This one tripped me up early on. Fine-tuning changes how the model behaves. RAG changes what the model knows at inference time. If your team has a specific review style, or the bot keeps flagging something your team does intentionally, that's a prompting or fine-tuning problem. RAG is for knowledge, not behavior. More context won't fix a behavioral mismatch.
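The first case above (context that already fits in the prompt) is easy to check in code before you reach for retrieval. This sketch uses a crude rule of thumb of about four characters per token. That ratio is an assumption and real tokenizers vary, so treat the result as an estimate. Both function names are hypothetical.

```javascript
// Very rough heuristic: about 4 characters per token.
// Real tokenizers vary, so this is an estimate, not a measurement.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical gate: skip retrieval when the material already fits the budget.
function needsRetrieval(fullContext, tokenBudget = 10_000) {
  return estimateTokens(fullContext) > tokenBudget;
}

needsRetrieval('x'.repeat(2_000));   // false: about 500 tokens, just send it
needsRetrieval('x'.repeat(400_000)); // true: about 100,000 tokens, retrieve instead
```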

Bot angle: RAG is the right call for our bot. The codebase is too large to fit in every prompt, it changes slowly enough that daily re-indexing is fine, and what we're adding is knowledge (your codebase), not behavior (how to write reviews). But if the bot keeps flagging your team's intentional use of a specific library as an error, RAG won't fix that. We'll cover that distinction in Post 4 when we get to fine-tuning.
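Daily re-indexing doesn't have to mean re-embedding everything. One way to keep it cheap, sketched here under assumptions, is to store a content hash per file alongside the index and only re-embed files whose hash changed. Where the hashes come from (a git blob SHA, for example) is left open, and planReindex is a hypothetical helper.

```javascript
// Hypothetical planner: compare the hashes stored with the index against the
// hashes of the current checkout. Both arguments map filePath -> content hash.
function planReindex(indexedHashes, currentHashes) {
  const toEmbed = [];
  const toDelete = [];
  for (const [filePath, hash] of Object.entries(currentHashes)) {
    if (indexedHashes[filePath] !== hash) toEmbed.push(filePath); // new or changed
  }
  for (const filePath of Object.keys(indexedHashes)) {
    if (!(filePath in currentHashes)) toDelete.push(filePath); // removed from repo
  }
  return { toEmbed, toDelete };
}

const plan = planReindex(
  { 'a.js': 'h1', 'b.js': 'h2', 'old.js': 'h3' },
  { 'a.js': 'h1', 'b.js': 'h2-changed', 'new.js': 'h4' }
);
// plan.toEmbed  -> ['b.js', 'new.js']
// plan.toDelete -> ['old.js']
```

For a changed file, delete its old chunks before inserting the new ones, otherwise the stale vectors stay searchable.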

EL11: A calculator is great at math. But if someone asks you what time it is, you don't reach for a calculator. RAG is great at giving an AI access to specific knowledge it wasn't trained on. But if the AI is giving wrong answers because it doesn't understand your team's conventions or style, more knowledge doesn't fix that. Different problem, different tool.

Where the Bot Stands Now

After this post, the Code Review Bot has gone from reviewing code in a vacuum to reviewing code with real context.

We index the codebase into MongoDB Atlas using function-boundary chunking and voyage-code-3 embeddings. When a PR comes in, we embed the diff, retrieve the five most relevant chunks via vector search, and pass those to Claude alongside the diff. The review is now grounded in your actual codebase, not a guess.

But there's still a lot of manual wiring. Someone has to trigger the review. The bot processes files in bulk rather than deciding which ones need closer attention. It doesn't post comments back to GitHub. It doesn't know when to look for more context or when five chunks is enough.

That's what agents are for.

In the next post, we give the bot autonomy. It receives a PR webhook, decides which files matter, retrieves context for each one, runs the review, and posts comments directly to GitHub, without any human prompting at each step. The bot stops being a function you call and starts being something that acts on its own.

Next up: Post 3: AI Agents - Making LLMs Do Things

This is part of the AI Engineer Roadmap, an 8-part series on building AI-powered products. Start from the beginning.