Large Language Models (aka LLMs) have a memory problem: their knowledge stops the day their training data was cut off, they don't know your codebase, they don't know last week's tickets…
When they're missing context they don't say so… they guess, confidently. The polite term is hallucination; the less polite one is lying with style.
Retrieval-Augmented Generation (aka RAG) is how you fix that without retraining anything.
Think of it as turning a closed-book exam into an open-book one. The LLM is still the writer, but now it has a librarian: a system that fetches the right passages from your data and hands them over before the model puts pen to paper.
I built Keystone to learn this end-to-end.
Keystone does two things:
- Ingest a GitHub repository's activity → every PR, commit, issue, and discussion
- Answer questions about why the codebase looks the way it does.
The first prototype didn't have a retrieval system, it had a giant string.
That worked for tiny repos. On a real one (1,000+ commits, 500+ merged PRs, tons of issues, plus a tree of ~1,200 files) it broke in four ways at once:
- The prompt blew past the context window.
- The model lost the thread halfway through.
- Latency hit double-digit seconds.
- Every answer cost ~$0.15 in tokens for a query that should cost a fraction of a cent.
That's the moment RAG stops being optional and becomes required.
Below is what I actually built, lifted from the codebase running today 👇🏻
What RAG actually is (and what it isn't)
The clean mental model: RAG is just "search, then prompt."
You convert your data into a search index ahead of time. At query time, you look up the most relevant pieces and paste only those into the prompt. That's it!
We can define two main stages:
- Retrieval: search your data and pull the chunks most relevant to the user's question.
- Generation: send those chunks plus the question to a regular LLM call and let it write the answer.
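If that sounds abstract, the whole skeleton fits in a couple of dozen lines. This is a deliberately minimal sketch, not Keystone's code: searchIndex stands in for whatever store you query, and the model names are just the ones I use later in this post.
import { embed, generateText } from 'ai'
import { mistral } from '@ai-sdk/mistral'

// Hypothetical store lookup: return the top-k chunks closest to the query vector
async function searchIndex(queryVector: number[], topK: number): Promise<string[]> {
  // ...the pgvector query covered in section 3
  return []
}

async function answer(question: string): Promise<string> {
  // Retrieval: embed the question and fetch the closest chunks
  const { embedding } = await embed({
    model: mistral.textEmbeddingModel('codestral-embed-2505'),
    value: question
  })
  const chunks = await searchIndex(embedding, 12)

  // Generation: paste only those chunks into a regular LLM call
  const { text } = await generateText({
    model: mistral('devstral-small-latest'),
    system: 'Answer using only the provided context.',
    prompt: `Context:\n${chunks.join('\n---\n')}\n\nQuestion: ${question}`
  })
  return text
}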
Everything else (embeddings, vector databases, re-ranking, hybrid search…) exists because matching meaning is harder than matching keywords.
When to use RAG
Use it when the answer depends on facts the model doesn't have.
Private docs, your codebase, last week's tickets, anything post-cutoff or non-public.
It's also how you get grounded answers with citations: the model can point at exactly which chunk of which document it used, which is the difference between a tool you can ship to users and a demo you can't.
When NOT to use RAG
You want the model to behave differently (tone, format, reasoning style). That's a *fine-tuning* or prompting problem, not a retrieval problem.
RAG injects knowledge; it does not change behavior.
The second mistake I see: people reach for RAG when the data is small enough to fit in a prompt. If you have 10,000 tokens of context, just paste it. RAG buys you scale at the cost of an extra layer that will leak relevance bugs into your product.
The four stages every RAG has
Every RAG in production has the same four stages, and each one breaks in its own special way if you do it naively:
- Ingestion: pull data from somewhere and split it into chunks.
- Embedding: turn each chunk into a vector so similarity becomes math.
- Retrieval: search for the chunks closest to the user's question.
- Synthesis: hand those chunks to an LLM and let it write the answer.
The next sections go through each one in order, with what I do in Keystone and what I'd warn you about.
1. Chunking: where most RAG systems fail
This is the section nobody wants to read because chunking sounds boring.
It's also the section that decides whether your RAG actually works.
The naive approach is "split text every 500 tokens." That dies for two reasons:
- A PR, for example, is not just 500 tokens of one thing. It's a title, a body, a list of files, a list of commit messages, sometimes comments, discussions. Embedding them as one blob averages five different topics into one vector. Retrieval returns the wrong PR because the vector is an average of irrelevant stuff.
- Not all artifacts are equal. A merged PR with five reviews carries more architectural signal than a "Fix typo" commit. Treating them with the same chunk size and same metadata throws away the asymmetry.
I use typed chunking; different artifact types get different chunkers, different size budgets, and different metadata:
function chunkPR(pr: IngestPullRequest): EmbeddingChunk {
  const filesStr = pr.files.join(', ')
  const commitsStr = pr.commits.map(c => c.message).join(' | ')
  const commentsStr = pr.comments?.length
    ? ' | Comments: ' + pr.comments.map(c => `${c.author}: ${c.body}`).join(' | ')
    : ''
  const reviewsStr = pr.reviews?.length
    ? ' | Reviews: ' + pr.reviews.map(r => `${r.author}: ${r.body}`).join(' | ')
    : ''
  const raw = `[PR #${pr.number}] ${pr.title}: ${pr.body ?? ''} | Files: ${filesStr} | Commits: ${commitsStr}${commentsStr}${reviewsStr}`
  return {
    sourceId: `pr:${pr.number}`,   // <- stable, dedupable
    content: truncate(raw, 4000),  // <- PRs get 4000 chars
    metadata: { type: 'pr', author: pr.author, number: pr.number, merged_at: pr.merged_at }
  }
}
And here's what a real chunk looks like coming out of that function:
{
  "sourceId": "pr:42",
  "content": "[PR #42] Replace REST with GraphQL for the data layer: Switched from ...",
  "metadata": {
    "type": "pr",
    "author": "wencesms92",
    "number": 42,
    "merged_at": "2025-11-14T10:22:00Z"
  }
}
Issues get their own chunker with the same shape but a smaller budget (1500 chars) and different metadata. Same pattern, different parameters.
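For illustration, here's roughly the shape of that issue chunker. Everything in it is a sketch that mirrors the PR version: the IngestIssue fields, the state metadata, and the truncate helper (shown here as a plain character cap) are assumptions, not lifted from Keystone.
// Assumed: truncate is just a hard character budget
function truncate(text: string, maxChars: number): string {
  return text.length > maxChars ? text.slice(0, maxChars) : text
}

function chunkIssue(issue: IngestIssue): EmbeddingChunk {
  const commentsStr = issue.comments?.length
    ? ' | Comments: ' + issue.comments.map(c => `${c.author}: ${c.body}`).join(' | ')
    : ''
  const raw = `[Issue #${issue.number}] ${issue.title}: ${issue.body ?? ''}${commentsStr}`
  return {
    sourceId: `issue:${issue.number}`,  // <- same stable-key convention as pr:42
    content: truncate(raw, 1500),       // <- issues get 1500 chars
    metadata: { type: 'issue', author: issue.author, number: issue.number, state: issue.state }
  }
}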
Three things to notice:
- The [PR #N] prefix is intentional. Embedding models are sensitive to what's at the front of the text, so putting the artifact type and number first lets the model anchor on it. When I tried without the prefix, the same PR ranked lower for queries like "what did PR 42 change?"
- Each sourceId is stable and globally unique (pr:42, issue:7, readme:root, topology:tree). That key is what makes the upsert work, and it's also what lets a webhook re-embed a single PR after a merge without rebuilding the world. Same chunker, same SQL upsert, just one row.
- Commits get aggregated, not chunked individually. This is the most non-obvious decision in the whole pipeline. If you embed every commit one-by-one, you drown the index in noise. I instead deduplicate commits already present inside a PR (they're embedded with the PR) and then summarize the leftover "orphan" commits into a single chunk:
// Orphans = commits not already inside any PR
const prCommitShas = new Set(data.pullRequests.flatMap(pr =>
  pr.commits.map(c => c.sha)))
const orphans = data.commits.filter(c => !prCommitShas.has(c.sha) &&
  !isNoiseCommit(c))
if (orphans.length > 0) {
  chunks.push({
    sourceId: 'commits:orphan-summary',
    content: truncate(
      `[Commits] ${orphans.length} standalone commits (not in PRs) | Authors: ${authorsStr} | Recent: ${recentMsgs}`,
      4000
    ),
    metadata: { type: 'commits', count: orphans.length, orphan: true }
  })
}
Before the orphans get rolled up, a noise filter strips anything useless:
const NOISE_MSG_PATTERNS = [
  /^merge branch/i, /^merge pull request/i, /^wip$/i, /^fix typo/i,
  /^fixup!/i, /^squash!/i, /^initial commit$/i, /^update \S+$/i
]
const NOISE_AUTHOR_PATTERNS = [/\[bot\]$/, /^dependabot/i, /^renovate/i, /^github-actions/i]
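isNoiseCommit, referenced in the orphan filter above, is presumably nothing more than a test against those two lists. A minimal version:
// Assumed shape: a commit carries at least a message and an author
function isNoiseCommit(c: { message: string; author: string }): boolean {
  return NOISE_MSG_PATTERNS.some(p => p.test(c.message.trim())) ||
         NOISE_AUTHOR_PATTERNS.some(p => p.test(c.author))
}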
Filtering bot commits and merge noise before they hit the embedding API saves cost, keeps the index dense, and stops "what's the architecture" queries from returning seventeen "dependabot bumped lodash" chunks.
So… don't embed garbage!
2. Embeddings: picking the right model
The boring truth: most embedding models are good enough. The real trade-off is dimension count × cost × domain fit.
I went with Mistral AI's codestral-embed-2505 (1536 dimensions), a code-tuned embedding model that ranks code-adjacent text in a way a general-purpose model does not.
Two main reasons:
- Generous free-tier → Mistral's free tier is generous enough to run real embedding workloads without hitting a paywall on day one of a side project. OpenAI's free credits evaporate the moment you embed a real dataset.
- Domain fit → My data is code-adjacent: commit messages, file paths, PR titles.
The call itself is unremarkable, which is the point. The work happens in the chunking, not here:
// At query time, embed the user's question with the same model
const { embedding } = await embed({
  model: mistral.textEmbeddingModel('codestral-embed-2505'),
  value: query
})
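The ingest side is the batch version of the same call, plus the upsert that sourceId makes possible. This is a sketch rather than Keystone's exact code: it assumes a unique index on (projectId, sourceId) and a DB-side default for id, and reuses the ProjectEmbedding columns you'll see in the retrieval query next.
import { embedMany } from 'ai'
import { mistral } from '@ai-sdk/mistral'

async function embedAndStore(projectId: string, chunks: EmbeddingChunk[]) {
  // Batch-embed the chunk contents with the same model used at query time
  const { embeddings } = await embedMany({
    model: mistral.textEmbeddingModel('codestral-embed-2505'),
    values: chunks.map(c => c.content)
  })

  for (const [i, chunk] of chunks.entries()) {
    const vectorStr = `[${embeddings[i].join(',')}]`
    // Upsert keyed on (projectId, sourceId): re-embedding PR 42 after a merge touches one row
    await prisma.$executeRawUnsafe(
      `INSERT INTO "ProjectEmbedding" ("projectId", "sourceId", content, metadata, embedding)
       VALUES ($1, $2, $3, $4::jsonb, $5::vector(1536))
       ON CONFLICT ("projectId", "sourceId")
       DO UPDATE SET content = EXCLUDED.content,
                     metadata = EXCLUDED.metadata,
                     embedding = EXCLUDED.embedding`,
      projectId, chunk.sourceId, chunk.content, JSON.stringify(chunk.metadata), vectorStr
    )
  }
}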
3. Retrieval (that doesn't suck)
Retrieval can be one query:
const vectorStr = `[${embedding.join(',')}]`
const projectIdsArray = `{${projectIds.join(',')}}`
const results = await prisma.$queryRawUnsafe<MatchEmbeddingRow[]>(
  `SELECT pe.id, pe."projectId" as project_id, p.name as project_name,
          pe."sourceId" as source_id, pe.content, pe.metadata,
          (1 - (pe."embedding" <=> $1::vector(1536)))::float as similarity
   FROM "ProjectEmbedding" pe
   JOIN "Project" p ON p.id = pe."projectId"
   WHERE pe."projectId" = ANY($2::text[])
     AND 1 - (pe."embedding" <=> $1::vector(1536)) > 0.3
   ORDER BY pe."embedding" <=> $1::vector(1536)
   LIMIT 12`,
  vectorStr,
  projectIdsArray
)
Five things this is doing on purpose:
- <=> is the pgvector cosine-distance operator. Combined with the HNSW index built on vector_cosine_ops, this query uses the index instead of a sequential scan (the index itself is sketched right after this list).
- Pre-filter by projectId = ANY(...) before the vector search. Permissioning happens before similarity ranking, so you never see a chunk from a project you don't have access to, and the index narrows the search space.
- Threshold of 0.3 similarity. Below that, the chunk is more noise than signal. Lower threshold → more recall → more garbage in the prompt. Tune this on real queries, not synthetic ones.
- Top 12 results. Enough that 2-3 misses still leave a usable signal; small enough that the prompt stays cheap. I started at 25 and it was overkill. The model latched onto the first 5 anyway and the rest were filler.
- JOIN the Project name in the SELECT. When the query spans multiple repos, the model needs to know which repo a chunk came from. The repo name shows up in the chunk payload, which is what lets the answer cite [repo-A] vs [repo-B] accurately.
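For reference, the index the first point leans on is a one-time migration, roughly this (the index name is mine; the operator class is the pgvector one mentioned above):
// One-time migration (assumed name): HNSW over the embedding column,
// built on the same cosine operator class the <=> query relies on
await prisma.$executeRawUnsafe(
  `CREATE INDEX IF NOT EXISTS "ProjectEmbedding_embedding_hnsw_idx"
   ON "ProjectEmbedding"
   USING hnsw ("embedding" vector_cosine_ops)`
)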
No re-ranker. No keyword pre-filter. One stage.
The chunking does enough work upfront that a second-stage ranker hasn't been worth its latency yet, and that's a real result, not laziness.
Re-rankers earn their keep when your chunks are big, noisy, and undifferentiated. My chunks are small, typed, and prefixed.
4. Context assembly and the LLM call
This is where Keystone diverges from textbook RAG.
Classic RAG does this:
embed(query) → search → concat(top_k) → prompt → generate
I do this:
prompt(LLM, tools={search, tree, file}) → LLM decides → up to 10 tool calls → final answer
The LLM is the orchestrator. It sees a system prompt that explains the two data sources, vectorized memory vs. live code, and the available repos. Then it chooses which tool to call.
The split looks like this:
- Vectorized memory holds the why → PR descriptions, issue threads, commit messages, the artifacts where decisions are explained. Vectors of these stay useful even when the code drifts.
- Live file access holds the what → The current package.json, the current list of plugins, the current value of a constant. Stale vectors of months-old code lie about the present, so for "what" questions I read the file fresh via the GitHub API.
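Wiring that up is mostly declaring the tools and capping the loop. Here's a sketch assuming the AI SDK's tool-calling interface: searchTechnicalMemory matches the logs below, while readRepoFile, searchEmbeddings, and fetchFileFromGitHub are illustrative names, not Keystone's actual helpers.
import { generateText, tool } from 'ai'
import { mistral } from '@ai-sdk/mistral'
import { z } from 'zod'

// searchEmbeddings wraps the pgvector query from section 3;
// fetchFileFromGitHub is a hypothetical helper around the GitHub contents API
const result = await generateText({
  model: mistral('devstral-small-latest'),
  system: 'You answer questions about these repos. Vectorized memory holds the why; live files hold the what.',
  prompt: 'What is the relationship between open-webui, opencode, and openclaw?',
  maxSteps: 10, // up to 10 tool calls before the model has to answer
  tools: {
    searchTechnicalMemory: tool({
      description: 'Semantic search over PRs, issues, commits, READMEs, and repo topology',
      parameters: z.object({ query: z.string(), projectIds: z.array(z.string()) }),
      execute: async ({ query, projectIds }) => searchEmbeddings(query, projectIds)
    }),
    readRepoFile: tool({
      description: 'Fetch the current contents of a file straight from GitHub',
      parameters: z.object({ repo: z.string(), path: z.string() }),
      execute: async ({ repo, path }) => fetchFileFromGitHub(repo, path)
    })
  }
})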
Here's what the agentic retrieval actually looks like in production logs:
[Chat Tool] searchTechnicalMemory: "relationship between open-webui, opencode, and openclaw" across 3 project(s)
[Chat Tool] Found 9 results [
  { repo: 'open-webui', sourceId: 'readme:root', similarity: 0.555 },
  { repo: 'openclaw', sourceId: 'readme:root', similarity: 0.544 },
  { repo: 'opencode', sourceId: 'readme:root', similarity: 0.503 },
  { repo: 'openclaw', sourceId: 'topology:tree', similarity: 0.465 },
  { repo: 'open-webui', sourceId: 'topology:tree', similarity: 0.463 },
  { repo: 'opencode', sourceId: 'topology:tree', similarity: 0.463 },
  { repo: 'opencode', sourceId: 'commits:orphan-summary', similarity: 0.439 },
  { repo: 'openclaw', sourceId: 'commits:orphan-summary', similarity: 0.431 },
  { repo: 'open-webui', sourceId: 'commits:orphan-summary', similarity: 0.420 }
]
The model chose to search across all three repos in a single call; it understood the query was cross-project without being told.
- The readme:root chunks rank highest (0.55, 0.54, 0.50) because READMEs describe what a project is, and the query asks exactly that.
- The topology:tree chunks rank next: file structure is the second most useful signal for understanding how three repos relate.
- The commits:orphan-summary chunks come in last but still above the 0.3 floor, adding commit-level context without the noise of individual commits.
Two practical effects:
- The model can iterate → It might search memory, realize the answer needs a file, fetch the file, then answer.
- The prompt stays small → Only the chunks the model actually requested make it into the conversation. No "stuff top-25 into system prompt" bloat.
The synthesis model itself is Mistral AI's devstral-small-latest: small, cheap, fast. With good retrieval you don't need a frontier model for the writing step. The expensive part of "intelligence" is finding the right context. Writing a coherent paragraph from good context is the easy part.
Every call gets logged with input/output tokens, step count, and finish reason, both to a usage table and to PostHog. That's the observability layer that lets me actually answer "is retrieval getting better or worse this week?" with a graph instead of a vibe.
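Concretely, that's one capture call per answer. A sketch assuming posthog-node and the usage object generateText returns (event and property names are illustrative; result and userId come from the surrounding call):
import { PostHog } from 'posthog-node'

const posthog = new PostHog(process.env.POSTHOG_API_KEY!)

// After each generateText call; `result` is its return value, `userId` is whoever asked
posthog.capture({
  distinctId: userId,
  event: 'chat_answer_generated', // <- illustrative event name
  properties: {
    inputTokens: result.usage.promptTokens,
    outputTokens: result.usage.completionTokens,
    steps: result.steps.length,
    finishReason: result.finishReason
  }
})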
Closing
The pipeline above (typed chunking, code-tuned embeddings, HNSW + pgvector, and an LLM that knows when to search) is what's running inside Keystone today.
It's small, opinionated, and it works because every stage has one job and respects the constraints of the next.
If there's one thing to take away: ignore the model leaderboards for a week and go obsess over your chunking. That's where the wins are.
The fanciest embedding model in the world can't rescue data that's been concatenated into mush, and the cheapest model is plenty good when the chunks coming in are sharp, typed, and free of noise.
RAG isn't a magic upgrade for LLMs. It's a librarian, and a librarian is only as good as the way you organized the shelves.
Keystone is the project I'm building to give software teams a living memory of their codebase: every PR, commit, issue, and decision, queryable in natural language.
If you have any suggestions, I'd love to hear them in the comments section!
Thanks for reading! 👋
Wences.





