RAG Series (1): Why LLMs Need External Memory

Dev.to / 5/2/2026


Key Points

  • The article explains that most LLM “hallucinations” stem from two root causes: a frozen knowledge cutoff after training and the model’s tendency to generate fluent answers when it lacks information.
  • It argues that an LLM’s internal memory is locked at training time, so it cannot access today’s events or private documents and therefore either admits ignorance or invents plausible content.
  • The piece compares three engineering approaches to address knowledge limitations—fine-tuning, long context, and RAG—highlighting their mechanisms, best-fit scenarios, and trade-offs in cost and updateability.
  • It specifically challenges the misconception that fine-tuning is the best way to inject new facts, stating that fine-tuning mainly changes behavior and language patterns rather than “storing” books, and may reproduce errors in training data.
  • It positions RAG as separating “what to know” (in an external, updatable database) from “how to say it” (handled by the model), requiring retrieval infrastructure to work effectively.

Two Root Causes Behind LLM "Hallucinations"

Anyone who has worked with large language models has run into these two situations:

Situation 1: Knowledge Cutoff

You: What were our company's Q1 sales figures?
GPT: I'm sorry, my training data only goes up to early 2024 and I have
     no access to your company's internal data.

Situation 2: Hallucination

You: How do I use LangChain's RunnablePassthrough?
GPT: RunnablePassthrough can be enabled by calling .with_config(pass_through=True)...
     (This parameter doesn't exist.)

Both problems share the same root cause: an LLM's knowledge is frozen at training time.

The moment training completes, the model's "memory" is locked in place. It has no idea what happened today, knows nothing about your internal documents, and won't go look things up—it can only answer from memory. When memory runs dry, it either admits ignorance or invents a plausible-sounding answer.

That's where hallucinations come from: the model uses fluent language to fill in the gaps in its knowledge.

Three Solutions and How They Compare

There are three engineering approaches to this problem:

| Approach | Mechanism | Best For | Cost |
|---|---|---|---|
| Fine-tuning | Retrain on new data, "baking" knowledge into the parameters | Fixed-domain language style, output format | Expensive, slow to update, limited factual recall |
| Long context | Stuff all documents into the prompt | Small document sets, one-off queries | Token cost grows with document size; quality degrades at extreme lengths |
| RAG | Dynamically retrieve relevant content at query time and inject it into the prompt | Large knowledge bases, continuously updated data | Requires retrieval infrastructure |

A common misconception is that fine-tuning is the right tool for injecting new facts. It isn't.

Fine-tuning changes a model's behavioral patterns and language style—it doesn't "store a book" inside the parameters. Experiments consistently show that fine-tuning on specific Q&A pairs produces limited accuracy gains on related questions, and if training data contains errors, the model confidently repeats those errors.

RAG's core advantage is separating "what to know" from "how to say it":

  • Knowledge lives in an external database and can be updated anytime
  • The model focuses purely on understanding and generation, not memorization

When should you use long context instead of RAG?
When the total document volume is under ~100K tokens, the query is one-off (not recurring), and API costs are acceptable, long context is often simpler. Claude and Gemini's extended context windows make "stuffing a whole book in" genuinely viable. But for enterprise knowledge bases—thousands of documents, continuous updates, multiple concurrent users—RAG remains the more sensible architecture.

What RAG Is: An Open-Book Exam Analogy

RAG = Retrieval-Augmented Generation.

The most intuitive way to think about it: turning a closed-book exam into an open-book exam.

Closed-book (pure LLM): The student answers purely from memory. Anything not memorized gets guessed.

Open-book (RAG): The student can consult reference materials, but still needs to understand the question, find the relevant content, and compose the answer. The reference materials are the external knowledge base; looking things up is the retrieval step.

This analogy reveals two key properties of RAG:

  1. Knowledge lives outside the model — it can be swapped and updated independently
  2. The model handles understanding and generation — after retrieval, the model still needs to "read" the content and produce a coherent response

The Complete RAG Pipeline

RAG operates in two distinct phases: the indexing phase (one-time, offline) and the query phase (real-time, per request).

RAG Architecture Overview

The two-phase RAG architecture — top: offline indexing pipeline; bottom: real-time query pipeline; both share the same Vector DB

Indexing Phase

This phase completes before any user query arrives. It's a one-time preprocessing step.

Raw Documents → Document Loading → Text Splitting → Embedding → Vector Database

Step 1: Document Loading

Convert raw content from various formats into plain text. PDFs, Word docs, Markdown, web pages, code—each format has its own parsing challenges (tables and images in PDFs are notoriously tricky).
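
For example, a minimal PDF loader built on the pypdf library might look like the sketch below (illustrative only; a real pipeline needs per-format loaders and error handling):

# Sketch: extracting plain text from a PDF with the pypdf library.
# Real loaders also need to handle tables, scanned pages, and encodings.
from pypdf import PdfReader

def load_pdf(path: str) -> str:
    """Return the concatenated text of every page in a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)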

Step 2: Text Splitting (Chunking)

Cut long documents into smaller chunks. This step has a significant impact on final quality—chunks too large reduce retrieval precision; chunks too small lose semantic coherence. (Chunking strategies are covered in depth in a later article; for now, just understand why we split.)
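
As a quick preview, the most naive splitter is a fixed-size window with overlap (the sizes below are arbitrary defaults, purely for illustration):

def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap.

    Real splitters also respect sentence and paragraph boundaries.
    """
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks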

Step 3: Embedding

Use an embedding model to convert each text chunk into a high-dimensional vector. This vector captures the semantic meaning of the text—semantically similar texts produce vectors that are close together in the vector space.
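
You can check this property directly. The sketch below assumes the text-embedding-3-small model (the same one used in the hands-on example later in this article); the example sentences are made up:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Embed a single string with text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = embed("How do I reset my password?")
related = embed("I forgot my login credentials.")
unrelated = embed("The weather in Paris is mild in spring.")

print(cosine(q, related))    # related pair: noticeably higher
print(cosine(q, unrelated))  # unrelated pair: noticeably lower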

Step 4: Store in Vector Database

Store all vectors along with their original text in a vector database that supports similarity search (Chroma, Qdrant, Weaviate, etc.).
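
As a concrete illustration, storing and querying chunks with Chroma looks roughly like this (a minimal sketch; the collection name, IDs, and sample vectors are placeholders):

import chromadb

# Placeholder chunks and vectors; in practice these come from the previous steps.
chunks = ["RAG retrieves documents before generation.", "Chunking splits long documents."]
embeddings = [[0.1, 0.2, 0.3], [0.2, 0.1, 0.4]]

chroma_client = chromadb.Client()  # in-memory; chromadb.PersistentClient(path="./db") persists to disk
collection = chroma_client.create_collection(name="docs")

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)

# At query time, pass the query's embedding and get the nearest chunks back.
results = collection.query(query_embeddings=[[0.1, 0.2, 0.35]], n_results=1)
print(results["documents"])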

Query Phase

Executed in real time for every user request.

User Question → Embedding → Similarity Search → Retrieved Chunks → Prompt Assembly → LLM → Answer

Step 1: Query Embedding

Convert the user's question into a vector using the same embedding model.

Step 2: Similarity Search

Find the Top-K most similar text chunks in the vector database. Similarity is measured by how close the vectors are in the embedding space, typically via cosine similarity or dot product.
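
For a small collection you can do this search brute-force in a few lines of numpy (a sketch; production vector databases use approximate nearest-neighbor indexes instead):

import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 2) -> np.ndarray:
    """Brute-force cosine-similarity search over a (num_docs, dim) matrix.

    Returns the row indices of the k most similar documents.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]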

Step 3: Prompt Assembly

Combine the retrieved text chunks with the user's question into a complete prompt and send it to the LLM. A typical format:

You are a knowledge assistant. Answer the user's question based solely on
the reference content provided below.

Reference content:
[Retrieved chunk 1]
[Retrieved chunk 2]
...

User question: [original question]

Please base your answer on the reference content. If the reference content
does not contain relevant information, say so clearly.

Step 4: LLM Generation

The LLM generates an answer grounded in the provided context, rather than from its internal parameters alone.

Hands-On: A Minimal RAG in 100 Lines

No frameworks—just the OpenAI API. Let's implement a working RAG from scratch. The goal is to see exactly what each step does, without the abstraction layers of a framework hiding the details.

"""
Minimal RAG implementation — no frameworks, OpenAI API only.
Demonstrates the complete RAG pipeline: indexing + querying.
"""

import numpy as np
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY environment variable

# ─────────────────────────────────────────
# Simulated knowledge base: 5 technical docs
# ─────────────────────────────────────────
DOCUMENTS = [
    "LangChain is a framework for building LLM applications, providing chaining, memory management, and tool integration.",
    "Vector databases enable semantic search by converting text into high-dimensional vectors. Common options include Chroma, Qdrant, Weaviate, and Pinecone.",
    "RAG (Retrieval-Augmented Generation) reduces LLM hallucinations by retrieving relevant documents before generation, improving answer accuracy.",
    "Embedding models convert text into fixed-dimension vectors, where semantically similar texts are positioned closer together in vector space.",
    "Fine-tuning retrains a model on specific data to adjust its behavior — it's best suited for changing output style, not injecting new factual knowledge.",
]


# ─────────────────────────────────────────
# Indexing phase: convert documents to vectors
# ─────────────────────────────────────────
def build_index(documents: list[str]) -> list[dict]:
    """
    Convert each document into a vector.
    Returns [{text, embedding}, ...]
    In production, store these in a vector database.
    """
    print(f"Indexing {len(documents)} documents...")

    index = []
    for doc in documents:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=doc
        )
        embedding = response.data[0].embedding
        index.append({"text": doc, "embedding": embedding})

    print("Index built successfully.")
    return index


def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Cosine similarity between two vectors. Range: -1 to 1, higher = more similar."""
    a = np.array(vec_a)
    b = np.array(vec_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# ─────────────────────────────────────────
# Query phase: retrieve + generate
# ─────────────────────────────────────────
def retrieve(query: str, index: list[dict], top_k: int = 2) -> list[str]:
    """
    Embed the query and find the top_k most similar documents.
    """
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = response.data[0].embedding

    scored = []
    for doc in index:
        score = cosine_similarity(query_embedding, doc["embedding"])
        scored.append((score, doc["text"]))

    scored.sort(key=lambda x: x[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def generate(query: str, context_docs: list[str]) -> str:
    """
    Assemble retrieved docs + user question into a prompt and call the LLM.
    """
    context = "\n".join([f"- {doc}" for doc in context_docs])

    prompt = f"""You are a knowledge assistant. Answer the user's question based
solely on the reference content below. If the reference content does not contain
relevant information, say so clearly — do not make anything up.

Reference content:
{context}

User question: {query}

Answer:"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


def rag_query(query: str, index: list[dict]) -> str:
    """Full RAG query pipeline."""
    print(f"\nQuestion: {query}")

    docs = retrieve(query, index, top_k=2)
    print(f"Retrieved {len(docs)} relevant documents:")
    for i, doc in enumerate(docs, 1):
        print(f"  [{i}] {doc[:60]}...")

    answer = generate(query, docs)
    print(f"\nAnswer: {answer}")
    return answer


# ─────────────────────────────────────────
# Run the demo
# ─────────────────────────────────────────
if __name__ == "__main__":
    # Build the index (only needs to be done once in a real system)
    index = build_index(DOCUMENTS)

    # Test with a few questions
    rag_query("What is a vector database?", index)
    rag_query("What's the difference between RAG and fine-tuning?", index)
    rag_query("What is Python's GIL?", index)  # Not in the knowledge base — testing refusal

Sample output:

Indexing 5 documents...
Index built successfully.

Question: What is a vector database?
Retrieved 2 relevant documents:
  [1] Vector databases enable semantic search by converting text into high-dim...
  [2] Embedding models convert text into fixed-dimension vectors, where semant...

Answer: A vector database is a database system that enables semantic search by
converting text into high-dimensional vectors. Common examples include Chroma,
Qdrant, Weaviate, and Pinecone...

Question: What is Python's GIL?
Retrieved 2 relevant documents:
  [1] LangChain is a framework for building LLM applications...
  [2] Fine-tuning retrains a model on specific data...

Answer: Based on the provided reference content, I cannot answer your question
about Python's GIL — the reference material does not contain relevant information.

Notice the last question: the knowledge base has nothing about Python's GIL, and the LLM explicitly says it can't answer rather than inventing a response. This is how RAG controls hallucinations: a constraint in the prompt instructs the model to answer only from retrieved content.

Limitations of This Implementation

The 100 lines above demonstrate the complete RAG pipeline, but there are obvious shortcomings:

| Problem | Cause | Engineering Solution |
|---|---|---|
| Vectors live in memory, lost on restart | No persistence | Vector database (Chroma / Qdrant) |
| Long documents passed in directly exceed token limits | No chunking | Text splitter strategies (next article) |
| Poor keyword matching | Pure vector retrieval only, no hybrid search | Hybrid search (later in series) |
| No quality measurement | No evaluation | RAGAS evaluation framework (later in series) |

Each limitation maps directly to a topic covered in the upcoming articles.

Summary

This article addressed three core questions:

  1. Why RAG? — LLM knowledge cutoff and hallucinations both stem from knowledge being frozen in model parameters
  2. What is RAG? — Dynamically retrieve external knowledge at query time, inject it into the prompt, and let the LLM answer based on evidence
  3. RAG vs. alternatives — Fine-tuning changes behavior; long context works for small documents; RAG is built for large-scale, continuously-updated knowledge bases

Next up: the first deep dive into RAG's core components — text chunking strategies. Why does the chunking approach have such a dramatic impact on quality, and how do you choose between the four main strategies?
