Update on my February posts about replacing RAG retrieval with NL querying — some things I've learned from actually building it

Reddit r/artificial / 4/18/2026


Key Points

  • The author revisits an earlier idea of replacing embedding-based RAG retrieval with NL querying over a citation-grounded document store, sharing lessons from implementing the approach.
  • In practice the system still ended up hybrid, but the driving failure was vocabulary mismatch between queries and content rather than scale; the fix was a lightweight, tag-based index that narrows candidates structurally before the NL query runs.
  • The interface LLM may try to rely on its own internal memory instead of querying the memory store, so the author found that prompt requirements, startup gate checklists, and explicit cost framing are needed to enforce retrieval behavior.
  • The persistent “memory layer” is expected to be decoupled from the interface model because the state lives in the document store, enabling easier model swapping or even multi-model concurrent querying to coordinate against the same memory.
  • The author also notes a potential asymmetry: since retrieval depends on the interface model’s reasoning (not embeddings), using a stronger model could improve retrieval quality directly, not just final answer generation.

A couple of months ago I posted here (r/LLMDevs, r/artificial) proposing that an LLM could save its context window into a citation-grounded document store and query it in plain language, replacing embedding similarity as the retrieval mechanism for reasoning recovery. Karpathy's LLM Knowledge Bases post and a recent TDS context engineering piece have since touched on similar territory, so it felt like a good time to resurface with what I've actually found building it.

The hybrid question got answered in practice

Several commenters in the original threads predicted you'd inevitably end up hybrid — cheap vector filter first, LLM reasoning over the shortlist. That's roughly right, but the failure mode that drove it was different from what I expected. Pure semantic search didn't degrade because of scale per se; it started missing retrievals because the query and the target content used different vocabulary for the same concept. The fix was an index-first strategy — a lightweight topic-tagged index that narrows candidates before the NL query runs. So the hybrid layer is structural metadata, not a vector pre-filter.
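The index-first layer can be sketched as follows. This is a minimal illustration of the structural-narrowing idea, not the repo's actual implementation; the `Note` and `TagIndex` names are assumptions.

```python
# Index-first retrieval sketch: a lightweight topic-tagged index narrows
# candidates structurally before the NL query runs over the shortlist.
from dataclasses import dataclass


@dataclass
class Note:
    note_id: str
    tags: set[str]  # assigned at write time, e.g. by the interface LLM
    text: str


class TagIndex:
    def __init__(self) -> None:
        self._by_tag: dict[str, list[Note]] = {}

    def add(self, note: Note) -> None:
        for tag in note.tags:
            self._by_tag.setdefault(tag, []).append(note)

    def candidates(self, query_tags: set[str]) -> list[Note]:
        # Structural narrowing: union of notes sharing any query tag.
        # No embeddings involved -- matching is on metadata, so vocabulary
        # mismatch in the free text doesn't cause a missed retrieval.
        seen: dict[str, Note] = {}
        for tag in query_tags:
            for note in self._by_tag.get(tag, []):
                seen[note.note_id] = note
        return list(seen.values())
```

The shortlist returned by `candidates()` is what the interface model then reasons over in plain language, so the expensive NL step only sees a structurally filtered subset.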

The LLM resists using its own memory

This one surprised me. Claude has a persistent tendency to prefer internal reasoning over querying the memory store, even when a query would return more accurate results. Left unchecked, it reconstructs rather than retrieves — which is exactly the failure mode the system was designed to prevent. Fixing it required encoding the query requirement in the system prompt, a startup gate checklist, and explicit framing of what it costs to skip retrieval. It's behavioral, not architectural, but it's a real problem that neither article addresses.
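The "startup gate" part of that fix can be enforced mechanically rather than only via prompting. A hypothetical sketch, assuming a wrapper sits between the user and the model (the `RetrievalGate` name and the one-query-per-turn rule are my illustration, not the author's code):

```python
# Behavioral enforcement sketch: the wrapper refuses to produce an answer
# for a turn until at least one memory-store query has been recorded,
# so the model cannot silently reconstruct instead of retrieve.
class RetrievalGate:
    def __init__(self) -> None:
        self.queries_this_turn = 0

    def record_query(self, query: str) -> None:
        # Called whenever the model actually hits the memory store.
        self.queries_this_turn += 1

    def check(self) -> None:
        # Called just before an answer is released to the user.
        if self.queries_this_turn == 0:
            raise RuntimeError(
                "Gate: answer blocked until the memory store is queried"
            )

    def reset(self) -> None:
        # Called at the start of each new turn.
        self.queries_this_turn = 0
```

This moves one piece of the problem from behavioral (hoping the prompt is followed) to architectural (the gate fails closed), though the prompt-level framing is still needed to make the model issue useful queries in the first place.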

The memory layer should decouple from the interface model

One thing I haven't tested but follows logically from the architecture: if the persistent state lives in the document store rather than in the model, the interface LLM becomes interchangeable. You should be able to swap Claude for ChatGPT or Gemini with minimal fidelity loss, and potentially run multiple models concurrently against the same memory as a coordination layer. There's also an interesting quality asymmetry that wouldn't exist in vector RAG: because retrieval here uses the interface model's reasoning rather than a separate embedding step, a more capable model should directly improve retrieval quality — not just generation quality. I haven't verified either of these in practice, but the architecture seems to imply them. Curious whether anyone has tested something similar.
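The decoupling argument reduces to a small interface: if all persistent state lives in the store, the interface model is just a function of the question and the retrieved passages. A sketch under that assumption (the `MemoryStore` protocol and `answer` helper are illustrative names):

```python
# Decoupling sketch: the memory layer is a store the model queries, not
# state inside the model, so the interface LLM reduces to a callable and
# can be swapped -- or several can run against the same store.
from typing import Callable, Protocol


class MemoryStore(Protocol):
    def query(self, question: str) -> list[str]: ...


# An "interface model" is just: (question, retrieved passages) -> answer.
InterfaceModel = Callable[[str, list[str]], str]


def answer(store: MemoryStore, model: InterfaceModel, question: str) -> str:
    # Retrieval depends on the store plus the model's reasoning over it,
    # not on a frozen embedding step -- which is where the quality
    # asymmetry comes from: a stronger `model` improves this step too.
    passages = store.query(question)
    return model(question, passages)
```

Swapping models is then `answer(store, claude_fn, q)` versus `answer(store, gemini_fn, q)` over the identical store, and concurrent coordination is several such calls sharing one store.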

Memory hygiene is a real maintenance problem

Karpathy's post talks about "linting" the wiki for inconsistencies. I ran into a version of this from a different angle: an append-only notes system accumulates stale entries with no way to distinguish resolved from active items. You end up needing something like a note lifecycle (resolve, revise, retract) with versioned identifiers so the system can tell what's current. The maintenance overhead of keeping memory coherent is underappreciated in both the Karpathy and TDS pieces.

Still in the research and build phase. For anyone curious about the ad hoc system I've been using to test this while working through the supporting literature, the repo is here: https://github.com/pjmattingly/Claude-persistent-memory — pre-alpha quality, but it's the working substrate behind the observations above. Happy to go deeper on any of this.

submitted by /u/Particular-Welcome-1