Three limitations I keep hitting with retrieval-augmented generation in production and I'm running out of ideas [D]

Reddit r/MachineLearning / 4/27/2026

💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research

Key Points

  • The author reports running a production retrieval-augmented generation (RAG) system for German legal/regulatory documents and notes it succeeds on about 80% of queries but fails in three predictable patterns.
  • The “scatter problem” occurs when a question requires information spread across many documents with small contributions, causing vector search to retrieve only a few well-matching documents and produce partial but seemingly complete answers.
  • The “negative knowledge problem” describes cases where the knowledge base has no relevant guidance, yet the system still returns a confident-sounding answer by synthesizing from vaguely related retrieved chunks, and threshold gating does not reliably separate on-topic from off-topic queries.
  • The “timeline problem” arises for questions requiring before/after interpretation around specific dates or rulings, where the system struggles to construct coherent temporal narratives when retrieved chunks do not explicitly connect to each other.
  • The author concludes that addressing these issues likely requires fundamentally different retrieval strategies (e.g., query decomposition tailored to the domain, temporal filtering, or period-specific retrieval) rather than minor prompt or parameter tweaks.

I've had a RAG system running in production for a few months now (legal domain, German regulatory documents). It handles 80% of queries well but there are three patterns where it fails predictably and I haven't found clean solutions.

The scatter problem.

Some questions need information from 8-10 different documents where each one contributes just a small piece. Vector search finds chunks related to the query but not chunks related to each other. So when someone asks something like "compare how notification deadlines work across different German federal states" the system finds 2-3 state-specific documents that happen to match the query well and misses the rest. The answer looks complete but it's actually partial. Cranking up k adds noise and burns tokens without reliably solving it because the missing documents might use completely different terminology for the same concept.
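
One mitigation that gets suggested for the terminology mismatch (not something the post describes trying) is multi-query retrieval with rank fusion: generate a few paraphrases of the question, retrieve for each, and merge the ranked lists so documents that word the concept differently still surface. A rough sketch, where `retrieve(query, k)` and `paraphrase(question, n)` are hypothetical stand-ins for the pipeline's search call and an LLM paraphrasing step:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, c=60):
    """Reciprocal rank fusion: reward documents ranked highly in any list."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

def multi_query_retrieve(question, retrieve, paraphrase, n_variants=4, k=10):
    """Retrieve with the original query plus LLM paraphrases, then fuse."""
    queries = [question] + paraphrase(question, n=n_variants)
    ranked_lists = [retrieve(q, k=k) for q in queries]
    return rrf_fuse(ranked_lists)
```

This helps when the missing documents use different vocabulary, but it does nothing for coverage: nothing guarantees the paraphrases span all the states.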

I've thought about query decomposition (break the question into sub-queries per state) but that assumes the system knows upfront how many sub-queries to generate and what dimensions to decompose along. For a general-purpose research tool that feels brittle.
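
When the dimension is known for a given query shape, the decomposition itself is straightforward; the brittleness only bites when the dimension has to be inferred. A sketch for the state-comparison case, with a hypothetical `retrieve` call and the enumeration hardcoded:

```python
GERMAN_STATES = [
    "Baden-Württemberg", "Bayern", "Berlin", "Brandenburg", "Bremen",
    "Hamburg", "Hessen", "Mecklenburg-Vorpommern", "Niedersachsen",
    "Nordrhein-Westfalen", "Rheinland-Pfalz", "Saarland", "Sachsen",
    "Sachsen-Anhalt", "Schleswig-Holstein", "Thüringen",
]

def decompose_by_state(question, retrieve, k_per_state=3):
    """One sub-query per federal state, results tagged by state so the
    synthesis step can see which states are actually covered."""
    evidence = {}
    for state in GERMAN_STATES:
        evidence[state] = retrieve(f"{question} ({state})", k=k_per_state)
    return evidence
```

Tagging evidence per state also makes coverage gaps explicit: an empty list for a state is a signal the answer is partial, which raising k never gives you.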

The negative knowledge problem.

When someone asks "do we have any guidance on employee monitoring" and the answer is genuinely no, the system can't cleanly say that. It retrieves whatever chunks are least irrelevant, and the LLM synthesizes something from them anyway. The user gets a confident-sounding answer about a tangentially related topic instead of a straightforward "this isn't covered in your knowledge base."

I've tried similarity score thresholds as a gate, but the problem is there's no clean boundary. A legitimate but unusual query might have low similarity scores. A genuinely off-topic query might match some chunks reasonably well because of shared vocabulary. Every threshold I've tested filters out either too much or too little. The prompt instruction to admit uncertainty helps maybe 60% of the time; the other 40% of the time, the model just reaches.
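
One pattern that can work better than a raw score cutoff is a two-stage gate: a deliberately loose similarity floor, then a separate, narrowly scoped LLM check on whether the surviving chunks actually answer the question, rather than hoping the answering prompt volunteers uncertainty. A sketch, where `llm_judge`, the chunk objects, and the 0.25 floor are all assumptions, not a description of the author's setup:

```python
def should_answer(question, chunks, llm_judge, min_score=0.25):
    """Two-stage abstention gate: loose similarity floor, then an
    explicit yes/no relevance judgment before any answer is generated."""
    survivors = [c for c in chunks if c.score >= min_score]
    if not survivors:
        return False  # nothing even loosely relevant retrieved

    context = "\n\n".join(c.text for c in survivors)
    verdict = llm_judge(
        "Do these excerpts contain guidance that directly answers the "
        f"question?\nQuestion: {question}\nExcerpts:\n{context}\n"
        "Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```

The separate judgment call matters because it is a different task from answering: a model that will happily synthesize from weak context is often willing to say "no" when that is the only thing being asked.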

The timeline problem.

Questions like "how did the interpretation of X change after the 2023 ruling" require the system to find pre-ruling documents, find post-ruling documents, understand the temporal relationship, and construct a comparative narrative. The metadata has document dates. The prompt says to respect temporal ordering. But the model struggles to build a coherent before/after story when the retrieved chunks don't explicitly reference each other. It tends to either merge everything into one flat answer or just cite the newer source and ignore the older interpretation.

This feels like it needs a fundamentally different retrieval approach (maybe temporal filtering at the search level, or separate retrievals for different time periods) rather than more prompt engineering.
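
A minimal version of the period-split idea, assuming chunk metadata carries a document date and the vector store supports metadata filters (the filter syntax here is hypothetical):

```python
from datetime import date

def temporal_retrieve(question, retrieve, ruling_date, k=6):
    """Two metadata-filtered retrievals, one per period, so the generator
    gets explicitly labeled pre- and post-ruling evidence instead of one
    undifferentiated chunk list."""
    return {
        "pre_ruling": retrieve(question, k=k,
                               filter={"doc_date_lt": ruling_date}),
        "post_ruling": retrieve(question, k=k,
                                filter={"doc_date_gte": ruling_date}),
    }

# e.g. evidence = temporal_retrieve(q, retrieve, ruling_date=date(2023, 6, 1))
# (illustrative date only)
```

The generation prompt can then ask for an explicit comparison between the two labeled buckets, which sidesteps the model having to reconstruct the timeline from document dates alone. Extracting the pivot date from the question is its own problem, of course.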

I've been reading about graph RAG approaches, agentic retrieval loops, and multi-hop reasoning chains, but most of the literature reports benchmarks on synthetic datasets, not production implementations. If anyone has actually deployed solutions for any of these three patterns, I'd really like to hear what worked and what didn't. Especially interested in approaches that don't require restructuring the entire pipeline.

submitted by /u/Fabulous-Pea-5366