Got some good engagement on my earlier post "I made €2,700 building a RAG system for a law firm — here's what actually worked technically" so I wanted to go deeper into the actual architecture for anyone building something similar.
Shipped a RAG system for a German GDPR compliance company. Sharing the full stack because I haven't seen many production legal RAG breakdowns and I ran into problems that generic RAG tutorials don't cover.
The problem: legal research isn't just "find relevant text." Different sources have different legal weight. A Supreme Court ruling beats a lower court opinion. An official regulatory guideline beats a blog post. The system needs to know this hierarchy and use it when generating answers.
Here's how I solved it:
- Three retrieval strategies, selectable per query: Flat (standard RAG, all sources equal), Category Priority (sources grouped by authority tier, LLM resolves conflicts top-down), and Layered Category (independent search per category, so every authority level gets representation even if one category dominates similarity scores). Without the category priority approach, the system would sometimes build answers from lower-authority sources just because they had better semantic similarity to the query.
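The Layered Category idea can be sketched in a few lines. This is an illustrative toy, not the production code: the `Chunk` dataclass, the category names, and the `k_per_category` parameter are my own stand-ins, and the similarity scores would come from the vector search.

```python
# Sketch of the "Layered Category" strategy: run an independent top-k
# selection per authority tier so every tier is represented in the
# context, even when one tier dominates raw similarity scores.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    category: str   # authority tier, e.g. "court_ruling" (illustrative)
    score: float    # similarity score from the vector search

def layered_retrieve(scored_chunks, categories, k_per_category=2):
    """Take the top-k chunks from each category independently."""
    selected = []
    for cat in categories:   # categories ordered by legal authority
        in_cat = [c for c in scored_chunks if c.category == cat]
        in_cat.sort(key=lambda c: c.score, reverse=True)
        selected.extend(in_cat[:k_per_category])
    return selected

chunks = [
    Chunk("Art. 6 ruling", "court_ruling", 0.71),
    Chunk("EDPB guideline", "regulator_guidance", 0.93),
    Chunk("Blog commentary", "secondary", 0.95),
    Chunk("Another blog", "secondary", 0.94),
]
ctx = layered_retrieve(
    chunks, ["court_ruling", "regulator_guidance", "secondary"],
    k_per_category=1)
# One chunk per tier survives, even though both blog posts outscore
# the court ruling on raw similarity.
```

The point of the design: a flat top-k over these scores would return both blog posts before the ruling; the layered pass guarantees the ruling makes it into the prompt.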
- Custom chunking pipeline for legal documents. Nested clause structures, cross references between sections, footnotes that reference other documents. Built a chunker that preserves hierarchical depth and section relationships. Chunks get assembled into condensed "cheatsheets" before hitting the LLM. These are cached with deterministic hashing so repeated patterns skip regeneration.
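The cheatsheet cache with deterministic hashing might look roughly like this. The `summarise` callable stands in for the real LLM condensation step, and the in-memory dict stands in for whatever cache backend is actually used:

```python
# Sketch of the cheatsheet cache: condensed summaries are keyed by a
# deterministic hash of the chunk content, so repeated retrieval
# patterns skip regeneration entirely.
import hashlib
import json

_cheatsheet_cache = {}

def cache_key(chunks):
    # Deterministic: the same chunks in the same order always
    # produce the same key.
    payload = json.dumps(chunks, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_cheatsheet(chunks, summarise):
    key = cache_key(chunks)
    if key not in _cheatsheet_cache:
        _cheatsheet_cache[key] = summarise(chunks)  # expensive LLM call
    return _cheatsheet_cache[key]

calls = []
def fake_summarise(chunks):
    calls.append(1)   # count how often the "LLM" actually runs
    return " | ".join(chunks)

build_cheatsheet(["Art. 5", "Art. 6"], fake_summarise)
build_cheatsheet(["Art. 5", "Art. 6"], fake_summarise)  # cache hit
```

Hashing the serialized content rather than document IDs means any edit to a chunk automatically invalidates its cached cheatsheet.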
- Dual embedding support. AWS Bedrock Titan for production and local Ollama as a fallback, swappable from the admin panel without restarting the app. Embeddings are cached per provider-and-model combo with thread-safe locking, so switching models doesn't corrupt anything.
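A minimal sketch of that provider layer, assuming a cache keyed by (provider, model, text) behind a lock. The provider names, model names, and lambda stubs are placeholders for the real Bedrock and Ollama clients:

```python
# Sketch of the dual-provider embedding service: runtime-swappable
# provider, with a cache keyed per (provider, model, text) and a lock
# so switching providers can't interleave with in-flight lookups.
import threading

class EmbeddingService:
    def __init__(self, providers, active):
        self._providers = providers   # name -> (model_id, embed_fn)
        self._active = active
        self._cache = {}
        self._lock = threading.Lock()

    def switch(self, name):
        with self._lock:
            self._active = name       # no restart needed

    def embed(self, text):
        with self._lock:
            name = self._active
            model, fn = self._providers[name]
            key = (name, model, text)  # per provider+model, never mixed
            if key not in self._cache:
                self._cache[key] = fn(text)
            return self._cache[key]

svc = EmbeddingService(
    {"bedrock": ("titan-v2", lambda t: [float(len(t))]),   # stub clients
     "ollama": ("local-embed", lambda t: [-float(len(t))])},
    active="bedrock",
)
v1 = svc.embed("hello")
svc.switch("ollama")
v2 = svc.embed("hello")   # different provider -> separate cache entry
```

Keying the cache on the provider/model pair is the important bit: vectors from different models live in different spaces, so a shared cache would silently mix incompatible embeddings.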
- Metadata injection layer. After vector search every retrieved chunk gets enriched with full document metadata from the database in a single batched query. Region, category, framework, date, tags, and all user annotations attached to that document. This rides alongside the chunk content into the prompt.
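The enrichment step is conceptually simple: collect doc IDs, do one batched lookup, merge. Here's a sketch with a dict standing in for the PostgreSQL query; the field names mirror the ones mentioned above but the helper names are mine:

```python
# Sketch of the metadata injection layer: after vector search, fetch
# metadata for all retrieved document ids in a single batched query
# and attach it to each chunk before prompt assembly.
def fetch_metadata_batch(doc_ids):
    # In production this would be one SQL round trip, e.g.
    #   SELECT ... FROM documents WHERE id = ANY(%(ids)s)
    db = {
        1: {"region": "EU", "category": "regulator_guidance",
            "framework": "GDPR", "tags": ["DPIA"]},
        2: {"region": "DE", "category": "court_ruling",
            "framework": "GDPR", "tags": []},
    }
    return {i: db[i] for i in set(doc_ids) if i in db}

def enrich(chunks):
    # One batched lookup for every chunk, not one query per chunk.
    meta = fetch_metadata_batch([c["doc_id"] for c in chunks])
    return [{**c, "metadata": meta.get(c["doc_id"], {})} for c in chunks]

enriched = enrich([
    {"doc_id": 2, "text": "Court ruling excerpt"},
    {"doc_id": 1, "text": "Regulator guideline excerpt"},
])
```

Batching matters here because a top-k of 20 chunks done naively is 20 sequential queries on the hot path of every question.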
- Bilingual with hard language enforcement. Regex-based detection identifies German vs English in the query. The prompt forces output in the detected language and explicitly blocks drifting into French or other languages. This actually happens more than you'd think when source documents are multilingual.
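A regex language gate for a two-language case can be as simple as scoring signal characters and function words. The word lists and the instruction wording below are illustrative, not the production patterns:

```python
# Sketch of the regex-based language gate: score German signals
# (umlauts, ß, common function words) against English function words
# and pick whichever wins, then hard-code the output language in the
# prompt.
import re

GERMAN = re.compile(
    r"[äöüß]|\b(und|der|die|das|nicht|ist|für|wie|welche)\b", re.I)
ENGLISH = re.compile(
    r"\b(the|and|is|what|which|for|how|not)\b", re.I)

def detect_language(query: str) -> str:
    de = len(GERMAN.findall(query))
    en = len(ENGLISH.findall(query))
    return "de" if de >= en else "en"

def language_instruction(query: str) -> str:
    name = {"de": "German", "en": "English"}[detect_language(query)]
    return (f"Answer ONLY in {name}. Do not switch to any other "
            f"language, even if the source documents are in a "
            f"different language.")
```

For exactly two known languages this beats pulling in a detection library: it's fast, has no dependencies, and its failure modes are easy to inspect.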
- Source citation engineering. Probably 40% of my prompt engineering time went here. The prompts contain explicit "NEVER do X" instructions for every lazy citation pattern I caught during testing. No "according to professional literature" without naming the document. Must cite exact document titles, exact court names, exact article numbers. For legal use, vague attribution is worthless.
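To make that concrete, here's the shape of such a rules block. The exact wording is illustrative; the pattern is one explicit negative constraint per lazy citation behaviour observed in testing:

```python
# Sketch of the citation rules section of the system prompt: explicit
# positive requirements plus a "NEVER" rule for each observed failure
# mode. Wording is illustrative, not the production prompt.
CITATION_RULES = """\
Citation requirements:
- ALWAYS cite the exact document title, court name, and article number.
- NEVER write "according to professional literature" or any other vague
  attribution without naming the specific document.
- NEVER cite a source that is not present in the provided context.
- If no source in the context supports a claim, say so explicitly.
"""

def build_system_prompt(base_instructions: str) -> str:
    # Citation rules are appended to every request, not left to chance.
    return base_instructions + "\n\n" + CITATION_RULES
```

Enumerating failure modes individually works better than a single "cite properly" instruction, because each banned pattern was one the model actually produced.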
- Streaming with optional simplification pass. Answers stream via SSE. A second LLM pass can intercept the completed stream, rewrite the full legal analysis in plain language, then stream the simplified version as separate tokens. Adds latency, but non-lawyers needed plain-language explanations of complex GDPR obligations.
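The two-pass flow can be sketched as a generator: buffer the legal answer while streaming it, then feed the complete buffer into a second pass and stream that too. The generators below stand in for the LLM calls; in FastAPI the yielded events would feed an SSE response:

```python
# Sketch of the streaming flow with an optional simplification pass:
# stream the legal analysis token by token while buffering it, then
# run a second pass over the completed answer and stream the
# plain-language version under a separate event type.
def stream_answer(legal_tokens, simplify=None):
    buffer = []
    for tok in legal_tokens:           # pass 1: the legal analysis
        buffer.append(tok)
        yield {"event": "answer", "data": tok}
    if simplify is not None:           # pass 2: needs the full answer
        for tok in simplify("".join(buffer)):
            yield {"event": "simplified", "data": tok}

def fake_simplify(text):               # stand-in for the second LLM call
    yield "In plain terms: "
    yield text.lower()

events = list(stream_answer(
    ["Art. 6(1)(f) ", "permits processing."],
    simplify=fake_simplify))
```

Separate event types let the frontend render the legal answer and the simplified version as distinct sections instead of one concatenated blob, and the added latency only applies when simplification is requested.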
Stack: FastAPI backend, AWS Bedrock with Claude for generation, Bedrock Titan for embeddings with Ollama as local fallback, FAISS for vector search, PostgreSQL for document metadata and comments. Deployed in EU region for GDPR compliance of the tool itself.
€2,700 for the complete build. Now in conversations about recurring monthly maintenance. Biggest lesson: domain-specific RAG is 80% prompt engineering and metadata architecture, 20% retrieval. Making the LLM behave like a legal professional who respects authority hierarchies and cites sources properly was the real work.
Happy to answer questions if anyone is building something similar or thinking about going into professional services RAG.