Lessons from deploying RAG bots for regulated industries

Reddit r/LocalLLaMA / 3/29/2026

💬 OpinionDeveloper Stack & InfrastructureIdeas & Deep AnalysisTools & Practical Usage

Key Points

  • The article argues that retrieval quality in RAG deployments improves more from query expansion (generating multiple alternative phrasings and merging results) than from fine-tuning chunk size.
  • It recommends adding a “source boost” mechanism to force-include chunks from named or title-matched documents, ensuring users get authoritative policy answers even when semantic similarity is imperfect.
  • It describes a layered prompt architecture for regulated settings, with immutable Layer 1 safety/security rules that cannot be overridden by vertical persona prompts or client-specific instructions.
  • The author reports that local embeddings using sentence-transformers (MiniLM-L6-v2) with a local vector DB can be sufficient for domain Q&A, yielding cost/latency benefits while the LLM does much of the effective work.
  • For operations and compliance isolation, the piece concludes that using one dedicated vector store (“one droplet per client”) reduces cross-contamination and key/collection management overhead compared with shared infrastructure.

Built a RAG-powered AI assistant for Australian workplace compliance use cases. Deployed it across construction sites, aged care facilities, and mining operations. Here's what I learned the hard way:

  1. Query expansion matters more than chunk size

Everyone obsesses over chunk size (400 words? 512 tokens?). The real win was generating 4 alternative phrasings of each query via Haiku, running all 4 against ChromaDB, then merging and deduplicating results. Retrieval quality jumped noticeably — especially for domain-specific jargon where users phrase things differently than document authors.

  1. Source boost for named documents

If a user's query contains words that match an indexed document title, force-include chunks from that doc regardless of semantic similarity. "What does our FIFO policy say about R&R flights?" should always pull from the FIFO policy — not just semantically similar chunks that happen to mention flights.

  1. Layer your prompts — don't let clients break Layer 1

Three-layer system: core security/safety rules (immutable), vertical personality (swappable per industry), client custom instructions (additive only). Clients cannot override Layer 1 via their custom instructions. Saved me from "ignore previous instructions" attacks and clients accidentally jailbreaking their own bots.

  1. Local embeddings are good enough

sentence-transformers all-MiniLM-L6-v2 running locally on ChromaDB. No external embedding API. For document Q&A in a specific domain, it performs close enough to ada-002 that the cost and latency savings are worth it. The LLM quality (Claude Haiku) is doing more work than the embeddings anyway.

  1. One droplet per client

Tried shared infrastructure first. The operational overhead of keeping ChromaDB collections isolated, managing API keys, and preventing cross-contamination was worse than just spinning a $6/mo VM per client. Each client owns their vector store. Their documents never touch shared infrastructure.

Happy to share code — RAG engine is on GitHub if anyone wants to pick it apart.

submitted by /u/Neoprince86
[link] [comments]