RAG Systems in Production: Building Enterprise Knowledge Search
Introduction
Retrieval-Augmented Generation (RAG) has revolutionized how enterprises build intelligent knowledge systems. By combining the power of large language models with domain-specific knowledge, RAG systems can answer questions, synthesize information, and provide insights that pure LLMs cannot achieve alone.
At Groovy Web, we've built and deployed RAG systems for Fortune 500 companies, helping them unlock the value of their organizational knowledge. This guide captures everything we've learned from building production RAG systems that serve millions of queries per month.
Table of Contents
- Understanding RAG Systems
- System Architecture
- Vector Database Selection
- Embedding Strategies
- Chunking Techniques
- Retrieval Optimization
- Generation and Synthesis
- Evaluation and Quality Assurance
- Scaling Considerations
- Production Deployment
- Monitoring and Observability
- Real-World Implementation
Understanding RAG Systems
What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that enhances large language models by retrieving relevant context from a knowledge base before generating responses.
Without RAG:
User Question → LLM → Answer (Limited to training data)
With RAG:
User Question → Retrieve Relevant Documents → LLM + Context → Answer (Grounded in knowledge base)
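The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline described later; `search` and `llm` are hypothetical stand-ins for a real vector store query and model client:

```python
def answer_with_rag(question: str, search, llm, top_k: int = 3) -> str:
    """Minimal RAG loop: retrieve context, then generate a grounded answer."""
    # 1. Retrieve the top-k most relevant passages for the question
    passages = search(question, top_k)
    # 2. Assemble a prompt that grounds the model in the retrieved context
    context = "\n\n".join(passages)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate the final answer
    return llm(prompt)
```

Everything that follows in this guide elaborates the two calls in this sketch: `search` becomes a full ingestion + retrieval stack, and `llm` becomes a prompt-building and synthesis layer.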
Why RAG for Enterprise?
1. Domain-Specific Knowledge
LLMs are trained on public internet data, but enterprises have proprietary information:
- Internal documentation
- Product specifications
- Customer interactions
- Research papers
- Compliance documents
RAG systems enable LLMs to access this private knowledge.
2. Reduced Hallucinations
By grounding responses in retrieved documents, RAG systems:
- Cite sources
- Provide verifiable information
- Reduce false claims
- Build user trust
3. Cost-Effective
Compared to fine-tuning:
- No model training required
- Easy to update knowledge base
- Lower infrastructure costs
- Faster time to production
4. Transparency and Compliance
RAG systems provide:
- Source attribution
- Audit trails
- Compliance with regulations
- Explainable AI
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Instant (add to database) | Requires retraining |
| Cost | Low ($/query) | High (training costs) |
| Domain specificity | High (source data) | Medium (pattern learning) |
| Hallucination risk | Low (grounded) | Medium (model-based) |
| Transparency | High (citations) | Low (black box) |
| Setup time | Days to weeks | Weeks to months |
| Maintenance | Ongoing indexing | Periodic retraining |
Best Use Cases for RAG:
- Knowledge search and Q&A
- Document analysis
- Customer support automation
- Research assistance
- Compliance and legal review
Best Use Cases for Fine-Tuning:
- Style and tone customization
- Format standardization
- Domain-specific reasoning
- Specialized instruction following
System Architecture
End-to-End RAG Pipeline
┌─────────────────────────────────────────────────────────────┐
│ KNOWLEDGE BASE │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Documents │ │ Vectors │ │ Metadata │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Ingestion Pipeline
▼
┌─────────────────────────────────────────────────────────────┐
│ PROCESSING LAYER │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Chunk │→│ Embed │→│ Index │→│ Store │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Query
▼
┌─────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Query │→│ Semantic │→│ Hybrid │ │
│ │ Embedding │ │ Search │ │ Search │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Rerank & │ │
│ │ Filter │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Context
▼
┌─────────────────────────────────────────────────────────────┐
│ GENERATION LAYER │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Prompt │→│ LLM │→│ Response │ │
│ │ Building │ │ Inference │ │ Synthesis │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
User Response
Component Breakdown
1. Ingestion Pipeline
# ingestion/pipeline.py
from typing import List, Dict
from pathlib import Path
import hashlib

class DocumentIngestionPipeline:
    """Process and ingest documents into knowledge base"""

    def __init__(self, config: Dict):
        self.chunker = DocumentChunker(config['chunking'])
        self.embedder = EmbeddingGenerator(config['embeddings'])
        self.vector_store = VectorStore(config['vector_db'])

    async def ingest_document(self, document: Dict) -> List[str]:
        """
        Ingest a document into the knowledge base

        Returns: List of chunk IDs
        """
        # 1. Extract text and metadata
        text = document['content']
        metadata = {
            'title': document['title'],
            'source': document['source'],
            'author': document.get('author'),
            'created_at': document.get('created_at'),
            'doc_type': document.get('type', 'unknown'),
            'language': document.get('language', 'en')
        }

        # 2. Split into chunks
        chunks = self.chunker.chunk(text)

        # 3. Generate embeddings
        chunk_texts = [chunk['text'] for chunk in chunks]
        embeddings = await self.embedder.generate_batch(chunk_texts)

        # 4. Prepare records for storage
        records = []
        for chunk, embedding in zip(chunks, embeddings):
            record = {
                'id': self._generate_chunk_id(document['id'], chunk['index']),
                'document_id': document['id'],
                'text': chunk['text'],
                'embedding': embedding,
                'metadata': {
                    **metadata,
                    'chunk_index': chunk['index'],
                    'chunk_size': len(chunk['text']),
                    'start_char': chunk['start'],
                    'end_char': chunk['end']
                }
            }
            records.append(record)

        # 5. Store in vector database
        chunk_ids = await self.vector_store.insert(records)
        return chunk_ids

    def _generate_chunk_id(self, doc_id: str, chunk_index: int) -> str:
        """Generate unique chunk ID"""
        hash_input = f"{doc_id}_{chunk_index}"
        return hashlib.sha256(hash_input.encode()).hexdigest()[:32]
2. Retrieval Engine
# retrieval/engine.py
from typing import List, Dict, Optional
import numpy as np

class RetrievalEngine:
    """Retrieve relevant documents for queries"""

    def __init__(self, vector_store, embedder, config: Dict):
        self.vector_store = vector_store
        self.embedder = embedder
        self.config = config
        self.reranker = Reranker(config.get('reranking'))

    async def retrieve(
        self,
        query: str,
        top_k: int = 10,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Retrieve relevant chunks for a query

        Args:
            query: User query
            top_k: Number of results to return
            filters: Metadata filters (e.g., {category: 'technology'})

        Returns:
            List of retrieved chunks with scores
        """
        # 1. Generate query embedding
        query_embedding = await self.embedder.generate(query)

        # 2. Semantic search
        results = await self.vector_store.similarity_search(
            query_embedding,
            top_k=top_k * 2,  # Retrieve more for reranking
            filters=filters
        )

        # 3. Rerank if configured
        if self.reranker and len(results) > top_k:
            results = await self.reranker.rerank(query, results, top_k)

        return results[:top_k]

    async def retrieve_with_hybrid_search(
        self,
        query: str,
        top_k: int = 10,
        alpha: float = 0.5,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Hybrid retrieval combining semantic and keyword search

        Args:
            query: User query
            top_k: Number of results
            alpha: Weight for semantic search (0-1)
            filters: Metadata filters

        Returns:
            Reranked combined results
        """
        # 1. Semantic search
        semantic_results = await self.vector_store.similarity_search(
            await self.embedder.generate(query),
            top_k=top_k * 2,
            filters=filters
        )

        # 2. Keyword search
        keyword_results = await self.vector_store.keyword_search(
            query,
            top_k=top_k * 2,
            filters=filters
        )

        # 3. Combine and rerank
        combined = self._combine_results(
            semantic_results,
            keyword_results,
            alpha
        )

        # 4. Rerank combined results
        if self.reranker:
            combined = await self.reranker.rerank(query, combined, top_k)

        return combined[:top_k]
    def _combine_results(
        self,
        semantic_results: List[Dict],
        keyword_results: List[Dict],
        alpha: float
    ) -> List[Dict]:
        """Combine semantic and keyword search results"""
        def normalize(results: List[Dict]) -> Dict[str, float]:
            """Min-max normalize scores, keyed by result ID"""
            if not results:
                return {}
            scores = np.array([r['score'] for r in results])
            score_range = scores.max() - scores.min()
            if score_range == 0:
                return {r['id']: 1.0 for r in results}
            normalized = (scores - scores.min()) / score_range
            return {r['id']: float(s) for r, s in zip(results, normalized)}

        sem_scores = normalize(semantic_results)
        key_scores = normalize(keyword_results)

        # Merge, weighting semantic vs keyword scores by alpha.
        # A result missing from one list contributes 0 for that component.
        seen = set()
        combined = []
        for result in semantic_results + keyword_results:
            if result['id'] not in seen:
                seen.add(result['id'])
                result['combined_score'] = (
                    alpha * sem_scores.get(result['id'], 0.0)
                    + (1 - alpha) * key_scores.get(result['id'], 0.0)
                )
                combined.append(result)

        combined.sort(key=lambda x: x['combined_score'], reverse=True)
        return combined
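Min-max score fusion is sensitive to outlier scores and assumes the two score distributions are comparable after normalization. A common alternative (not used in the engine above, but widely applied in hybrid search) is Reciprocal Rank Fusion, which combines rankings instead of raw scores. A minimal sketch:

```python
from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each item scores sum(1 / (k + rank)) across lists."""
    scores: Dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

With `k = 60` (the constant from the original RRF paper), items that rank highly in both lists rise to the top, and no score normalization is needed at all.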
3. Response Generator
# generation/generator.py
from typing import List, Dict, Optional
import openai

class ResponseGenerator:
    """Generate responses using retrieved context"""

    def __init__(self, config: Dict):
        self.client = openai.AsyncClient(api_key=config['api_key'])
        self.model = config['model']
        self.temperature = config.get('temperature', 0.3)
        self.max_tokens = config.get('max_tokens', 1000)

    async def generate_response(
        self,
        query: str,
        context: List[Dict],
        conversation_history: Optional[List[Dict]] = None
    ) -> Dict:
        """
        Generate response using retrieved context

        Args:
            query: User query
            context: Retrieved chunks
            conversation_history: Previous messages (for chat)

        Returns:
            Generated response with citations
        """
        # 1. Build prompt with context
        prompt = self._build_prompt(query, context)

        # 2. Generate response
        messages = self._build_messages(prompt, conversation_history)
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens
        )

        # 3. Extract response and citations
        answer = response.choices[0].message.content
        citations = self._extract_citations(response, context)

        return {
            'answer': answer,
            'citations': citations,
            'sources': self._get_unique_sources(context),
            'model': self.model,
            'tokens_used': response.usage.total_tokens
        }

    def _build_prompt(self, query: str, context: List[Dict]) -> str:
        """Build prompt with context"""
        context_str = "\n\n".join([
            f"[Source {i+1}]\n{chunk['text']}"
            for i, chunk in enumerate(context)
        ])

        prompt = f"""You are a helpful assistant that answers questions based on the provided context.

Context:
{context_str}

Question: {query}

Instructions:
1. Answer the question using only the provided context
2. If the context doesn't contain enough information, say so
3. Cite sources using [Source X] notation
4. Be concise and accurate
5. If asked for sources, provide them

Answer:"""
        return prompt
    def _build_messages(
        self,
        prompt: str,
        history: Optional[List[Dict]] = None
    ) -> List[Dict]:
        """Build message list for API"""
        messages = []
        if history:
            messages.extend(history)
        messages.append({
            "role": "user",
            "content": prompt
        })
        return messages

    def _extract_citations(
        self,
        response: "openai.types.chat.ChatCompletion",
        context: List[Dict]
    ) -> List[Dict]:
        """Extract citations from response"""
        answer = response.choices[0].message.content

        # Find source references like [Source 1], [Source 2], etc.
        import re
        citations = re.findall(r'\[Source (\d+)\]', answer)

        # Map to actual source chunks
        unique_citations = []
        for citation in set(citations):
            idx = int(citation) - 1  # Convert to 0-based index
            if idx < len(context):
                unique_citations.append({
                    'index': int(citation),
                    'chunk_id': context[idx]['id'],
                    'document_id': context[idx]['metadata']['document_id'],
                    'title': context[idx]['metadata']['title'],
                    'source': context[idx]['metadata']['source']
                })
        return unique_citations

    def _get_unique_sources(self, context: List[Dict]) -> List[Dict]:
        """Get unique sources from context"""
        seen = set()
        sources = []
        for chunk in context:
            doc_id = chunk['metadata']['document_id']
            if doc_id not in seen:
                seen.add(doc_id)
                sources.append({
                    'document_id': doc_id,
                    'title': chunk['metadata']['title'],
                    'source': chunk['metadata']['source'],
                    'author': chunk['metadata'].get('author'),
                    'created_at': chunk['metadata'].get('created_at')
                })
        return sources
Vector Database Selection
Comparison Matrix
| Database | Open Source | Cloud Managed | Performance | Scalability | Features | Cost |
|---|---|---|---|---|---|---|
| pgvector | ✅ | ✅ (Supabase, etc.) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Relational DB + vectors | $ |
| Pinecone | ❌ | ✅ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Purpose-built, easy | $$$ |
| Weaviate | ✅ | ✅ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | GraphQL, multi-modal | $$ |
| Qdrant | ✅ | ✅ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Filter optimization, hybrid | $$ |
| Milvus | ✅ | ✅ (Zilliz) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Distributed, cloud-native | $$ |
| Chroma | ✅ | ❌ | ⭐⭐⭐ | ⭐⭐⭐ | Simple, embedded | Free |
Selection Criteria
Choose pgvector if:
- Already using PostgreSQL
- Need ACID transactions
- Want to minimize infrastructure
- Budget-conscious
- Need SQL joins with vector search
Choose Pinecone if:
- Want fully managed solution
- Need auto-scaling
- Prioritize ease of setup
- Have budget for managed service
- Want fastest time to production
Choose Qdrant if:
- Need advanced filtering
- Want hybrid search capabilities
- Require high performance
- Prefer open-source with managed option
Choose Weaviate if:
- Need multi-modal search (text + image)
- Want GraphQL API
- Require modular architecture
- Building knowledge graphs
Our Choice: pgvector
We recommend pgvector for most enterprise RAG systems because:
1. Unified Data Model
-- Single query for vectors + metadata
SELECT
d.title,
d.content,
d.metadata->>'category' as category,
1 - (d.embedding <=> query_embedding) as similarity
FROM documents d
JOIN document_tags dt ON d.id = dt.document_id
WHERE d.status = 'published'
AND dt.tag_id = ANY(SELECT id FROM tags WHERE name IN ('AI', 'ML'))
AND d.created_at > NOW() - INTERVAL '1 year'
ORDER BY d.embedding <=> query_embedding
LIMIT 20;
2. Cost Effective
- No separate vector database needed
- Use existing PostgreSQL infrastructure
- Self-hosted option available
- Often far cheaper than managed vector-database alternatives
3. Mature Ecosystem
- Backup/restore tools
- Replication and HA
- Monitoring and observability
- ORM support (SQLAlchemy, Django ORM)
4. Performance
-- With proper indexing
CREATE INDEX idx_documents_embedding_hnsw ON documents
USING hnsw(embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Query performance: 15-30ms for 1M vectors
Embedding Strategies
Model Selection
| Model | Dimensions | Context Length | Speed | Quality | Cost/1M tokens |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 | Fast | Good | $0.02 |
| text-embedding-3-large | 3072 | 8191 | Medium | Excellent | $0.13 |
| text-embedding-ada-002 | 1536 | 8191 | Fast | Good | $0.10 |
| bge-large-en-v1.5 | 1024 | 512 | Fast | Excellent | Free (self-hosted) |
| e5-large-v2 | 1024 | 512 | Fast | Very Good | Free (self-hosted) |
Recommendation
For most enterprise use cases: text-embedding-3-small
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536  # Can truncate to 512 for faster search
)
Why?
- Best price/performance ratio
- Good quality for most domains
- Long context (8191 tokens)
- Multi-language support
- Lower storage costs
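On the truncation point above: the `dimensions` parameter works because the text-embedding-3 models were reportedly trained with a Matryoshka-style objective, so embeddings can be shortened client-side as well, provided you renormalize afterwards. A pure-NumPy sketch (no API call involved):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, target_dim: int = 512) -> np.ndarray:
    """Keep the first target_dim components and renormalize to unit length."""
    truncated = embedding[:target_dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated
```

Cosine similarity over the truncated, renormalized vectors then approximates similarity over the full embeddings, at a fraction of the storage and search cost.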
For specialized domains: Open-source models (self-hosted)
# For legal/medical/technical content
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(texts)
Embedding Optimization
1. Dimensionality Reduction
# Reduce from 1536 to 512 dimensions (faster search, lower storage)
import numpy as np
from sklearn.decomposition import PCA

def fit_dimension_reducer(embeddings: np.ndarray, target_dim: int = 512) -> PCA:
    """Fit a PCA reducer on corpus embeddings (fit once, reuse for queries)"""
    pca = PCA(n_components=target_dim)
    pca.fit(embeddings)
    return pca

# Usage: fit on the corpus once, then apply the SAME transform to
# both document and query embeddings
full_embeddings = np.array([...])  # (N, 1536)
reducer = fit_dimension_reducer(full_embeddings, 512)
reduced_embeddings = reducer.transform(full_embeddings)
Trade-offs:
- 1536 dims: Best quality, slower search
- 768 dims: Good balance
- 512 dims: Faster search, slight quality loss
- 256 dims: Fastest search, noticeable quality loss
2. Hybrid Embeddings
# Combine semantic and keyword embeddings
from typing import Dict, List
import numpy as np
from langchain_openai import OpenAIEmbeddings
from pinecone_text.sparse import BM25Encoder

class HybridEmbedding:
    def __init__(self):
        self.semantic_model = OpenAIEmbeddings(model="text-embedding-3-small")
        self.bm25 = BM25Encoder()

    def embed_documents(self, texts: List[str]) -> Dict[str, np.ndarray]:
        """Generate both semantic and keyword embeddings"""
        semantic = self.semantic_model.embed_documents(texts)
        keyword = self.bm25.encode_documents(texts)
        return {
            'semantic': np.array(semantic),
            'keyword': np.array(keyword)
        }
3. Query Expansion
# Expand queries with related terms for better retrieval
from typing import List

async def expand_query(query: str, llm) -> List[str]:
    """Generate query variations"""
    prompt = f"""Generate 3-5 alternative queries for: "{query}"

Consider:
- Synonyms
- Related concepts
- Different phrasings
- Broader/narrower terms

Return one query per line."""

    response = await llm.generate(prompt)
    variations = [line.strip() for line in response.split('\n') if line.strip()]
    return [query] + variations

# Usage
query_variations = await expand_query("How to implement RAG?", llm)
# Returns, for example: [
#     "How to implement RAG?",
#     "Building retrieval-augmented generation systems",
#     "RAG implementation guide",
#     "Creating RAG applications",
#     "RAG system architecture"
# ]
Chunking Techniques
Why Chunking Matters
Chunking is one of the most consequential design decisions in a RAG system:
- Too small → Loss of context
- Too large → Noisy retrieval, slow generation
- Poor boundaries → Fragmented information
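The size/overlap trade-off also has a direct cost implication: with chunk size s and overlap o, each window advances by s - o characters, so a document of length n yields roughly ceil((n - o) / (s - o)) chunks, each of which must be embedded and stored. A quick sanity check with hypothetical numbers:

```python
import math

def estimate_chunk_count(doc_length: int, chunk_size: int, overlap: int) -> int:
    """Approximate number of fixed-size chunks for a document."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return max(1, math.ceil((doc_length - overlap) / step))

# A 100k-character document at 1000-char chunks with 200-char overlap:
print(estimate_chunk_count(100_000, 1000, 200))  # → 125
```

Halving the chunk size roughly doubles embedding and storage costs, which is worth keeping in mind when comparing the strategies below.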
Chunking Strategies
1. Fixed-Size Chunking
# chunking/fixed_size.py
from typing import List, Dict

class FixedSizeChunker:
    """Split text into fixed-size chunks"""

    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str) -> List[Dict]:
        """Split text into chunks"""
        chunks = []
        start = 0
        chunk_index = 0

        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end]

            chunks.append({
                'text': chunk_text,
                'index': chunk_index,
                'start': start,
                'end': end,
                'size': len(chunk_text)
            })
            chunk_index += 1
            start = end - self.overlap

        return chunks

# Pros: Simple, predictable
# Cons: May split sentences, loses context
2. Sentence-Based Chunking
# chunking/sentence.py
import re
from typing import List, Dict

class SentenceChunker:
    """Split text into sentence-based chunks"""

    def __init__(self, sentences_per_chunk: int = 5, overlap: int = 1):
        self.sentences_per_chunk = sentences_per_chunk
        self.overlap = overlap

    def chunk(self, text: str) -> List[Dict]:
        """Split text into sentence-based chunks"""
        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', text)

        chunks = []
        chunk_index = 0
        i = 0
        while i < len(sentences):
            # Get sentences for this chunk
            end = min(i + self.sentences_per_chunk, len(sentences))
            chunk_sentences = sentences[i:end]
            chunk_text = ' '.join(chunk_sentences)

            # Note: find() returns the first occurrence, so offsets can be
            # wrong if an identical sentence appears earlier in the document
            start_char = text.find(chunk_sentences[0])
            end_char = start_char + len(chunk_text)

            chunks.append({
                'text': chunk_text,
                'index': chunk_index,
                'start': start_char,
                'end': end_char,
                'size': len(chunk_text),
                'sentence_count': len(chunk_sentences)
            })
            chunk_index += 1
            i += self.sentences_per_chunk - self.overlap

        return chunks

# Pros: Preserves sentence boundaries, better context
# Cons: Variable chunk sizes, may be too short/long
3. Semantic Chunking (Recommended)
# chunking/semantic.py
from typing import List, Dict
import re
import numpy as np

class SemanticChunker:
    """Split text into semantically coherent chunks"""

    def __init__(self, embedder, max_chunk_size: int = 1500, threshold: float = 0.7):
        self.embedder = embedder
        self.max_chunk_size = max_chunk_size
        self.threshold = threshold

    async def chunk(self, text: str) -> List[Dict]:
        """Split text into semantic chunks"""
        # 1. Split into sentences
        sentences = self._split_sentences(text)

        # 2. Generate embeddings for each sentence
        sentence_embeddings = await self.embedder.embed_documents(sentences)

        # 3. Calculate similarities between consecutive sentences
        similarities = [
            self._cosine_similarity(sentence_embeddings[i], sentence_embeddings[i+1])
            for i in range(len(sentence_embeddings) - 1)
        ]

        # 4. Identify chunk boundaries (where similarity drops below threshold)
        boundaries = [0]
        for i, sim in enumerate(similarities):
            if sim < self.threshold:
                boundaries.append(i + 1)
        boundaries.append(len(sentences))

        # 5. Create chunks
        chunks = []
        chunk_index = 0
        for i in range(len(boundaries) - 1):
            start_idx = boundaries[i]
            end_idx = boundaries[i+1]

            # Combine sentences in this segment
            chunk_sentences = sentences[start_idx:end_idx]
            chunk_text = ' '.join(chunk_sentences)

            # Further split if chunk is too long
            if len(chunk_text) > self.max_chunk_size:
                sub_chunks = self._split_long_chunk(chunk_text, self.max_chunk_size)
                for sub_chunk in sub_chunks:
                    chunks.append({
                        'text': sub_chunk,
                        'index': chunk_index,
                        'type': 'semantic'
                    })
                    chunk_index += 1
            else:
                chunks.append({
                    'text': chunk_text,
                    'index': chunk_index,
                    'sentence_count': len(chunk_sentences),
                    'type': 'semantic'
                })
                chunk_index += 1

        return chunks

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences"""
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        """Calculate cosine similarity"""
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    def _split_long_chunk(self, text: str, max_size: int) -> List[str]:
        """Split long chunk into smaller pieces"""
        # Fallback to fixed-size splitting
        chunks = []
        start = 0
        while start < len(text):
            end = start + max_size
            chunks.append(text[start:end])
            start = end - 200  # Add overlap
        return chunks

# Pros: Semantically coherent, better retrieval
# Cons: Slower (requires embeddings), more complex
4. Hierarchical Chunking
# chunking/hierarchical.py
from typing import Dict, List

class HierarchicalChunker:
    """Create multi-level chunk hierarchy for different use cases"""

    def __init__(self, embedder):
        self.embedder = embedder

    async def chunk(self, text: str, document_id: str) -> Dict[str, List[Dict]]:
        """Create hierarchical chunks"""
        # Level 1: Document-level (for broad queries)
        doc_chunk = {
            'id': f"{document_id}_doc",
            'level': 'document',
            'text': text[:2000],  # Summary/first part
            'metadata': {'type': 'document_summary'}
        }

        # Level 2: Section-level (for medium queries)
        section_chunks = self._chunk_by_sections(text)

        # Level 3: Paragraph-level (for specific queries)
        paragraph_chunks = self._chunk_by_paragraphs(text)

        # Level 4: Sentence-level (for precise queries)
        sentence_chunks = self._chunk_by_sentences(text)

        return {
            'document': [doc_chunk],
            'sections': section_chunks,
            'paragraphs': paragraph_chunks,
            'sentences': sentence_chunks
        }