RAG Systems in Production: Building Enterprise Knowledge Search

Dev.to / 4/23/2026


Key Points

  • The article explains how Retrieval-Augmented Generation (RAG) enhances large language models by retrieving relevant enterprise knowledge before generating answers.
  • It outlines why RAG is especially valuable for enterprises, including access to private domain documents, reduced hallucinations, lower costs compared with fine-tuning, and improved transparency for compliance.
  • It presents a production-focused guide structure covering core system design areas such as architecture, vector database and embedding selection, chunking and retrieval optimization, and generation/synthesis.
  • It emphasizes engineering disciplines needed for real-world deployment, including evaluation/quality assurance, scaling, production rollout, and monitoring/observability, plus real-world implementation learnings from Groovy Web.


Introduction

Retrieval-Augmented Generation (RAG) has revolutionized how enterprises build intelligent knowledge systems. By combining the power of large language models with domain-specific knowledge, RAG systems can answer questions, synthesize information, and provide insights that pure LLMs cannot achieve alone.

At Groovy Web, we've built and deployed RAG systems for Fortune 500 companies, helping them unlock the value of their organizational knowledge. This guide captures everything we've learned from building production RAG systems that serve millions of queries per month.

Table of Contents

  1. Understanding RAG Systems
  2. System Architecture
  3. Vector Database Selection
  4. Embedding Strategies
  5. Chunking Techniques
  6. Retrieval Optimization
  7. Generation and Synthesis
  8. Evaluation and Quality Assurance
  9. Scaling Considerations
  10. Production Deployment
  11. Monitoring and Observability
  12. Real-World Implementation

Understanding RAG Systems

What is RAG?

RAG (Retrieval-Augmented Generation) is a technique that enhances large language models by retrieving relevant context from a knowledge base before generating responses.

Without RAG:

User Question → LLM → Answer (Limited to training data)

With RAG:

User Question → Retrieve Relevant Documents → LLM + Context → Answer (Grounded in knowledge base)
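
The two flows above can be sketched end-to-end in a few lines. This is a deliberately minimal illustration, not production code: keyword overlap stands in for vector search, and a prompt template stands in for the LLM call; all names here are hypothetical.

```python
# Minimal sketch of the RAG flow: retrieve relevant text, then build a
# grounded prompt. Keyword overlap stands in for real vector search.
from typing import List

KNOWLEDGE_BASE = [
    "RAG retrieves relevant documents before the LLM generates an answer.",
    "Fine-tuning bakes knowledge into model weights via training.",
]

def retrieve(question: str, docs: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by word overlap with the question"""
    q_words = set(question.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def answer(question: str) -> str:
    """Build a grounded prompt from retrieved context"""
    context = retrieve(question, KNOWLEDGE_BASE)
    return f"Context: {' '.join(context)}\nQuestion: {question}"

prompt = answer("How does RAG retrieve documents?")
```

In a real system, `retrieve` becomes an embedding-based similarity search and the prompt is sent to an LLM, as the later sections show.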

Why RAG for Enterprise?

1. Domain-Specific Knowledge

LLMs are trained on public internet data, but enterprises have proprietary information:

  • Internal documentation
  • Product specifications
  • Customer interactions
  • Research papers
  • Compliance documents

RAG systems enable LLMs to access this private knowledge.

2. Reduced Hallucinations

By grounding responses in retrieved documents, RAG systems:

  • Cite sources
  • Provide verifiable information
  • Reduce false claims
  • Build user trust

3. Cost-Effective

Compared to fine-tuning:

  • No model training required
  • Easy to update knowledge base
  • Lower infrastructure costs
  • Faster time to production

4. Transparency and Compliance

RAG systems provide:

  • Source attribution
  • Audit trails
  • Compliance with regulations
  • Explainable AI

RAG vs Fine-Tuning

| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Instant (add to database) | Requires retraining |
| Cost | Low ($/query) | High (training costs) |
| Domain specificity | High (source data) | Medium (pattern learning) |
| Hallucination risk | Low (grounded) | Medium (model-based) |
| Transparency | High (citations) | Low (black box) |
| Setup time | Days to weeks | Weeks to months |
| Maintenance | Ongoing indexing | Periodic retraining |

Best Use Cases for RAG:

  • Knowledge search and Q&A
  • Document analysis
  • Customer support automation
  • Research assistance
  • Compliance and legal review

Best Use Cases for Fine-Tuning:

  • Style and tone customization
  • Format standardization
  • Domain-specific reasoning
  • Specialized instruction following

System Architecture

End-to-End RAG Pipeline

┌─────────────────────────────────────────────────────────────┐
│                    KNOWLEDGE BASE                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │  Documents  │  │   Vectors   │  │  Metadata   │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
└─────────────────────────────────────────────────────────────┘
                        │
                        │ Ingestion Pipeline
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                  PROCESSING LAYER                            │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  Chunk   │→│ Embed    │→│  Index   │→│  Store   │  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
└─────────────────────────────────────────────────────────────┘
                        │
                        │ Query
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                  RETRIEVAL LAYER                            │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐           │
│  │  Query     │→│  Semantic  │→│  Hybrid    │           │
│  │ Embedding  │  │  Search    │  │  Search    │           │
│  └────────────┘  └────────────┘  └────────────┘           │
│                      │                                      │
│                      ▼                                      │
│              ┌──────────────┐                               │
│              │  Rerank &    │                               │
│              │  Filter      │                               │
│              └──────────────┘                               │
└─────────────────────────────────────────────────────────────┘
                        │
                        │ Context
                        ▼
┌─────────────────────────────────────────────────────────────┐
│                  GENERATION LAYER                            │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐           │
│  │  Prompt    │→│    LLM     │→│  Response  │           │
│  │  Building  │  │  Inference │  │  Synthesis │           │
│  └────────────┘  └────────────┘  └────────────┘           │
└─────────────────────────────────────────────────────────────┘
                        │
                        ▼
                    User Response

Component Breakdown

1. Ingestion Pipeline

# ingestion/pipeline.py
from typing import List, Dict
from pathlib import Path
import hashlib

class DocumentIngestionPipeline:
    """Process and ingest documents into knowledge base"""

    def __init__(self, config: Dict):
        self.chunker = DocumentChunker(config['chunking'])
        self.embedder = EmbeddingGenerator(config['embeddings'])
        self.vector_store = VectorStore(config['vector_db'])

    async def ingest_document(self, document: Dict) -> List[str]:
        """
        Ingest a document into the knowledge base

        Returns: List of chunk IDs
        """
        # 1. Extract text and metadata
        text = document['content']
        metadata = {
            'title': document['title'],
            'source': document['source'],
            'author': document.get('author'),
            'created_at': document.get('created_at'),
            'doc_type': document.get('type', 'unknown'),
            'language': document.get('language', 'en')
        }

        # 2. Split into chunks
        chunks = self.chunker.chunk(text)

        # 3. Generate embeddings
        chunk_texts = [chunk['text'] for chunk in chunks]
        embeddings = await self.embedder.generate_batch(chunk_texts)

        # 4. Prepare records for storage
        records = []
        for chunk, embedding in zip(chunks, embeddings):
            record = {
                'id': self._generate_chunk_id(document['id'], chunk['index']),
                'document_id': document['id'],
                'text': chunk['text'],
                'embedding': embedding,
                'metadata': {
                    **metadata,
                    'chunk_index': chunk['index'],
                    'chunk_size': len(chunk['text']),
                    'start_char': chunk['start'],
                    'end_char': chunk['end']
                }
            }
            records.append(record)

        # 5. Store in vector database
        chunk_ids = await self.vector_store.insert(records)

        return chunk_ids

    def _generate_chunk_id(self, doc_id: str, chunk_index: int) -> str:
        """Generate unique chunk ID"""
        hash_input = f"{doc_id}_{chunk_index}"
        return hashlib.sha256(hash_input.encode()).hexdigest()[:32]

2. Retrieval Engine

# retrieval/engine.py
from typing import List, Dict, Optional
import numpy as np

class RetrievalEngine:
    """Retrieve relevant documents for queries"""

    def __init__(self, vector_store, embedder, config: Dict):
        self.vector_store = vector_store
        self.embedder = embedder
        self.config = config
        self.reranker = Reranker(config.get('reranking'))

    async def retrieve(
        self,
        query: str,
        top_k: int = 10,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Retrieve relevant chunks for a query

        Args:
            query: User query
            top_k: Number of results to return
            filters: Metadata filters (e.g., {category: 'technology'})

        Returns:
            List of retrieved chunks with scores
        """
        # 1. Generate query embedding
        query_embedding = await self.embedder.generate(query)

        # 2. Semantic search
        results = await self.vector_store.similarity_search(
            query_embedding,
            top_k=top_k * 2,  # Retrieve more for reranking
            filters=filters
        )

        # 3. Rerank if configured
        if self.reranker and len(results) > top_k:
            results = await self.reranker.rerank(query, results, top_k)

        return results[:top_k]

    async def retrieve_with_hybrid_search(
        self,
        query: str,
        top_k: int = 10,
        alpha: float = 0.5,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Hybrid retrieval combining semantic and keyword search

        Args:
            query: User query
            top_k: Number of results
            alpha: Weight for semantic search (0-1)
            filters: Metadata filters

        Returns:
            Reranked combined results
        """
        # 1. Semantic search
        semantic_results = await self.vector_store.similarity_search(
            await self.embedder.generate(query),
            top_k=top_k * 2,
            filters=filters
        )

        # 2. Keyword search
        keyword_results = await self.vector_store.keyword_search(
            query,
            top_k=top_k * 2,
            filters=filters
        )

        # 3. Combine and rerank
        combined = self._combine_results(
            semantic_results,
            keyword_results,
            alpha
        )

        # 4. Rerank combined results
        if self.reranker:
            combined = await self.reranker.rerank(query, combined, top_k)

        return combined[:top_k]

    def _combine_results(
        self,
        semantic_results: List[Dict],
        keyword_results: List[Dict],
        alpha: float
    ) -> List[Dict]:
        """Combine semantic and keyword results with weighted, normalized scores"""
        def normalize(scores: np.ndarray) -> np.ndarray:
            # Min-max normalize; guard against division by zero when a result
            # list is empty or all its scores are identical
            if scores.size == 0:
                return scores
            spread = scores.max() - scores.min()
            if spread == 0:
                return np.ones_like(scores)
            return (scores - scores.min()) / spread

        sem_normalized = normalize(np.array([r['score'] for r in semantic_results]))
        key_normalized = normalize(np.array([r['score'] for r in keyword_results]))

        # Accumulate weighted scores per chunk ID so a chunk surfaced by both
        # searches receives credit from each
        results_by_id = {}
        scores_by_id = {}

        for i, result in enumerate(semantic_results):
            results_by_id[result['id']] = result
            scores_by_id[result['id']] = alpha * float(sem_normalized[i])

        for i, result in enumerate(keyword_results):
            results_by_id.setdefault(result['id'], result)
            scores_by_id[result['id']] = (
                scores_by_id.get(result['id'], 0.0)
                + (1 - alpha) * float(key_normalized[i])
            )

        # Merge and sort by combined score
        combined = list(results_by_id.values())
        for result in combined:
            result['combined_score'] = scores_by_id[result['id']]
        combined.sort(key=lambda x: x['combined_score'], reverse=True)
        return combined
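
A common alternative to weighted min-max fusion, not covered in this guide's code, is reciprocal rank fusion (RRF): it combines ranks rather than raw scores, which sidesteps the scale mismatch between BM25 and cosine similarities entirely. A minimal standalone sketch:

```python
# Reciprocal rank fusion: score(id) = sum over result lists of 1 / (k + rank).
# Ranks are 1-based; k (commonly 60) damps the dominance of top ranks.
from collections import defaultdict
from typing import Dict, List

def rrf(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one ranking"""
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["c1", "c2", "c3"]  # IDs ranked by semantic search
keyword = ["c3", "c1", "c4"]   # IDs ranked by keyword search
fused = rrf([semantic, keyword])
```

Because only ranks matter, RRF needs no score normalization and is robust when the two retrievers' score distributions differ wildly.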

3. Response Generator

# generation/generator.py
from typing import List, Dict, Optional
import openai

class ResponseGenerator:
    """Generate responses using retrieved context"""

    def __init__(self, config: Dict):
        self.client = openai.AsyncOpenAI(api_key=config['api_key'])
        self.model = config['model']
        self.temperature = config.get('temperature', 0.3)
        self.max_tokens = config.get('max_tokens', 1000)

    async def generate_response(
        self,
        query: str,
        context: List[Dict],
        conversation_history: Optional[List[Dict]] = None
    ) -> Dict:
        """
        Generate response using retrieved context

        Args:
            query: User query
            context: Retrieved chunks
            conversation_history: Previous messages (for chat)

        Returns:
            Generated response with citations
        """
        # 1. Build prompt with context
        prompt = self._build_prompt(query, context)

        # 2. Generate response
        messages = self._build_messages(prompt, conversation_history)

        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens
        )

        # 3. Extract response and citations
        answer = response.choices[0].message.content
        citations = self._extract_citations(response, context)

        return {
            'answer': answer,
            'citations': citations,
            'sources': self._get_unique_sources(context),
            'model': self.model,
            'tokens_used': response.usage.total_tokens
        }

    def _build_prompt(self, query: str, context: List[Dict]) -> str:
        """Build prompt with context"""
        context_str = "\n\n".join([
            f"[Source {i+1}]\n{chunk['text']}"
            for i, chunk in enumerate(context)
        ])

        prompt = f"""You are a helpful assistant that answers questions based on the provided context.

Context:
{context_str}

Question: {query}

Instructions:
1. Answer the question using only the provided context
2. If the context doesn't contain enough information, say so
3. Cite sources using [Source X] notation
4. Be concise and accurate
5. If asked for sources, provide them

Answer:"""

        return prompt

    def _build_messages(
        self,
        prompt: str,
        history: Optional[List[Dict]] = None
    ) -> List[Dict]:
        """Build message list for API"""
        messages = []

        if history:
            messages.extend(history)

        messages.append({
            "role": "user",
            "content": prompt
        })

        return messages

    def _extract_citations(
        self,
        response: "openai.types.chat.ChatCompletion",
        context: List[Dict]
    ) -> List[Dict]:
        """Extract citations from response"""
        answer = response.choices[0].message.content

        # Find source references like [Source 1], [Source 2], etc.
        import re
        citations = re.findall(r'\[Source (\d+)\]', answer)

        # Map to actual source chunks
        unique_citations = []
        for citation in set(citations):
            idx = int(citation) - 1  # Convert to 0-based index
            if idx < len(context):
                unique_citations.append({
                    'index': int(citation),
                    'chunk_id': context[idx]['id'],
                    'document_id': context[idx]['metadata']['document_id'],
                    'title': context[idx]['metadata']['title'],
                    'source': context[idx]['metadata']['source']
                })

        return unique_citations

    def _get_unique_sources(self, context: List[Dict]) -> List[Dict]:
        """Get unique sources from context"""
        seen = set()
        sources = []

        for chunk in context:
            doc_id = chunk['metadata']['document_id']
            if doc_id not in seen:
                seen.add(doc_id)
                sources.append({
                    'document_id': doc_id,
                    'title': chunk['metadata']['title'],
                    'source': chunk['metadata']['source'],
                    'author': chunk['metadata'].get('author'),
                    'created_at': chunk['metadata'].get('created_at')
                })

        return sources

Vector Database Selection

Comparison Matrix

| Database | Open Source | Cloud Managed | Performance | Scalability | Features | Cost |
|---|---|---|---|---|---|---|
| pgvector | ✅ | ✅ (Supabase, etc.) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Relational DB + vectors | $ |
| Pinecone | ❌ | ✅ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Purpose-built, easy | $$$ |
| Weaviate | ✅ | ✅ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | GraphQL, multi-modal | $$ |
| Qdrant | ✅ | ✅ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Filter optimization, hybrid | $$ |
| Milvus | ✅ | ✅ (Zilliz) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Distributed, cloud-native | $$ |
| Chroma | ✅ | — | ⭐⭐⭐ | ⭐⭐⭐ | Simple, embedded | Free |

Selection Criteria

Choose pgvector if:

  • Already using PostgreSQL
  • Need ACID transactions
  • Want to minimize infrastructure
  • Budget-conscious
  • Need SQL joins with vector search

Choose Pinecone if:

  • Want fully managed solution
  • Need auto-scaling
  • Prioritize ease of setup
  • Have budget for managed service
  • Want fastest time to production

Choose Qdrant if:

  • Need advanced filtering
  • Want hybrid search capabilities
  • Require high performance
  • Prefer open-source with managed option

Choose Weaviate if:

  • Need multi-modal search (text + image)
  • Want GraphQL API
  • Require modular architecture
  • Building knowledge graphs

Our Choice: pgvector

We recommend pgvector for most enterprise RAG systems because:

1. Unified Data Model

-- Single query for vectors + metadata
SELECT
  d.title,
  d.content,
  d.metadata->>'category' as category,
  1 - (d.embedding <=> query_embedding) as similarity
FROM documents d
JOIN document_tags dt ON d.id = dt.document_id
WHERE d.status = 'published'
  AND dt.tag_id = ANY(SELECT id FROM tags WHERE name IN ('AI', 'ML'))
  AND d.created_at > NOW() - INTERVAL '1 year'
ORDER BY d.embedding <=> query_embedding
LIMIT 20;
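
For reference, the query above assumes a schema roughly like the following. The column names are inferred from the query itself, so treat this as a sketch rather than the production schema:

```sql
-- Hypothetical schema matching the query above (pgvector extension required)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id         BIGSERIAL PRIMARY KEY,
  title      TEXT NOT NULL,
  content    TEXT NOT NULL,
  status     TEXT NOT NULL DEFAULT 'draft',
  metadata   JSONB NOT NULL DEFAULT '{}',
  embedding  VECTOR(1536),          -- dimension must match the embedding model
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE tags (
  id   BIGSERIAL PRIMARY KEY,
  name TEXT UNIQUE NOT NULL
);

CREATE TABLE document_tags (
  document_id BIGINT REFERENCES documents(id),
  tag_id      BIGINT REFERENCES tags(id),
  PRIMARY KEY (document_id, tag_id)
);
```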

2. Cost Effective

  • No separate vector database needed
  • Use existing PostgreSQL infrastructure
  • Self-hosted option available
  • Typically far cheaper than managed vector-database services

3. Mature Ecosystem

  • Backup/restore tools
  • Replication and HA
  • Monitoring and observability
  • ORM support (SQLAlchemy, Django ORM)

4. Performance

-- With proper indexing
CREATE INDEX idx_documents_embedding_hnsw ON documents
  USING hnsw(embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Query performance: 15-30ms for 1M vectors
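
Recall versus latency can also be tuned at query time via pgvector's `hnsw.ef_search` setting (default 40); higher values examine more of the HNSW graph for better recall at some latency cost:

```sql
-- Per-session, query-time tuning (pgvector)
SET hnsw.ef_search = 100;  -- default is 40; higher = better recall, slower
```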

Embedding Strategies

Model Selection

| Model | Dimensions | Context Length | Speed | Quality | Cost / 1M tokens |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 | Fast | Good | $0.02 |
| text-embedding-3-large | 3072 | 8191 | Medium | Excellent | $0.13 |
| text-embedding-ada-002 | 1536 | 8191 | Fast | Good | $0.10 |
| bge-large-en-v1.5 | 1024 | 512 | Fast | Excellent | Free (self-hosted) |
| e5-large-v2 | 1024 | 512 | Fast | Very Good | Free (self-hosted) |

Recommendation

For most enterprise use cases: text-embedding-3-small

# Assumes the LangChain integration package: pip install langchain-openai
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536  # Can truncate (e.g., to 512) for faster search
)

Why?

  • Best price/performance ratio
  • Good quality for most domains
  • Long context (8191 tokens)
  • Multi-language support
  • Lower storage costs
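
The "truncate for faster search" note above can be sketched with plain NumPy: keep the first k components, then re-normalize so cosine similarity stays well-scaled. This relies on the embedding model being trained to tolerate truncation, as OpenAI's text-embedding-3 models are; the vectors here are random stand-ins.

```python
# Truncate an embedding to its first k dimensions, then L2-renormalize
# so downstream cosine similarity remains well-scaled.
import numpy as np

def truncate_embedding(vec: np.ndarray, k: int = 512) -> np.ndarray:
    """Keep the first k components and renormalize to unit length"""
    truncated = vec[:k]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

rng = np.random.default_rng(0)
full = rng.normal(size=1536)          # stand-in for a real embedding
short = truncate_embedding(full, 512)  # 3x smaller index, unit length
```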

For specialized domains: Open-source models (self-hosted)

# For legal/medical/technical content
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(texts)

Embedding Optimization

1. Dimensionality Reduction

# Reduce from 1536 to 512 dimensions (faster search, lower storage)
import numpy as np
from sklearn.decomposition import PCA

def fit_reducer(embeddings: np.ndarray, target_dim: int = 512) -> PCA:
    """Fit a PCA reducer on the document embeddings"""
    pca = PCA(n_components=target_dim)
    pca.fit(embeddings)
    return pca

# Usage: fit once on the document embeddings and reuse the SAME fitted
# reducer for query embeddings, otherwise queries and documents end up
# in different projected spaces and similarities are meaningless
full_embeddings = np.array([...])  # (N, 1536)
reducer = fit_reducer(full_embeddings, 512)
reduced_embeddings = reducer.transform(full_embeddings)

Trade-offs:

  • 1536 dims: Best quality, slower search
  • 768 dims: Good balance
  • 512 dims: Faster search, slight quality loss
  • 256 dims: Fastest search, noticeable quality loss

2. Hybrid Embeddings

# Combine dense semantic embeddings with sparse keyword (BM25) vectors
class HybridEmbedding:
    def __init__(self):
        self.semantic_model = OpenAIEmbeddings(model="text-embedding-3-small")
        self.bm25 = BM25Encoder()  # e.g., pinecone_text.sparse.BM25Encoder

    def embed_documents(self, texts: List[str]) -> Dict[str, np.ndarray]:
        """Generate both semantic and keyword embeddings"""
        semantic = self.semantic_model.embed_documents(texts)
        keyword = self.bm25.encode_documents(texts)

        return {
            'semantic': np.array(semantic),
            'keyword': np.array(keyword)
        }

3. Query Expansion

# Expand queries with related terms for better retrieval
async def expand_query(query: str, llm) -> List[str]:
    """Generate query variations"""
    prompt = f"""Generate 3-5 alternative queries for: "{query}"

    Consider:
    - Synonyms
    - Related concepts
    - Different phrasings
    - Broader/narrower terms

    Return one query per line."""

    response = await llm.generate(prompt)
    variations = [line.strip() for line in response.split('\n') if line.strip()]

    return [query] + variations

# Usage
query_variations = await expand_query("How to implement RAG?", llm)
# Returns: [
#   "How to implement RAG?",
#   "Building retrieval-augmented generation systems",
#   "RAG implementation guide",
#   "Creating RAG applications",
#   "RAG system architecture"
# ]

Chunking Techniques

Why Chunking Matters

Chunking is the most critical decision in RAG systems:

  • Too small → Loss of context
  • Too large → Noisy retrieval, slow generation
  • Poor boundaries → Fragmented information
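
These trade-offs are easy to quantify: with chunk size C and overlap O, the stride is C − O, so a document of length N yields roughly ceil((N − O) / (C − O)) chunks, and halving the chunk size roughly doubles both storage and the candidate pool at retrieval time. A quick sketch (function name is illustrative):

```python
# How chunk size and overlap determine chunk count for a document.
import math

def num_chunks(doc_len: int, chunk_size: int, overlap: int) -> int:
    """Number of fixed-size chunks; stride per chunk = chunk_size - overlap"""
    if doc_len <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return math.ceil((doc_len - overlap) / stride)

small = num_chunks(100_000, 500, 100)   # small chunks -> many candidates
large = num_chunks(100_000, 2000, 200)  # large chunks -> fewer, noisier hits
```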

Chunking Strategies

1. Fixed-Size Chunking

# chunking/fixed_size.py
from typing import List, Dict

class FixedSizeChunker:
    """Split text into fixed-size chunks"""

    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str) -> List[Dict]:
        """Split text into chunks"""
        chunks = []
        start = 0
        chunk_index = 0

        while start < len(text):
            end = min(start + self.chunk_size, len(text))
            chunk_text = text[start:end]

            chunks.append({
                'text': chunk_text,
                'index': chunk_index,
                'start': start,
                'end': end,
                'size': len(chunk_text)
            })

            chunk_index += 1
            if end == len(text):
                break  # Done; avoids re-emitting an overlapping tail chunk
            start = end - self.overlap

        return chunks

# Pros: Simple, predictable
# Cons: May split sentences, loses context

2. Sentence-Based Chunking

# chunking/sentence.py
import re
from typing import List, Dict

class SentenceChunker:
    """Split text into sentence-based chunks"""

    def __init__(self, sentences_per_chunk: int = 5, overlap: int = 1):
        self.sentences_per_chunk = sentences_per_chunk
        self.overlap = overlap

    def chunk(self, text: str) -> List[Dict]:
        """Split text into sentence-based chunks"""
        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', text)

        chunks = []
        chunk_index = 0
        i = 0

        while i < len(sentences):
            # Get sentences for this chunk
            end = min(i + self.sentences_per_chunk, len(sentences))
            chunk_sentences = sentences[i:end]
            chunk_text = ' '.join(chunk_sentences)

            start_char = text.find(chunk_sentences[0])
            end_char = start_char + len(chunk_text)

            chunks.append({
                'text': chunk_text,
                'index': chunk_index,
                'start': start_char,
                'end': end_char,
                'size': len(chunk_text),
                'sentence_count': len(chunk_sentences)
            })

            chunk_index += 1
            i += self.sentences_per_chunk - self.overlap

        return chunks

# Pros: Preserves sentence boundaries, better context
# Cons: Variable chunk sizes, may be too short/long

3. Semantic Chunking (Recommended)

# chunking/semantic.py
from typing import List, Dict
import numpy as np

class SemanticChunker:
    """Split text into semantically coherent chunks"""

    def __init__(self, embedder, max_chunk_size: int = 1500, threshold: float = 0.7):
        self.embedder = embedder
        self.max_chunk_size = max_chunk_size
        self.threshold = threshold

    async def chunk(self, text: str) -> List[Dict]:
        """Split text into semantic chunks"""
        # 1. Split into sentences
        sentences = self._split_sentences(text)

        # 2. Generate embeddings for each sentence
        sentence_embeddings = await self.embedder.embed_documents(sentences)

        # 3. Calculate similarities between consecutive sentences
        similarities = [
            self._cosine_similarity(sentence_embeddings[i], sentence_embeddings[i+1])
            for i in range(len(sentence_embeddings) - 1)
        ]

        # 4. Identify chunk boundaries (where similarity drops below threshold)
        boundaries = [0]
        for i, sim in enumerate(similarities):
            if sim < self.threshold:
                boundaries.append(i + 1)
        boundaries.append(len(sentences))

        # 5. Create chunks
        chunks = []
        chunk_index = 0

        for i in range(len(boundaries) - 1):
            start_idx = boundaries[i]
            end_idx = boundaries[i+1]

            # Combine sentences in this segment
            chunk_sentences = sentences[start_idx:end_idx]
            chunk_text = ' '.join(chunk_sentences)

            # Further split if chunk is too long
            if len(chunk_text) > self.max_chunk_size:
                sub_chunks = self._split_long_chunk(chunk_text, self.max_chunk_size)
                for sub_chunk in sub_chunks:
                    chunks.append({
                        'text': sub_chunk,
                        'index': chunk_index,
                        'type': 'semantic'
                    })
                    chunk_index += 1
            else:
                chunks.append({
                    'text': chunk_text,
                    'index': chunk_index,
                    'sentence_count': len(chunk_sentences),
                    'type': 'semantic'
                })
                chunk_index += 1

        return chunks

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences"""
        import re
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        """Calculate cosine similarity"""
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    def _split_long_chunk(self, text: str, max_size: int) -> List[str]:
        """Split long chunk into smaller pieces"""
        # Fallback to fixed-size splitting
        chunks = []
        start = 0
        while start < len(text):
            end = start + max_size
            chunks.append(text[start:end])
            start = end - 200  # Add overlap
        return chunks

# Pros: Semantically coherent, better retrieval
# Cons: Slower (requires embeddings), more complex

4. Hierarchical Chunking

# chunking/hierarchical.py
class HierarchicalChunker:
    """Create multi-level chunk hierarchy for different use cases"""

    def __init__(self, embedder):
        self.embedder = embedder

    async def chunk(self, text: str, document_id: str) -> Dict[str, List[Dict]]:
        """Create hierarchical chunks"""
        # Level 1: Document-level (for broad queries)
        doc_chunk = {
            'id': f"{document_id}_doc",
            'level': 'document',
            'text': text[:2000],  # Summary/first part
            'metadata': {'type': 'document_summary'}
        }

        # Level 2: Section-level (for medium queries)
        section_chunks = self._chunk_by_sections(text)

        # Level 3: Paragraph-level (for specific queries)
        paragraph_chunks = self._chunk_by_paragraphs(text)

        # Level 4: Sentence-level (for precise queries)
        sentence_chunks = self._chunk_by_sentences(text)

        return {
            'document': [doc_chunk],
            'sections': section_chunks,