RAG Systems in Production: Building Enterprise Knowledge Search
Introduction
Retrieval-Augmented Generation (RAG) has revolutionized how enterprises build intelligent knowledge systems. By combining the power of large language models with domain-specific knowledge, RAG systems can answer questions, synthesize information, and provide insights that pure LLMs cannot achieve alone.
At Groovy Web, we've built and deployed RAG systems for Fortune 500 companies, helping them unlock the value of their organizational knowledge. This guide captures everything we've learned from building production RAG systems that serve millions of queries per month.
Table of Contents
- Understanding RAG Systems
- System Architecture
- Vector Database Selection
- Embedding Strategies
- Chunking Techniques
- Retrieval Optimization
- Generation and Synthesis
- Evaluation and Quality Assurance
- Scaling Considerations
- Production Deployment
- Monitoring and Observability
- Real-World Implementation
Understanding RAG Systems
What is RAG?
RAG (Retrieval-Augmented Generation) is a technique that enhances large language models by retrieving relevant context from a knowledge base before generating responses.
Without RAG:
User Question → LLM → Answer (Limited to training data)
With RAG:
User Question → Retrieve Relevant Documents → LLM + Context → Answer (Grounded in knowledge base)
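The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline described later; `search` and `llm` are hypothetical stand-ins for a real vector store query and model client:

```python
def answer_with_rag(question: str, search, llm, top_k: int = 3) -> str:
    """Minimal RAG loop: retrieve context, then generate a grounded answer."""
    # 1. Retrieve the top-k most relevant passages for the question
    passages = search(question, top_k)
    # 2. Assemble a prompt that grounds the model in the retrieved context
    context = "\n\n".join(passages)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate the final answer
    return llm(prompt)
```

Everything that follows in this guide elaborates the two calls in this sketch: `search` becomes a full ingestion + retrieval stack, and `llm` becomes a prompt-building and synthesis layer.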
Why RAG for Enterprise?
1. Domain-Specific Knowledge
LLMs are trained on public internet data, but enterprises have proprietary information:
- Internal documentation
- Product specifications
- Customer interactions
- Research papers
- Compliance documents
RAG systems enable LLMs to access this private knowledge.
2. Reduced Hallucinations
By grounding responses in retrieved documents, RAG systems:
- Cite sources
- Provide verifiable information
- Reduce false claims
- Build user trust
3. Cost-Effective
Compared to fine-tuning:
- No model training required
- Easy to update knowledge base
- Lower infrastructure costs
- Faster time to production
4. Transparency and Compliance
RAG systems provide:
- Source attribution
- Audit trails
- Compliance with regulations
- Explainable AI
RAG vs Fine-Tuning
| Aspect | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Instant (add to database) | Requires retraining |
| Cost | Low ($/query) | High (training costs) |
| Domain specificity | High (source data) | Medium (pattern learning) |
| Hallucination risk | Low (grounded) | Medium (model-based) |
| Transparency | High (citations) | Low (black box) |
| Setup time | Days to weeks | Weeks to months |
| Maintenance | Ongoing indexing | Periodic retraining |
Best Use Cases for RAG:
- Knowledge search and Q&A
- Document analysis
- Customer support automation
- Research assistance
- Compliance and legal review
Best Use Cases for Fine-Tuning:
- Style and tone customization
- Format standardization
- Domain-specific reasoning
- Specialized instruction following
System Architecture
End-to-End RAG Pipeline
┌─────────────────────────────────────────────────────────────┐
│ KNOWLEDGE BASE │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Documents │ │ Vectors │ │ Metadata │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Ingestion Pipeline
▼
┌─────────────────────────────────────────────────────────────┐
│ PROCESSING LAYER │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Chunk │→│ Embed │→│ Index │→│ Store │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Query
▼
┌─────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Query │→│ Semantic │→│ Hybrid │ │
│ │ Embedding │ │ Search │ │ Search │ │
│ └────────────┘ └────────────┘ └────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ │
│ │ Rerank & │ │
│ │ Filter │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Context
▼
┌─────────────────────────────────────────────────────────────┐
│ GENERATION LAYER │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ Prompt │→│ LLM │→│ Response │ │
│ │ Building │ │ Inference │ │ Synthesis │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
User Response
Component Breakdown
1. Ingestion Pipeline
# ingestion/pipeline.py
from typing import List, Dict
from pathlib import Path
import hashlib

class DocumentIngestionPipeline:
    """Process and ingest documents into knowledge base"""

    def __init__(self, config: Dict):
        self.chunker = DocumentChunker(config['chunking'])
        self.embedder = EmbeddingGenerator(config['embeddings'])
        self.vector_store = VectorStore(config['vector_db'])

    async def ingest_document(self, document: Dict) -> List[str]:
        """
        Ingest a document into the knowledge base

        Returns: List of chunk IDs
        """
        # 1. Extract text and metadata
        text = document['content']
        metadata = {
            'title': document['title'],
            'source': document['source'],
            'author': document.get('author'),
            'created_at': document.get('created_at'),
            'doc_type': document.get('type', 'unknown'),
            'language': document.get('language', 'en')
        }

        # 2. Split into chunks
        chunks = self.chunker.chunk(text)

        # 3. Generate embeddings
        chunk_texts = [chunk['text'] for chunk in chunks]
        embeddings = await self.embedder.generate_batch(chunk_texts)

        # 4. Prepare records for storage
        records = []
        for chunk, embedding in zip(chunks, embeddings):
            record = {
                'id': self._generate_chunk_id(document['id'], chunk['index']),
                'document_id': document['id'],
                'text': chunk['text'],
                'embedding': embedding,
                'metadata': {
                    **metadata,
                    'chunk_index': chunk['index'],
                    'chunk_size': len(chunk['text']),
                    'start_char': chunk['start'],
                    'end_char': chunk['end']
                }
            }
            records.append(record)

        # 5. Store in vector database
        chunk_ids = await self.vector_store.insert(records)
        return chunk_ids

    def _generate_chunk_id(self, doc_id: str, chunk_index: int) -> str:
        """Generate unique chunk ID"""
        hash_input = f"{doc_id}_{chunk_index}"
        return hashlib.sha256(hash_input.encode()).hexdigest()[:32]
2. Retrieval Engine
# retrieval/engine.py
from typing import List, Dict, Optional
import numpy as np

class RetrievalEngine:
    """Retrieve relevant documents for queries"""

    def __init__(self, vector_store, embedder, config: Dict):
        self.vector_store = vector_store
        self.embedder = embedder
        self.config = config
        self.reranker = Reranker(config.get('reranking'))

    async def retrieve(
        self,
        query: str,
        top_k: int = 10,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Retrieve relevant chunks for a query

        Args:
            query: User query
            top_k: Number of results to return
            filters: Metadata filters (e.g., {category: 'technology'})

        Returns:
            List of retrieved chunks with scores
        """
        # 1. Generate query embedding
        query_embedding = await self.embedder.generate(query)

        # 2. Semantic search
        results = await self.vector_store.similarity_search(
            query_embedding,
            top_k=top_k * 2,  # Retrieve more for reranking
            filters=filters
        )

        # 3. Rerank if configured
        if self.reranker and len(results) > top_k:
            results = await self.reranker.rerank(query, results, top_k)

        return results[:top_k]

    async def retrieve_with_hybrid_search(
        self,
        query: str,
        top_k: int = 10,
        alpha: float = 0.5,
        filters: Optional[Dict] = None
    ) -> List[Dict]:
        """
        Hybrid retrieval combining semantic and keyword search

        Args:
            query: User query
            top_k: Number of results
            alpha: Weight for semantic search (0-1)
            filters: Metadata filters

        Returns:
            Reranked combined results
        """
        # 1. Semantic search
        semantic_results = await self.vector_store.similarity_search(
            await self.embedder.generate(query),
            top_k=top_k * 2,
            filters=filters
        )

        # 2. Keyword search
        keyword_results = await self.vector_store.keyword_search(
            query,
            top_k=top_k * 2,
            filters=filters
        )

        # 3. Combine and rerank
        combined = self._combine_results(
            semantic_results,
            keyword_results,
            alpha
        )

        # 4. Rerank combined results
        if self.reranker:
            combined = await self.reranker.rerank(query, combined, top_k)

        return combined[:top_k]
    def _combine_results(
        self,
        semantic_results: List[Dict],
        keyword_results: List[Dict],
        alpha: float
    ) -> List[Dict]:
        """Combine semantic and keyword search results"""
        def normalize(results: List[Dict]) -> Dict[str, float]:
            """Min-max normalize scores, keyed by result ID"""
            if not results:
                return {}
            scores = np.array([r['score'] for r in results])
            score_range = scores.max() - scores.min()
            if score_range == 0:
                return {r['id']: 1.0 for r in results}
            normalized = (scores - scores.min()) / score_range
            return {r['id']: float(s) for r, s in zip(results, normalized)}

        sem_scores = normalize(semantic_results)
        key_scores = normalize(keyword_results)

        # Merge, weighting semantic vs keyword scores by alpha.
        # A result missing from one list contributes 0 for that component.
        seen = set()
        combined = []
        for result in semantic_results + keyword_results:
            if result['id'] not in seen:
                seen.add(result['id'])
                result['combined_score'] = (
                    alpha * sem_scores.get(result['id'], 0.0)
                    + (1 - alpha) * key_scores.get(result['id'], 0.0)
                )
                combined.append(result)

        combined.sort(key=lambda x: x['combined_score'], reverse=True)
        return combined
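Min-max score fusion is sensitive to outlier scores and assumes the two score distributions are comparable after normalization. A common alternative (not used in the engine above, but widely applied in hybrid search) is Reciprocal Rank Fusion, which combines rankings instead of raw scores. A minimal sketch:

```python
from typing import Dict, List

def reciprocal_rank_fusion(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each item scores sum(1 / (k + rank)) across lists."""
    scores: Dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

With `k = 60` (the constant from the original RRF paper), items that rank highly in both lists rise to the top, and no score normalization is needed at all.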
3. Response Generator
# generation/generator.py
from typing import List, Dict, Optional
import openai

class ResponseGenerator:
    """Generate responses using retrieved context"""

    def __init__(self, config: Dict):
        self.client = openai.AsyncClient(api_key=config['api_key'])
        self.model = config['model']
        self.temperature = config.get('temperature', 0.3)
        self.max_tokens = config.get('max_tokens', 1000)

    async def generate_response(
        self,
        query: str,
        context: List[Dict],
        conversation_history: Optional[List[Dict]] = None
    ) -> Dict:
        """
        Generate response using retrieved context

        Args:
            query: User query
            context: Retrieved chunks
            conversation_history: Previous messages (for chat)

        Returns:
            Generated response with citations
        """
        # 1. Build prompt with context
        prompt = self._build_prompt(query, context)

        # 2. Generate response
        messages = self._build_messages(prompt, conversation_history)
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=self.temperature,
            max_tokens=self.max_tokens
        )

        # 3. Extract response and citations
        answer = response.choices[0].message.content
        citations = self._extract_citations(response, context)

        return {
            'answer': answer,
            'citations': citations,
            'sources': self._get_unique_sources(context),
            'model': self.model,
            'tokens_used': response.usage.total_tokens
        }

    def _build_prompt(self, query: str, context: List[Dict]) -> str:
        """Build prompt with context"""
        context_str = "\n\n".join([
            f"[Source {i+1}]\n{chunk['text']}"
            for i, chunk in enumerate(context)
        ])

        prompt = f"""You are a helpful assistant that answers questions based on the provided context.

Context:
{context_str}

Question: {query}

Instructions:
1. Answer the question using only the provided context
2. If the context doesn't contain enough information, say so
3. Cite sources using [Source X] notation
4. Be concise and accurate
5. If asked for sources, provide them

Answer:"""
        return prompt
    def _build_messages(
        self,
        prompt: str,
        history: Optional[List[Dict]] = None
    ) -> List[Dict]:
        """Build message list for API"""
        messages = []
        if history:
            messages.extend(history)
        messages.append({
            "role": "user",
            "content": prompt
        })
        return messages

    def _extract_citations(
        self,
        response: "openai.types.chat.ChatCompletion",
        context: List[Dict]
    ) -> List[Dict]:
        """Extract citations from response"""
        answer = response.choices[0].message.content

        # Find source references like [Source 1], [Source 2], etc.
        import re
        citations = re.findall(r'\[Source (\d+)\]', answer)

        # Map to actual source chunks
        unique_citations = []
        for citation in set(citations):
            idx = int(citation) - 1  # Convert to 0-based index
            if idx < len(context):
                unique_citations.append({
                    'index': int(citation),
                    'chunk_id': context[idx]['id'],
                    'document_id': context[idx]['metadata']['document_id'],
                    'title': context[idx]['metadata']['title'],
                    'source': context[idx]['metadata']['source']
                })
        return unique_citations

    def _get_unique_sources(self, context: List[Dict]) -> List[Dict]:
        """Get unique sources from context"""
        seen = set()
        sources = []
        for chunk in context:
            doc_id = chunk['metadata']['document_id']
            if doc_id not in seen:
                seen.add(doc_id)
                sources.append({
                    'document_id': doc_id,
                    'title': chunk['metadata']['title'],
                    'source': chunk['metadata']['source'],
                    'author': chunk['metadata'].get('author'),
                    'created_at': chunk['metadata'].get('created_at')
                })
        return sources
Vector Database Selection
Comparison Matrix
| Database | Open Source | Cloud Managed | Performance | Scalability | Features | Cost |
|---|---|---|---|---|---|---|
| pgvector | ✅ | ✅ (Supabase, etc.) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Relational DB + vectors | $ |
| Pinecone | ❌ | ✅ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Purpose-built, easy | $$$ |
| Weaviate | ✅ | ✅ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | GraphQL, multi-modal | $$ |
| Qdrant | ✅ | ✅ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Filter optimization, hybrid | $$ |
| Milvus | ✅ | ✅ (Zilliz) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Distributed, cloud-native | $$ |
| Chroma | ✅ | ❌ | ⭐⭐⭐ | ⭐⭐⭐ | Simple, embedded | Free |
Selection Criteria
Choose pgvector if:
- Already using PostgreSQL
- Need ACID transactions
- Want to minimize infrastructure
- Budget-conscious
- Need SQL joins with vector search
Choose Pinecone if:
- Want fully managed solution
- Need auto-scaling
- Prioritize ease of setup
- Have budget for managed service
- Want fastest time to production
Choose Qdrant if:
- Need advanced filtering
- Want hybrid search capabilities
- Require high performance
- Prefer open-source with managed option
Choose Weaviate if:
- Need multi-modal search (text + image)
- Want GraphQL API
- Require modular architecture
- Building knowledge graphs
Our Choice: pgvector
We recommend pgvector for most enterprise RAG systems because:
1. Unified Data Model
-- Single query for vectors + metadata
SELECT
d.title,
d.content,
d.metadata->>'category' as category,
1 - (d.embedding <=> query_embedding) as similarity
FROM documents d
JOIN document_tags dt ON d.id = dt.document_id
WHERE d.status = 'published'
AND dt.tag_id = ANY(SELECT id FROM tags WHERE name IN ('AI', 'ML'))
AND d.created_at > NOW() - INTERVAL '1 year'
ORDER BY d.embedding <=> query_embedding
LIMIT 20;
2. Cost Effective
- No separate vector database needed
- Use existing PostgreSQL infrastructure
- Self-hosted option available
- Often far cheaper than managed vector-database alternatives
3. Mature Ecosystem
- Backup/restore tools
- Replication and HA
- Monitoring and observability
- ORM support (SQLAlchemy, Django ORM)
4. Performance
-- With proper indexing
CREATE INDEX idx_documents_embedding_hnsw ON documents
USING hnsw(embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Query performance: 15-30ms for 1M vectors
Embedding Strategies
Model Selection
| Model | Dimensions | Context Length | Speed | Quality | Cost/1M tokens |
|---|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8191 | Fast | Good | $0.02 |
| text-embedding-3-large | 3072 | 8191 | Medium | Excellent | $0.13 |
| text-embedding-ada-002 | 1536 | 8191 | Fast | Good | $0.10 |
| bge-large-en-v1.5 | 1024 | 512 | Fast | Excellent | Free (self-hosted) |
| e5-large-v2 | 1024 | 512 | Fast | Very Good | Free (self-hosted) |
Recommendation
For most enterprise use cases: text-embedding-3-small
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    dimensions=1536  # Can truncate to 512 for faster search
)
Why?
- Best price/performance ratio
- Good quality for most domains
- Long context (8191 tokens)
- Multi-language support
- Lower storage costs
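On the truncation point above: the `dimensions` parameter works because the text-embedding-3 models were reportedly trained with a Matryoshka-style objective, so embeddings can be shortened client-side as well, provided you renormalize afterwards. A pure-NumPy sketch (no API call involved):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, target_dim: int = 512) -> np.ndarray:
    """Keep the first target_dim components and renormalize to unit length."""
    truncated = embedding[:target_dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated
```

Cosine similarity over the truncated, renormalized vectors then approximates similarity over the full embeddings, at a fraction of the storage and search cost.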
For specialized domains: Open-source models (self-hosted)
# For legal/medical/technical content
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embeddings = model.encode(texts)
Embedding Optimization
1. Dimensionality Reduction
# Reduce from 1536 to 512 dimensions (faster search, lower storage)
import numpy as np
from sklearn.decomposition import PCA

def fit_dimension_reducer(embeddings: np.ndarray, target_dim: int = 512) -> PCA:
    """Fit a PCA reducer on corpus embeddings (fit once, reuse for queries)"""
    pca = PCA(n_components=target_dim)
    pca.fit(embeddings)
    return pca

# Usage: fit on the corpus once, then apply the SAME transform to
# both document and query embeddings
full_embeddings = np.array([...])  # (N, 1536)
reducer = fit_dimension_reducer(full_embeddings, 512)
reduced_embeddings = reducer.transform(full_embeddings)
Trade-offs:
- 1536 dims: Best quality, slower search
- 768 dims: Good balance
- 512 dims: Faster search, slight quality loss
- 256 dims: Fastest search, noticeable quality loss
2. Hybrid Embeddings
# Combine semantic and keyword embeddings
from typing import Dict, List
import numpy as np
from langchain_openai import OpenAIEmbeddings
from pinecone_text.sparse import BM25Encoder

class HybridEmbedding:
    def __init__(self):
        self.semantic_model = OpenAIEmbeddings(model="text-embedding-3-small")
        self.bm25 = BM25Encoder()

    def embed_documents(self, texts: List[str]) -> Dict[str, np.ndarray]:
        """Generate both semantic and keyword embeddings"""
        semantic = self.semantic_model.embed_documents(texts)
        keyword = self.bm25.encode_documents(texts)
        return {
            'semantic': np.array(semantic),
            'keyword': np.array(keyword)
        }
3. Query Expansion
# Expand queries with related terms for better retrieval
from typing import List

async def expand_query(query: str, llm) -> List[str]:
    """Generate query variations"""
    prompt = f"""Generate 3-5 alternative queries for: "{query}"

Consider:
- Synonyms
- Related concepts
- Different phrasings
- Broader/narrower terms

Return one query per line."""

    response = await llm.generate(prompt)
    variations = [line.strip() for line in response.split('\n') if line.strip()]
    return [query] + variations

# Usage
query_variations = await expand_query("How to implement RAG?", llm)
# Returns, for example: [
#     "How to implement RAG?",
#     "Building retrieval-augmented generation systems",
#     "RAG implementation guide",
#     "Creating RAG applications",
#     "RAG system architecture"
# ]
Chunking Techniques
Why Chunking Matters
Chunking is one of the most consequential design decisions in a RAG system:
- Too small → Loss of context
- Too large → Noisy retrieval, slow generation
- Poor boundaries → Fragmented information
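The size/overlap trade-off also has a direct cost implication: with chunk size s and overlap o, each window advances by s - o characters, so a document of length n yields roughly ceil((n - o) / (s - o)) chunks, each of which must be embedded and stored. A quick sanity check with hypothetical numbers:

```python
import math

def estimate_chunk_count(doc_length: int, chunk_size: int, overlap: int) -> int:
    """Approximate number of fixed-size chunks for a document."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return max(1, math.ceil((doc_length - overlap) / step))

# A 100k-character document at 1000-char chunks with 200-char overlap:
print(estimate_chunk_count(100_000, 1000, 200))  # → 125
```

Halving the chunk size roughly doubles embedding and storage costs, which is worth keeping in mind when comparing the strategies below.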
Chunking Strategies
1. Fixed-Size Chunking
# chunking/fixed_size.py
from typing import List, Dict

class FixedSizeChunker:
    """Split text into fixed-size chunks"""

    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk(self, text: str) -> List[Dict]:
        """Split text into chunks"""
        chunks = []
        start = 0
        chunk_index = 0

        while start < len(text):
            end = start + self.chunk_size
            chunk_text = text[start:end]

            chunks.append({
                'text': chunk_text,
                'index': chunk_index,
                'start': start,
                'end': end,
                'size': len(chunk_text)
            })
            chunk_index += 1
            start = end - self.overlap

        return chunks

# Pros: Simple, predictable
# Cons: May split sentences, loses context
2. Sentence-Based Chunking
# chunking/sentence.py
import re
from typing import List, Dict

class SentenceChunker:
    """Split text into sentence-based chunks"""

    def __init__(self, sentences_per_chunk: int = 5, overlap: int = 1):
        self.sentences_per_chunk = sentences_per_chunk
        self.overlap = overlap

    def chunk(self, text: str) -> List[Dict]:
        """Split text into sentence-based chunks"""
        # Split into sentences
        sentences = re.split(r'(?<=[.!?])\s+', text)

        chunks = []
        chunk_index = 0
        i = 0
        while i < len(sentences):
            # Get sentences for this chunk
            end = min(i + self.sentences_per_chunk, len(sentences))
            chunk_sentences = sentences[i:end]
            chunk_text = ' '.join(chunk_sentences)

            # Note: find() returns the first occurrence, so offsets can be
            # wrong if an identical sentence appears earlier in the document
            start_char = text.find(chunk_sentences[0])
            end_char = start_char + len(chunk_text)

            chunks.append({
                'text': chunk_text,
                'index': chunk_index,
                'start': start_char,
                'end': end_char,
                'size': len(chunk_text),
                'sentence_count': len(chunk_sentences)
            })
            chunk_index += 1
            i += self.sentences_per_chunk - self.overlap

        return chunks

# Pros: Preserves sentence boundaries, better context
# Cons: Variable chunk sizes, may be too short/long
3. Semantic Chunking (Recommended)
# chunking/semantic.py
from typing import List, Dict
import re
import numpy as np

class SemanticChunker:
    """Split text into semantically coherent chunks"""

    def __init__(self, embedder, max_chunk_size: int = 1500, threshold: float = 0.7):
        self.embedder = embedder
        self.max_chunk_size = max_chunk_size
        self.threshold = threshold

    async def chunk(self, text: str) -> List[Dict]:
        """Split text into semantic chunks"""
        # 1. Split into sentences
        sentences = self._split_sentences(text)

        # 2. Generate embeddings for each sentence
        sentence_embeddings = await self.embedder.embed_documents(sentences)

        # 3. Calculate similarities between consecutive sentences
        similarities = [
            self._cosine_similarity(sentence_embeddings[i], sentence_embeddings[i+1])
            for i in range(len(sentence_embeddings) - 1)
        ]

        # 4. Identify chunk boundaries (where similarity drops below threshold)
        boundaries = [0]
        for i, sim in enumerate(similarities):
            if sim < self.threshold:
                boundaries.append(i + 1)
        boundaries.append(len(sentences))

        # 5. Create chunks
        chunks = []
        chunk_index = 0
        for i in range(len(boundaries) - 1):
            start_idx = boundaries[i]
            end_idx = boundaries[i+1]

            # Combine sentences in this segment
            chunk_sentences = sentences[start_idx:end_idx]
            chunk_text = ' '.join(chunk_sentences)

            # Further split if chunk is too long
            if len(chunk_text) > self.max_chunk_size:
                sub_chunks = self._split_long_chunk(chunk_text, self.max_chunk_size)
                for sub_chunk in sub_chunks:
                    chunks.append({
                        'text': sub_chunk,
                        'index': chunk_index,
                        'type': 'semantic'
                    })
                    chunk_index += 1
            else:
                chunks.append({
                    'text': chunk_text,
                    'index': chunk_index,
                    'sentence_count': len(chunk_sentences),
                    'type': 'semantic'
                })
                chunk_index += 1

        return chunks

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences"""
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

    def _cosine_similarity(self, vec1: np.ndarray, vec2: np.ndarray) -> float:
        """Calculate cosine similarity"""
        return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

    def _split_long_chunk(self, text: str, max_size: int) -> List[str]:
        """Split long chunk into smaller pieces"""
        # Fallback to fixed-size splitting
        chunks = []
        start = 0
        while start < len(text):
            end = start + max_size
            chunks.append(text[start:end])
            start = end - 200  # Add overlap
        return chunks

# Pros: Semantically coherent, better retrieval
# Cons: Slower (requires embeddings), more complex
4. Hierarchical Chunking
# chunking/hierarchical.py
from typing import Dict, List

class HierarchicalChunker:
    """Create multi-level chunk hierarchy for different use cases"""

    def __init__(self, embedder):
        self.embedder = embedder

    async def chunk(self, text: str, document_id: str) -> Dict[str, List[Dict]]:
        """Create hierarchical chunks"""
        # Level 1: Document-level (for broad queries)
        doc_chunk = {
            'id': f"{document_id}_doc",
            'level': 'document',
            'text': text[:2000],  # Summary/first part
            'metadata': {'type': 'document_summary'}
        }

        # Level 2: Section-level (for medium queries)
        section_chunks = self._chunk_by_sections(text)

        # Level 3: Paragraph-level (for specific queries)
        paragraph_chunks = self._chunk_by_paragraphs(text)

        # Level 4: Sentence-level (for precise queries)
        sentence_chunks = self._chunk_by_sentences(text)

        return {
            'document': [doc_chunk],
            'sections': section_chunks,
            'paragraphs': paragraph_chunks,
            'sentences': sentence_chunks
        }