Codebase-scale retrieval using AST-derived graphs + BM25 — reducing LLM context from 100K to 5K tokens [D]

Reddit r/MachineLearning / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The post addresses a common failure mode in codebase RAG: simple text chunking and embedding similarity miss cross-file structural dependencies like function-to-type/type-to-file relationships.
  • Instead of chunking, it proposes parsing the entire repository into ASTs (via Tree-sitter) and building a persistent typed node/edge graph representing functions, classes, types, modules, and relationships such as imports, calls, inheritance, and composition.
  • For query-time retrieval, it uses BM25 scoring over graph node metadata (names, signatures, docstrings, and paths) rather than embedding similarity, and it relies on graph traversal so retrieving a node can automatically include its direct dependencies.
  • The author reports an empirical reduction in LLM context from ~100K tokens to ~5K tokens per query on medium-to-large codebases, with a hierarchical fallback that includes a full-graph Mermaid map plus BM25 lookup and optional compression at 70% capacity.
  • The author discusses why BM25 can work well for code identifier matching and raises open questions about edge weighting, cross-encoder re-ranking, and handling dynamic-language call graphs, inviting benchmarking comparisons to embedding-based chunk retrieval.

Wanted to share an approach I've been using for retrieval-augmented generation over large codebases and get feedback from people thinking about similar problems.

The problem

Naive codebase RAG typically works by chunking files into text segments and embedding them for similarity search. This breaks down on code because semantic similarity at the chunk level doesn't capture structural relationships — a function in file A calling a type defined in file C won't surface that dependency through embedding proximity alone.

The approach: AST-derived typed graphs

Instead of chunking, I parse every file using Tree-sitter into its AST, then extract a typed node/edge graph:

  • Nodes: functions, classes, interfaces, types, modules
  • Edges: imports, exports, call relationships, inheritance, composition

This gets stored in SQLite as a persistent graph. Parse cost is one-time per project.
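A minimal sketch of the extract-and-store step. The post parses every language via Tree-sitter; here Python's stdlib `ast` module stands in for it, and the SQLite schema, node kinds, and the tiny example source are illustrative assumptions, not the author's actual design.

```python
# Sketch: AST -> typed node/edge graph persisted in SQLite.
# Stdlib `ast` stands in for Tree-sitter; schema and kinds are illustrative.
import ast
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (
    id INTEGER PRIMARY KEY,
    kind TEXT,        -- 'function' | 'class' | ...
    name TEXT,
    signature TEXT,
    docstring TEXT,
    path TEXT
);
CREATE TABLE IF NOT EXISTS edges (
    src INTEGER, dst INTEGER,
    kind TEXT         -- 'imports' | 'calls' | 'inherits' | ...
);
"""

def index_file(conn, path, source):
    """Parse one file; insert its functions/classes as nodes, inheritance as edges."""
    tree = ast.parse(source)
    cur = conn.cursor()
    node_ids = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            kind = "function" if isinstance(node, ast.FunctionDef) else "class"
            cur.execute(
                "INSERT INTO nodes (kind, name, signature, docstring, path) "
                "VALUES (?, ?, ?, ?, ?)",
                (kind, node.name, ast.unparse(node).splitlines()[0],
                 ast.get_docstring(node) or "", path),
            )
            node_ids[node.name] = cur.lastrowid
    # Inheritance edges: class bases that resolve to a node in this file.
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name) and base.id in node_ids:
                    cur.execute(
                        "INSERT INTO edges (src, dst, kind) VALUES (?, ?, 'inherits')",
                        (node_ids[node.name], node_ids[base.id]),
                    )
    conn.commit()
    return node_ids

# Tiny in-memory example (hypothetical source text):
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
ids = index_file(conn, "models.py",
    "class Base:\n    pass\n\nclass Child(Base):\n    pass\n\ndef helper(x):\n    return x\n")
```

A real Tree-sitter version would also resolve imports and call edges across files, which this single-file sketch omits.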

Retrieval: BM25 over graph nodes

At query time, instead of embedding similarity, I run BM25 scoring over node metadata (names, signatures, docstrings, file paths). Top-scoring nodes get passed to the LLM. The graph structure means a retrieved function automatically pulls in its direct dependencies via edge traversal.
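A dependency-free sketch of the BM25 scoring step over node metadata strings. This is a from-scratch Okapi BM25 with the usual k1/b defaults (a library such as rank_bm25 would serve equally well); the document strings below are hypothetical node metadata, not taken from the post.

```python
# Minimal Okapi BM25 over node metadata (name + path + docstring per node).
import math
import re
from collections import Counter

def tokenize(text):
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.df = Counter(t for d in self.docs for t in set(d))  # document frequency

    def score(self, query, i):
        doc, dl = Counter(self.docs[i]), len(self.docs[i])
        s = 0.0
        for t in tokenize(query):
            if t not in doc:
                continue
            idf = math.log(1 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
            tf = doc[t]
            s += idf * tf * (self.k1 + 1) / (
                tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def top(self, query, n=5):
        return sorted(range(self.N), key=lambda i: self.score(query, i),
                      reverse=True)[:n]

# Hypothetical node metadata strings:
docs = [
    "parse_config src/config.py Load YAML configuration",
    "UserRepository src/db/users.py fetch_user save_user",
    "render_template src/views.py HTML rendering helpers",
]
bm = BM25(docs)
```

Querying `bm.top("fetch_user", n=1)` ranks the repository node first because the identifier tokens match exactly, which is the behavior the post relies on.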

Empirically this lands at ~5K tokens per query on medium-large codebases that would otherwise require ~100K tokens with naive full-context approaches.
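The edge-traversal step mentioned earlier — a retrieved node automatically pulling in its direct dependencies — can be sketched as a one-hop expansion over an adjacency map. The node names and edges here are hypothetical; the post stores edges in SQLite rather than a dict.

```python
# One-hop dependency expansion: seeds picked by BM25, plus their direct deps.
def expand_with_dependencies(seeds, edges):
    """Return seeds plus every node reachable in one hop along outgoing edges."""
    result, seen = list(seeds), set(seeds)
    for node in seeds:
        for dep in edges.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                result.append(dep)
    return result

# Hypothetical edge set: a handler calls a repository defined in another file.
edges = {
    "handle_request": ["UserRepository", "render_template"],
    "UserRepository": ["User"],  # the type it returns, defined elsewhere
}
context_nodes = expand_with_dependencies(["handle_request"], edges)
```

Whether to expand one hop or follow transitive edges (and how far) is exactly the kind of budget trade-off the ~5K-token figure depends on.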

Hierarchical fallback for complex queries

For multi-file reasoning tasks:

  1. A Mermaid diagram of the full graph serves as a persistent architectural map always in context
  2. BM25 node retrieval handles targeted lookup
  3. At 70% context capacity, a fast model compresses least-relevant nodes before passing to the primary model
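One way the 70%-capacity trigger in step 3 could be wired up, as a hedged sketch: assemble retrieved nodes most-relevant-first, and once the running total crosses 70% of the budget, route remaining nodes through a compression step. `compress` stands in for the fast model; the 4x compression ratio and token counts are assumptions for illustration.

```python
# Sketch of the 70%-capacity fallback: least-relevant nodes get compressed.
def assemble_context(nodes, budget_tokens, compress, threshold=0.70):
    """nodes: list of (text, token_count), sorted most-relevant first."""
    parts, used = [], 0
    for text, n_tokens in nodes:
        if used + n_tokens > budget_tokens * threshold:
            text = compress(text)             # fast model shrinks the node
            n_tokens = max(1, n_tokens // 4)  # assumed ~4x compression ratio
        if used + n_tokens > budget_tokens:
            break                             # hard budget: drop the rest
        parts.append(text)
        used += n_tokens
    return "\n\n".join(parts), used

# Hypothetical retrieved nodes with assumed token counts:
nodes = [
    ("def handle_request(req): ...", 50),
    ("class UserRepository: ...", 30),
    ("def render_template(name): ...", 40),
]
ctx, used = assemble_context(nodes, budget_tokens=100,
                             compress=lambda t: "[summary] " + t)
```

In this toy run the first node fits uncompressed and the two lower-ranked nodes are compressed, keeping the total under budget.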

Why BM25 over embeddings here

Code identifiers (function names, type names, module paths) are highly distinctive lexically. BM25 outperforms embedding similarity on exact and near-exact identifier matching, which is the dominant retrieval pattern in code queries. Embeddings would likely help more for natural-language docstring queries — I haven't benchmarked that comparison rigorously yet.
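The lexical-distinctiveness argument hinges on tokenization: splitting identifiers on snake_case and camelCase boundaries makes BM25 terms line up with the exact names a developer types. A small sketch (the identifiers are hypothetical, and the post doesn't specify its tokenizer):

```python
# Identifier splitting so BM25 terms match how developers phrase code queries.
import re

def split_identifier(name):
    """'getUserById' -> ['get', 'user', 'by', 'id']."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)      # camelCase boundary
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", name)   # HTTPServer -> HTTP Server
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", name) if t]
```

With this in place, a query like "user by id" hits `getUserById` lexically with no embedding model involved, which is the effect the paragraph above describes.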

Open questions I'm still thinking about:

  • Better edge-weighting strategies for the graph — currently all edges are unweighted
  • Whether re-ranking with a cross-encoder would meaningfully improve precision over BM25 alone
  • Handling dynamic languages where call graphs can't be fully resolved statically

Has anyone tackled codebase-scale RAG differently? Particularly curious if anyone's compared AST-graph approaches against embedding-based chunk retrieval on real codebases with quantitative benchmarks.

submitted by /u/Altruistic_Night_327