Codebase-scale retrieval using AST-derived graphs + BM25 — reducing LLM context from 100K to 5K tokens [D]

Reddit r/MachineLearning / 5/1/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The post addresses a common failure mode in codebase RAG: simple text chunking and embedding similarity miss cross-file structural dependencies like function-to-type/type-to-file relationships.
  • Instead of chunking, it proposes parsing the entire repository into ASTs (via Tree-sitter) and building a persistent typed node/edge graph representing functions, classes, types, modules, and relationships such as imports, calls, inheritance, and composition.
  • For query-time retrieval, it uses BM25 scoring over graph node metadata (names, signatures, docstrings, and paths) rather than embedding similarity, and it relies on graph traversal so retrieving a node can automatically include its direct dependencies.
  • The author reports an empirical reduction in LLM context from ~100K tokens to ~5K tokens per query on medium-to-large codebases, with a hierarchical fallback that includes a full-graph Mermaid map plus BM25 lookup and optional compression at 70% capacity.
  • The author discusses why BM25 can work well for code identifier matching and raises open questions about edge weighting, cross-encoder re-ranking, and handling dynamic-language call graphs, inviting benchmarking comparisons to embedding-based chunk retrieval.

Wanted to share an approach I've been using for retrieval-augmented generation over large codebases and get feedback from people thinking about similar problems.

The problem

Naive codebase RAG typically works by chunking files into text segments and embedding them for similarity search. This breaks down on code because semantic similarity at the chunk level doesn't capture structural relationships — a function in file A calling a type defined in file C won't surface that dependency through embedding proximity alone.

The approach: AST-derived typed graphs

Instead of chunking, I parse every file using Tree-sitter into its AST, then extract a typed node/edge graph:

  • Nodes: functions, classes, interfaces, types, modules
  • Edges: imports, exports, call relationships, inheritance, composition

This gets stored in SQLite as a persistent graph. Parse cost is one-time per project.
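A minimal sketch of the extract-and-store step. The post parses every language via Tree-sitter; here Python's stdlib `ast` module stands in for it, and the SQLite schema, node kinds, and the tiny example source are illustrative assumptions, not the author's actual design.

```python
# Sketch: AST -> typed node/edge graph persisted in SQLite.
# Stdlib `ast` stands in for Tree-sitter; schema and kinds are illustrative.
import ast
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS nodes (
    id INTEGER PRIMARY KEY,
    kind TEXT,        -- 'function' | 'class' | ...
    name TEXT,
    signature TEXT,
    docstring TEXT,
    path TEXT
);
CREATE TABLE IF NOT EXISTS edges (
    src INTEGER, dst INTEGER,
    kind TEXT         -- 'imports' | 'calls' | 'inherits' | ...
);
"""

def index_file(conn, path, source):
    """Parse one file; insert its functions/classes as nodes, inheritance as edges."""
    tree = ast.parse(source)
    cur = conn.cursor()
    node_ids = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            kind = "function" if isinstance(node, ast.FunctionDef) else "class"
            cur.execute(
                "INSERT INTO nodes (kind, name, signature, docstring, path) "
                "VALUES (?, ?, ?, ?, ?)",
                (kind, node.name, ast.unparse(node).splitlines()[0],
                 ast.get_docstring(node) or "", path),
            )
            node_ids[node.name] = cur.lastrowid
    # Inheritance edges: class bases that resolve to a node in this file.
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name) and base.id in node_ids:
                    cur.execute(
                        "INSERT INTO edges (src, dst, kind) VALUES (?, ?, 'inherits')",
                        (node_ids[node.name], node_ids[base.id]),
                    )
    conn.commit()
    return node_ids

# Tiny in-memory example (hypothetical source text):
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
ids = index_file(conn, "models.py",
    "class Base:\n    pass\n\nclass Child(Base):\n    pass\n\ndef helper(x):\n    return x\n")
```

A real Tree-sitter version would also resolve imports and call edges across files, which this single-file sketch omits.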

Retrieval: BM25 over graph nodes

At query time, instead of embedding similarity, I run BM25 scoring over node metadata (names, signatures, docstrings, file paths). Top-scoring nodes get passed to the LLM. The graph structure means a retrieved function automatically pulls in its direct dependencies via edge traversal.
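A dependency-free sketch of the BM25 scoring step over node metadata strings. This is a from-scratch Okapi BM25 with the usual k1/b defaults (a library such as rank_bm25 would serve equally well); the document strings below are hypothetical node metadata, not taken from the post.

```python
# Minimal Okapi BM25 over node metadata (name + path + docstring per node).
import math
import re
from collections import Counter

def tokenize(text):
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.df = Counter(t for d in self.docs for t in set(d))  # document frequency

    def score(self, query, i):
        doc, dl = Counter(self.docs[i]), len(self.docs[i])
        s = 0.0
        for t in tokenize(query):
            if t not in doc:
                continue
            idf = math.log(1 + (self.N - self.df[t] + 0.5) / (self.df[t] + 0.5))
            tf = doc[t]
            s += idf * tf * (self.k1 + 1) / (
                tf + self.k1 * (1 - self.b + self.b * dl / self.avgdl))
        return s

    def top(self, query, n=5):
        return sorted(range(self.N), key=lambda i: self.score(query, i),
                      reverse=True)[:n]

# Hypothetical node metadata strings:
docs = [
    "parse_config src/config.py Load YAML configuration",
    "UserRepository src/db/users.py fetch_user save_user",
    "render_template src/views.py HTML rendering helpers",
]
bm = BM25(docs)
```

Querying `bm.top("fetch_user", n=1)` ranks the repository node first because the identifier tokens match exactly, which is the behavior the post relies on.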

Empirically this lands at ~5K tokens per query on medium-large codebases that would otherwise require ~100K tokens with naive full-context approaches.
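The edge-traversal step mentioned earlier — a retrieved node automatically pulling in its direct dependencies — can be sketched as a one-hop expansion over an adjacency map. The node names and edges here are hypothetical; the post stores edges in SQLite rather than a dict.

```python
# One-hop dependency expansion: seeds picked by BM25, plus their direct deps.
def expand_with_dependencies(seeds, edges):
    """Return seeds plus every node reachable in one hop along outgoing edges."""
    result, seen = list(seeds), set(seeds)
    for node in seeds:
        for dep in edges.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                result.append(dep)
    return result

# Hypothetical edge set: a handler calls a repository defined in another file.
edges = {
    "handle_request": ["UserRepository", "render_template"],
    "UserRepository": ["User"],  # the type it returns, defined elsewhere
}
context_nodes = expand_with_dependencies(["handle_request"], edges)
```

Whether to expand one hop or follow transitive edges (and how far) is exactly the kind of budget trade-off the ~5K-token figure depends on.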

Hierarchical fallback for complex queries

For multi-file reasoning tasks:

  1. A Mermaid diagram of the full graph serves as a persistent architectural map always in context
  2. BM25 node retrieval handles targeted lookup
  3. At 70% context capacity, a fast model compresses least-relevant nodes before passing to the primary model
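One way the 70%-capacity trigger in step 3 could be wired up, as a hedged sketch: assemble retrieved nodes most-relevant-first, and once the running total crosses 70% of the budget, route remaining nodes through a compression step. `compress` stands in for the fast model; the 4x compression ratio and token counts are assumptions for illustration.

```python
# Sketch of the 70%-capacity fallback: least-relevant nodes get compressed.
def assemble_context(nodes, budget_tokens, compress, threshold=0.70):
    """nodes: list of (text, token_count), sorted most-relevant first."""
    parts, used = [], 0
    for text, n_tokens in nodes:
        if used + n_tokens > budget_tokens * threshold:
            text = compress(text)             # fast model shrinks the node
            n_tokens = max(1, n_tokens // 4)  # assumed ~4x compression ratio
        if used + n_tokens > budget_tokens:
            break                             # hard budget: drop the rest
        parts.append(text)
        used += n_tokens
    return "\n\n".join(parts), used

# Hypothetical retrieved nodes with assumed token counts:
nodes = [
    ("def handle_request(req): ...", 50),
    ("class UserRepository: ...", 30),
    ("def render_template(name): ...", 40),
]
ctx, used = assemble_context(nodes, budget_tokens=100,
                             compress=lambda t: "[summary] " + t)
```

In this toy run the first node fits uncompressed and the two lower-ranked nodes are compressed, keeping the total under budget.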

Why BM25 over embeddings here

Code identifiers (function names, type names, module paths) are highly distinctive lexically. BM25 outperforms embedding similarity on exact and near-exact identifier matching, which is the dominant retrieval pattern in code queries. Embeddings would likely help more for natural-language docstring queries — I haven't benchmarked that comparison rigorously yet.
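The lexical-distinctiveness argument hinges on tokenization: splitting identifiers on snake_case and camelCase boundaries makes BM25 terms line up with the exact names a developer types. A small sketch (the identifiers are hypothetical, and the post doesn't specify its tokenizer):

```python
# Identifier splitting so BM25 terms match how developers phrase code queries.
import re

def split_identifier(name):
    """'getUserById' -> ['get', 'user', 'by', 'id']."""
    name = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name)      # camelCase boundary
    name = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", name)   # HTTPServer -> HTTP Server
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", name) if t]
```

With this in place, a query like "user by id" hits `getUserById` lexically with no embedding model involved, which is the effect the paragraph above describes.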

Open questions I'm still thinking about:

  • Better edge-weighting strategies for the graph — currently all edges are unweighted
  • Whether re-ranking with a cross-encoder would meaningfully improve precision over BM25 alone
  • Handling dynamic languages where call graphs can't be fully resolved statically

Has anyone tackled codebase-scale RAG differently? Particularly curious if anyone's compared AST-graph approaches against embedding-based chunk retrieval on real codebases with quantitative benchmarks.

submitted by /u/Altruistic_Night_327