Wanted to share an approach I've been using for retrieval-augmented generation over large codebases and get feedback from people thinking about similar problems.
**The problem**

Naive codebase RAG typically works by chunking files into text segments and embedding them for similarity search. This breaks down on code because semantic similarity at the chunk level doesn't capture structural relationships — a function in file A calling a type defined in file C won't surface that dependency through embedding proximity alone.
**The approach: AST-derived typed graphs**

Instead of chunking, I parse every file using Tree-sitter into its AST, then extract a typed node/edge graph:
- Nodes: functions, classes, interfaces, types, modules
- Edges: imports, exports, call relationships, inheritance, composition
This gets stored in SQLite as a persistent graph. Parse cost is one-time per project.
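To make the extraction step concrete, here's a minimal sketch using Python's stdlib `ast` module as a stand-in for Tree-sitter (the real pipeline runs language-specific Tree-sitter grammars); the node/edge tuple shapes and the SQLite schema are illustrative, not the exact ones I use:

```python
import ast
import sqlite3

def extract_graph(source: str, path: str):
    """Walk a module's AST and emit typed nodes and edges.

    stdlib ast stands in for Tree-sitter here; node kinds and edge
    kinds mirror the graph described above (functions/classes as
    nodes; imports/calls/inheritance as edges).
    """
    tree = ast.parse(source)
    nodes, edges = [], []
    for item in ast.walk(tree):
        if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
            nodes.append((f"{path}::{item.name}", "function", path))
            # call edges: this function -> every simple name it calls
            for call in ast.walk(item):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    edges.append((f"{path}::{item.name}", call.func.id, "calls"))
        elif isinstance(item, ast.ClassDef):
            nodes.append((f"{path}::{item.name}", "class", path))
            for base in item.bases:
                if isinstance(base, ast.Name):
                    edges.append((f"{path}::{item.name}", base.id, "inherits"))
        elif isinstance(item, (ast.Import, ast.ImportFrom)):
            mod = getattr(item, "module", None) or item.names[0].name
            edges.append((path, mod, "imports"))
    return nodes, edges

def store_graph(db_path, nodes, edges):
    """Persist the graph in SQLite so parsing is one-time per project."""
    con = sqlite3.connect(db_path)
    con.executescript("""
        CREATE TABLE IF NOT EXISTS nodes(id TEXT PRIMARY KEY, kind TEXT, path TEXT);
        CREATE TABLE IF NOT EXISTS edges(src TEXT, dst TEXT, kind TEXT);
    """)
    con.executemany("INSERT OR REPLACE INTO nodes VALUES (?,?,?)", nodes)
    con.executemany("INSERT INTO edges VALUES (?,?,?)", edges)
    con.commit()
    return con
```

Incremental re-indexing (re-parsing only changed files) falls out naturally from keying nodes by file path.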
**Retrieval: BM25 over graph nodes**

At query time, instead of embedding similarity, I run BM25 scoring over node metadata (names, signatures, docstrings, file paths). Top-scoring nodes get passed to the LLM. The graph structure means a retrieved function automatically pulls in its direct dependencies via edge traversal.
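The scoring-plus-traversal step can be sketched in a few dozen lines; this is a from-scratch BM25 (standard Okapi formula) over node metadata strings plus a one-hop dependency expansion — the identifier-aware tokenizer and the helper names are illustrative:

```python
import math
import re

def tokenize(text):
    # Split on non-alphanumerics and camelCase boundaries, so
    # "parseConfig" matches both "parse" and "config".
    parts = re.split(r"[^A-Za-z0-9]+|(?<=[a-z])(?=[A-Z])", text)
    return [p.lower() for p in parts if p]

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (node_id -> metadata text) against the query with BM25."""
    toks = {d: tokenize(t) for d, t in docs.items()}
    n_docs = len(docs)
    avgdl = sum(len(t) for t in toks.values()) / n_docs
    q_terms = tokenize(query)
    df = {term: sum(term in t for t in toks.values()) for term in q_terms}
    scores = {}
    for d, t in toks.items():
        s = 0.0
        for term in q_terms:
            f = t.count(term)
            if f == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(t) / avgdl))
        scores[d] = s
    return sorted(scores, key=scores.get, reverse=True)

def expand_with_deps(node_ids, edges, hops=1):
    """Pull direct dependencies of retrieved nodes via edge traversal."""
    selected = set(node_ids)
    frontier = set(node_ids)
    for _ in range(hops):
        frontier = {dst for src, dst, _ in edges if src in frontier} - selected
        selected |= frontier
    return selected
```

In practice you'd back this with an inverted index (or SQLite FTS5's built-in `bm25()`) rather than scoring every node per query.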
Empirically this lands at ~5K tokens per query on medium-large codebases that would otherwise require ~100K tokens with naive full-context approaches.
**Hierarchical fallback for complex queries**

For multi-file reasoning tasks:
- A Mermaid diagram of the full graph serves as a persistent architectural map always in context
- BM25 node retrieval handles targeted lookup
- When context fills to 70% of capacity, a fast model summarizes the least-relevant retrieved nodes before the result is passed to the primary model
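The architectural-map piece of the fallback above is cheap to generate from the edge table; here's a minimal sketch (function name and the `max_edges` cap are illustrative) that renders the graph as a Mermaid flowchart to keep pinned in context:

```python
def graph_to_mermaid(edges, max_edges=200):
    """Render a typed edge list as a Mermaid flowchart.

    max_edges caps the diagram so the persistent map stays a small,
    fixed fraction of the context budget.
    """
    def safe(name):
        # Mermaid node ids choke on dots, colons, and slashes.
        return name.replace(".", "_").replace(":", "_").replace("/", "_")

    lines = ["graph TD"]
    for src, dst, kind in edges[:max_edges]:
        lines.append(f"    {safe(src)} -->|{kind}| {safe(dst)}")
    return "\n".join(lines)
```

For large graphs you'd likely want to collapse to module-level edges first; a node-per-function diagram stops being readable (by humans or models) well before the edge cap bites.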
**Why BM25 over embeddings here**

Code identifiers (function names, type names, module paths) are highly distinctive lexically. BM25 outperforms embedding similarity on exact and near-exact identifier matching, which is the dominant retrieval pattern in code queries. Embeddings would likely help more for natural-language docstring queries — I haven't benchmarked that comparison rigorously yet.
**Open questions**

I'm still thinking about:
- Better edge-weighting strategies for the graph — currently all edges are unweighted
- Whether re-ranking with a cross-encoder would meaningfully improve precision over BM25 alone
- Handling dynamic languages where call graphs can't be fully resolved statically
Has anyone tackled codebase-scale RAG differently? Particularly curious if anyone's compared AST-graph approaches against embedding-based chunk retrieval on real codebases with quantitative benchmarks.