QCFuse: Query-Centric Cache Fusion for Efficient RAG Inference
arXiv cs.AI · 2026-04-13
Key points
- Cache fusion techniques can speed up RAG-augmented LLM generation by reusing KV cache and selectively recomputing tokens, but prior approaches often lack global awareness of the user query when choosing what to recompute.
- QCFuse is proposed as a query-centric KV cache fusion system that uses semantic summary anchors to build more accurate query representations without incurring prohibitive overhead.
- It selectively recomputes tokens relevant to the user query, prioritizing updates according to the attention distribution of the most critical Transformer layer, which keeps the computation pipeline efficient.
- Experiments on real-world datasets show about a 40% improvement in response efficiency while maintaining equivalent accuracy versus existing methods.
- In some cases, QCFuse also provides an attention denoising effect that can further improve response accuracy, suggesting additional inference optimization potential.
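
The selective-recomputation idea above can be sketched roughly as follows. This is a minimal illustration, not the paper's actual algorithm: the function names, the multiplicative scoring heuristic, and the flat KV layout are all assumptions introduced here for clarity.

```python
import numpy as np

def select_recompute_tokens(critical_layer_attn, query_similarity, ratio=0.15):
    """Rank cached tokens by a blend of (a) the attention mass they receive
    in a chosen "critical" layer and (b) their semantic similarity to the
    query anchor, then pick the top `ratio` fraction for recomputation.

    critical_layer_attn: (num_tokens,) attention mass per cached token
    query_similarity:    (num_tokens,) similarity of each token to the query
    """
    # Query-centric weighting: a token must matter to both the model's
    # attention and the user's query to be worth recomputing.
    score = critical_layer_attn * query_similarity
    k = max(1, int(len(score) * ratio))
    return np.argsort(score)[::-1][:k]  # indices of tokens to recompute

def fuse_kv_cache(cached_kv, recomputed_kv, recompute_idx):
    """Overwrite only the selected positions with freshly recomputed
    keys/values; every other position reuses the precomputed chunk cache."""
    fused = cached_kv.copy()
    fused[recompute_idx] = recomputed_kv
    return fused

# Illustrative usage with random stand-in data
rng = np.random.default_rng(0)
attn = rng.random(100)            # attention mass from the critical layer
sim = rng.random(100)             # token-to-query similarity scores
idx = select_recompute_tokens(attn, sim, ratio=0.15)
fused = fuse_kv_cache(np.zeros((100, 4)), np.ones((len(idx), 4)), idx)
```

Only 15% of the cached positions are refreshed here; the rest are reused verbatim, which is where the reported efficiency gain would come from under this kind of scheme.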