MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

arXiv cs.CL, March 30, 2026


Key Points

  • MemBoost is introduced as a memory-boosted LLM serving framework aimed at reducing inference costs in real-world deployments where users issue repeated or near-duplicate queries.
  • The framework reuses previously generated answers and retrieves relevant supporting information so that a lightweight model can respond cheaply, reserving stronger models for uncertain or difficult cases via cost-aware routing.
  • Unlike conventional retrieval-augmented generation, MemBoost is tailored for interactive settings by emphasizing answer reuse, continual memory growth, and incremental escalation.
  • Experiments on multiple models under simulated workloads indicate substantial reductions in expensive large-model calls and overall inference cost while keeping answer quality close to a strong-model baseline.

Abstract

Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining answer quality comparable to the strong-model baseline.
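The serving loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding function, similarity thresholds, confidence signal, and model interfaces (`small_model`, `large_model`) are all assumptions introduced here for clarity. A real deployment would use a learned sentence embedder, a vector index for retrieval, and a calibrated uncertainty estimate.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real system would use a sentence embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class MemBoostRouter:
    """Hypothetical sketch of a MemBoost-style loop: reuse cached answers for
    near-duplicate queries, answer cheaply with a small model when it is
    confident, and escalate otherwise. Names and thresholds are illustrative."""

    def __init__(self, small_model, large_model,
                 reuse_threshold=0.9, confidence_threshold=0.7):
        self.small_model = small_model        # callable: (query, context) -> (answer, confidence)
        self.large_model = large_model        # callable: (query) -> answer
        self.reuse_threshold = reuse_threshold
        self.confidence_threshold = confidence_threshold
        self.memory = []                      # list of (embedding, query, answer)

    def answer(self, query):
        q = embed(query)
        # 1. Answer reuse: serve near-duplicate queries straight from memory.
        best = max(self.memory, key=lambda m: cosine(q, m[0]), default=None)
        if best and cosine(q, best[0]) >= self.reuse_threshold:
            return best[2], "memory"
        # 2. Cheap path: small model, grounded on the closest memory entry if any.
        context = best[2] if best else None
        ans, conf = self.small_model(query, context)
        route = "small"
        # 3. Cost-aware escalation: low-confidence queries go to the strong model.
        if conf < self.confidence_threshold:
            ans = self.large_model(query)
            route = "large"
        # 4. Continual memory growth: store the new answer for future reuse.
        self.memory.append((q, query, ans))
        return ans, route
```

With stub models, a repeated query is served from memory at zero model cost, an easy query stays on the small model, and a low-confidence query escalates to the large model, mirroring the three paths the framework describes.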