MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

arXiv cs.LG / 4/28/2026


Key Points

  • The paper introduces MTServe to reduce the high inference cost of generative recommendation by avoiding repeated encoding of long user histories through cross-request KV cache reuse.
  • It addresses the “storage explosion” problem caused by massive per-user state sizes by virtualizing GPU memory and using host RAM as a scalable backup tier.
  • MTServe bridges the I/O gap between the GPU and host tiers with a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven cache replacement policy.
  • Experiments on both public and production datasets show up to 3.1× speedup while preserving very high KV cache hit ratios above 98.5%.
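The hierarchical design in the points above, a bounded GPU tier that spills evicted per-user KV states to a larger host-RAM tier instead of discarding them, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: all names (`TwoTierKVCache`, `gpu_capacity`, and so on) are invented here, and the paper's locality-driven replacement policy is approximated with plain LRU ordering.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Illustrative two-tier KV cache: a small "GPU" tier backed by a larger
    "host" tier. Evictions from the GPU tier spill to host rather than being
    discarded, so a later request for the same user avoids re-encoding its
    full history and only pays a host-to-GPU copy."""

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.gpu = OrderedDict()   # user_id -> KV state, LRU order
        self.host = OrderedDict()  # spilled states, also LRU order
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def get(self, user_id):
        if user_id in self.gpu:
            self.gpu.move_to_end(user_id)        # refresh recency
            return self.gpu[user_id], "gpu_hit"
        if user_id in self.host:
            state = self.host.pop(user_id)       # promote host -> GPU
            self._put_gpu(user_id, state)
            return state, "host_hit"
        return None, "miss"                      # must re-encode the history

    def put(self, user_id, state):
        self._put_gpu(user_id, state)

    def _put_gpu(self, user_id, state):
        self.gpu[user_id] = state
        self.gpu.move_to_end(user_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, v_state = self.gpu.popitem(last=False)  # evict LRU entry
            self.host[victim] = v_state                     # spill, don't drop
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)               # host tier is finite too
```

A miss is the expensive path (a full re-encode of the user history), so the point of the host tier is to turn most would-be misses into cheap `host_hit` promotions, which is what the paper's >98.5% hit ratios correspond to.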

Abstract

Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1× speedup while maintaining near-perfect hit ratios (>98.5%).
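The asynchronous data transfer pipeline the abstract mentions can be sketched as a producer-consumer pattern: a background worker stages spilled KV states from the host tier toward the GPU while the serving thread continues with other work, overlapping transfer latency with compute. The sketch below is hypothetical and uses a plain dict write as a stand-in for an actual host-to-device copy; none of these names come from the paper.

```python
import queue
import threading

def transfer_worker(requests: "queue.Queue", staged: dict):
    """Background worker: drain prefetch requests and stage each user's
    KV state (here a placeholder string standing in for a host-to-GPU copy)."""
    while True:
        user_id = requests.get()
        if user_id is None:                       # sentinel: shut down
            requests.task_done()
            break
        staged[user_id] = f"kv_state_of_{user_id}"
        requests.task_done()

requests = queue.Queue()
staged = {}
worker = threading.Thread(target=transfer_worker, args=(requests, staged))
worker.start()

# Enqueue transfers ahead of need; in a real server, model compute for
# already-resident requests would run concurrently with these copies.
for uid in ["u1", "u2", "u3"]:
    requests.put(uid)

requests.join()                                   # wait until all copies land
requests.put(None)                                # stop the worker
worker.join()
```

The benefit of this structure is that transfer latency is hidden behind computation for other requests, so the host tier's slower bandwidth does not sit on the critical path of every lookup.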