MTServe: Efficient Serving for Generative Recommendation Models with Hierarchical Caches

arXiv cs.LG / 4/28/2026


Key Points

  • The paper introduces MTServe to reduce the high inference cost of generative recommendation by avoiding repeated encoding of long user histories through cross-request KV cache reuse.
  • It addresses the “storage explosion” problem caused by massive per-user state sizes by virtualizing GPU memory and using host RAM as a scalable backup tier.
  • MTServe bridges the I/O gap between the GPU and host tiers with a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven cache replacement policy.
  • Experiments on both public and production datasets show up to 3.1× speedup while preserving very high KV cache hit ratios above 98.5%.
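The hierarchical design in the points above, a bounded GPU tier that spills evicted per-user KV states to a larger host-RAM tier instead of discarding them, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: all names (`TwoTierKVCache`, `gpu_capacity`, and so on) are invented here, and the paper's locality-driven replacement policy is approximated with plain LRU ordering.

```python
from collections import OrderedDict

class TwoTierKVCache:
    """Illustrative two-tier KV cache: a small "GPU" tier backed by a larger
    "host" tier. Evictions from the GPU tier spill to host rather than being
    discarded, so a later request for the same user avoids re-encoding its
    full history and only pays a host-to-GPU copy."""

    def __init__(self, gpu_capacity: int, host_capacity: int):
        self.gpu = OrderedDict()   # user_id -> KV state, LRU order
        self.host = OrderedDict()  # spilled states, also LRU order
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def get(self, user_id):
        if user_id in self.gpu:
            self.gpu.move_to_end(user_id)        # refresh recency
            return self.gpu[user_id], "gpu_hit"
        if user_id in self.host:
            state = self.host.pop(user_id)       # promote host -> GPU
            self._put_gpu(user_id, state)
            return state, "host_hit"
        return None, "miss"                      # must re-encode the history

    def put(self, user_id, state):
        self._put_gpu(user_id, state)

    def _put_gpu(self, user_id, state):
        self.gpu[user_id] = state
        self.gpu.move_to_end(user_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, v_state = self.gpu.popitem(last=False)  # evict LRU entry
            self.host[victim] = v_state                     # spill, don't drop
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)               # host tier is finite too
```

A miss is the expensive path (a full re-encode of the user history), so the point of the host tier is to turn most would-be misses into cheap `host_hit` promotions, which is what the paper's >98.5% hit ratios correspond to.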

Abstract

Generative recommendation (GR) offers superior modeling capabilities but suffers from prohibitive inference costs due to the repeated encoding of long user histories. While cross-request Key-Value (KV) cache reuse presents a significant optimization opportunity, the massive scale of individual user states creates a storage explosion that far exceeds physical GPU limits. We propose MTServe, a hierarchical cache management system that virtualizes GPU memory by leveraging host RAM as a scalable backup store. To bridge the I/O gap between tiers, MTServe introduces a suite of system-level optimizations, including a hybrid storage layout, an asynchronous data transfer pipeline, and a locality-driven replacement policy. On both public and production datasets, MTServe delivers up to 3.1× speedup while maintaining near-perfect hit ratios (>98.5%).
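The asynchronous data transfer pipeline the abstract mentions can be sketched as a producer-consumer pattern: a background worker stages spilled KV states from the host tier toward the GPU while the serving thread continues with other work, overlapping transfer latency with compute. The sketch below is hypothetical and uses a plain dict write as a stand-in for an actual host-to-device copy; none of these names come from the paper.

```python
import queue
import threading

def transfer_worker(requests: "queue.Queue", staged: dict):
    """Background worker: drain prefetch requests and stage each user's
    KV state (here a placeholder string standing in for a host-to-GPU copy)."""
    while True:
        user_id = requests.get()
        if user_id is None:                       # sentinel: shut down
            requests.task_done()
            break
        staged[user_id] = f"kv_state_of_{user_id}"
        requests.task_done()

requests = queue.Queue()
staged = {}
worker = threading.Thread(target=transfer_worker, args=(requests, staged))
worker.start()

# Enqueue transfers ahead of need; in a real server, model compute for
# already-resident requests would run concurrently with these copies.
for uid in ["u1", "u2", "u3"]:
    requests.put(uid)

requests.join()                                   # wait until all copies land
requests.put(None)                                # stop the worker
worker.join()
```

The benefit of this structure is that transfer latency is hidden behind computation for other requests, so the host tier's slower bandwidth does not sit on the critical path of every lookup.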