ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval

arXiv cs.LG / 4/15/2026


Key Points

  • The paper proposes ZoomR, an approach to reduce LLM memory use during long-form reasoning by adaptively compressing verbose intermediate thoughts into summaries.
  • ZoomR introduces a dynamic key-value (KV) cache selection policy that performs hierarchical retrieval: it first uses summary keys as coarse indices during decoding, then “zooms in” to retrieve fine-grained details only when needed.
  • Because it avoids full-cache attention at every decoding step, ZoomR targets the main bottleneck of long-form generation: KV cache size grows with output length.
  • Experiments on math and reasoning benchmarks show competitive performance versus baselines while cutting inference memory requirements by more than 4×.
  • The results suggest that multi-granularity KV selection can make autoregressive decoding more scalable for tasks requiring long outputs.
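The two-stage selection described above can be sketched as a coarse-then-fine lookup. The sketch below is illustrative only: the function name, `top_k` parameter, and plain dot-product scoring are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def zoom_select(query, summary_keys, detail_keys, detail_values, top_k=2):
    """Hedged sketch of multi-granularity KV selection.

    summary_keys: (num_thoughts, d) -- one coarse key per compressed thought
    detail_keys, detail_values: lists of (len_i, d) arrays, one per thought
    """
    # Stage 1 ("coarse"): score each thought's summary key against the query.
    coarse_scores = summary_keys @ query            # shape (num_thoughts,)
    # Keep only the top-k most relevant thoughts -- the "zoom in" step.
    selected = np.argsort(coarse_scores)[-top_k:]
    # Stage 2 ("fine"): gather the fine-grained KV entries of the selected
    # thoughts and attend over this reduced cache instead of the full one.
    keys = np.concatenate([detail_keys[i] for i in selected])
    values = np.concatenate([detail_values[i] for i in selected])
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values                         # attention output, shape (d,)
```

Because attention at each step touches only the summary keys plus the details of the few selected thoughts, the per-step memory cost depends on `top_k` rather than the full output length, which is the source of the reported savings.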

Abstract

Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focuses on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically "zooming in" on fine-grained details. By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than 4×. These results demonstrate that multi-granularity KV selection enables more memory-efficient decoding, especially for long output generation.