
LongFlow: Efficient KV Cache Compression for Reasoning Models

arXiv cs.LG / 3/13/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • LongFlow introduces a KV cache compression method to reduce memory consumption and bandwidth pressure during attention in long-output reasoning models.
  • It derives an efficient importance estimation metric from the current query, achieving negligible overhead and requiring no extra auxiliary storage.
  • A custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator further boosts system-level efficiency.
  • Experiments report up to 11.8x throughput improvement with about 80% KV cache compression and minimal impact on model accuracy.
  • The work targets the limitations of prior KV cache optimizations, which were designed for long-input/short-output scenarios and are less effective for long-output reasoning.
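To make the core idea in the key points concrete, here is a minimal sketch of query-based importance estimation and token eviction. This is not the paper's implementation: the function name, shapes, and the choice to rank tokens by their softmax attention weight under the current query are illustrative assumptions.

```python
# Illustrative sketch (not LongFlow's actual kernel): score each cached
# token by its attention weight against the current query, then evict
# the lowest-scoring tokens from the KV cache.
import numpy as np

def evict_low_importance(k_cache, v_cache, query, keep_ratio=0.2):
    """Keep the top `keep_ratio` fraction of cached tokens, ranked by
    softmax attention weight under the current query vector."""
    d = query.shape[-1]
    scores = (k_cache @ query) / np.sqrt(d)        # (seq_len,) logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax importance
    n_keep = max(1, int(len(weights) * keep_ratio))
    keep = np.sort(np.argsort(weights)[-n_keep:])  # preserve token order
    return k_cache[keep], v_cache[keep]

rng = np.random.default_rng(0)
k = rng.standard_normal((100, 64))
v = rng.standard_normal((100, 64))
q = rng.standard_normal(64)
k_small, v_small = evict_low_importance(k, v, q, keep_ratio=0.2)
print(k_small.shape, v_small.shape)  # (20, 64) (20, 64)
```

Because the attention logits `k_cache @ query` are already computed inside attention, reusing them for scoring, as the paper describes, adds essentially no overhead; the fused kernel goes further by folding this ranking and eviction into the FlashAttention pass itself.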

Abstract

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences, leading to significantly increased deployment costs. In particular, long outputs require large KV caches, resulting in high memory consumption and severe bandwidth pressure during attention computation. Most existing KV cache optimization methods are designed for long-input, short-output scenarios and are ineffective for the long-output setting of reasoning models. Moreover, importance estimation in prior work is computationally expensive and becomes prohibitive when continuous re-evaluation is required during long generation. To address these challenges, we propose LongFlow, a KV cache compression method with an efficient importance estimation metric derived from an intermediate result of attention computation using only the current query. This design introduces negligible computational overhead and requires no auxiliary storage. We further develop a custom kernel that fuses FlashAttention, importance estimation, and token eviction into a single optimized operator, improving system-level efficiency. Experiments show that LongFlow achieves up to an 11.8 times throughput improvement with 80% KV cache compression and minimal impact on model accuracy.
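To gauge why long-output KV caches dominate memory, a back-of-the-envelope sizing helps. The model dimensions below (layers, KV heads, head size, precision, output length) are hypothetical and not taken from the paper; only the 80% compression ratio comes from the abstract.

```python
# Back-of-the-envelope KV cache sizing for a hypothetical model config;
# the paper does not specify these numbers, they are illustrative only.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                       # fp16
# Each token stores one K and one V vector per layer per KV head.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
tokens = 32_768                          # a long reasoning trace
full_gib = per_token * tokens / 2**30
compressed_gib = full_gib * (1 - 0.80)   # 80% of cache entries evicted
print(f"{full_gib:.1f} GiB -> {compressed_gib:.1f} GiB per sequence")
# prints "4.0 GiB -> 0.8 GiB per sequence"
```

Under these assumptions a single 32k-token reasoning trace occupies 4 GiB of KV cache; an 80% compression ratio cuts that to 0.8 GiB, which both fits more concurrent sequences in memory and shrinks the per-step bandwidth of attention, the two effects behind the reported throughput gains.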