MemoSight: Unifying Context Compression and Multi-Token Prediction for Reasoning Acceleration

arXiv cs.AI / 4/17/2026

📰 News · Models & Research

Key Points

  • The paper addresses a key scaling bottleneck of chain-of-thought (CoT) reasoning in LLMs: the KV cache grows linearly with the number of generated tokens, driving up both latency and memory costs.
  • It proposes MemoSight, a unified framework that combines context compression with multi-token prediction to preserve CoT performance while improving efficiency.
  • MemoSight uses a minimalist design that applies the same general mechanism (special tokens and token-type-specific position layouts) to both context compression and multi-token prediction.
  • Experiments on four reasoning benchmarks show up to a 66% reduction in KV cache footprint and up to 1.56× faster inference, outperforming existing CoT compression approaches.
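The cache-footprint arithmetic behind the headline number can be sketched in a few lines. This is a toy illustration, not the authors' code: the segment length and memory-slot count below are hypothetical parameters chosen so the savings land near the reported 66%; the actual compression mechanism in MemoSight uses learned special tokens with a dedicated position layout.

```python
# Toy model (hypothetical parameters, not from the paper): replace each full
# segment of CoT tokens with a fixed number of special "memory" slots and
# measure the resulting KV-cache savings.

def compressed_cache_size(num_tokens: int, segment_len: int, mem_slots: int) -> int:
    """KV entries kept if every full segment of `segment_len` reasoning tokens
    is compressed into `mem_slots` memory-token entries; the trailing partial
    segment stays uncompressed."""
    full_segments, remainder = divmod(num_tokens, segment_len)
    return full_segments * mem_slots + remainder

baseline = 3000                                     # uncompressed: one KV entry per generated token
compressed = compressed_cache_size(3000, 100, 34)   # 30 segments x 34 slots = 1020 entries
reduction = 1 - compressed / baseline
print(f"{compressed} entries, {reduction:.0%} smaller")  # -> 1020 entries, 66% smaller
```

With these illustrative numbers, a 3,000-token reasoning trace keeps only 1,020 KV entries, matching the up-to-66% reduction reported on the four benchmarks.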

Abstract

While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning problems, the KV cache grows linearly with the number of generated tokens, so CoT reasoning faces scaling issues in both speed and memory usage. In this work, we propose MemoSight (Memory-Foresight-based reasoning), a unified framework that integrates both context compression and multi-token prediction to mitigate these efficiency issues while maintaining CoT reasoning performance. Our framework adopts the same minimalist design for both context compression and multi-token prediction via special tokens and a position layout tailored to each token type. Comprehensive experiments on four reasoning benchmarks demonstrate that MemoSight reduces the KV cache footprint by up to 66% and accelerates inference by up to 1.56×, while outperforming existing CoT compression methods.
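The multi-token-prediction half of the speedup can also be sketched with simple decode-step accounting. A minimal sketch under stated assumptions: emitting k tokens per forward pass (here k = 2, a hypothetical setting) gives an idealized k× reduction in passes, while the paper's measured end-to-end gain of 1.56× reflects real overheads that this toy count ignores.

```python
import math

def decode_passes(total_tokens: int, tokens_per_pass: int) -> int:
    """Forward passes needed to emit `total_tokens` if each pass yields
    `tokens_per_pass` tokens (idealized: no verification or rollback cost)."""
    return math.ceil(total_tokens / tokens_per_pass)

baseline = decode_passes(1000, 1)  # standard next-token decoding: 1000 passes
mtp = decode_passes(1000, 2)       # two tokens per pass: 500 passes
print(baseline / mtp)              # idealized 2.0x upper bound; MemoSight reports 1.56x measured
```

The gap between the idealized bound and the measured 1.56× is expected: per-pass cost is not constant, and the compression and prediction machinery adds its own overhead.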