
EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

arXiv cs.CL / 3/20/2026


Key Points

  • EntropyCache introduces a training-free KV caching method for diffusion language models that uses the maximum entropy of newly decoded token distributions to decide when to recompute KV caches, reducing forward passes per denoising step.
  • The method achieves an O(V) per-step decision with no dependence on context length or model scale, addressing cache drift efficiently.
  • Two observations motivate the approach: (1) decoded token entropy correlates with KV cache drift, and (2) the feature volatility of decoded tokens persists for several steps after unmasking, justifying recomputation of only the few most recently decoded tokens.
  • Empirical results show 15.2x–26.4x speedups on standard benchmarks and 22.4x–24.1x on chain-of-thought tasks, with competitive accuracy and only 0.5% of inference time spent on overhead.
  • Code is available on GitHub at https://github.com/mscheong01/EntropyCache.

Abstract

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the k most recently decoded tokens. The skip-or-recompute decision requires only O(V) computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves 15.2x–26.4x speedup on standard benchmarks and 22.4x–24.1x on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only 0.5% of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
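The core idea, an O(V)-per-token skip-or-recompute signal from decoded token entropy, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the threshold `tau`, and the use of NumPy are all assumptions for clarity.

```python
import numpy as np

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy of each newly decoded token's distribution.

    `logits` has shape (num_decoded_tokens, vocab_size); the cost is
    O(V) per token, independent of context length or model depth.
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def should_recompute(logits: np.ndarray, tau: float) -> bool:
    """Skip-or-recompute decision in the spirit of EntropyCache.

    Recompute the KV cache when the *maximum* entropy among the newly
    decoded tokens exceeds a threshold `tau` (hypothetical value);
    otherwise reuse the cached states for this denoising step.
    """
    return float(token_entropies(logits).max()) > tau
```

On a recompute step, the paper's second observation suggests refreshing only the k most recently decoded tokens rather than the full context; the decision above stays constant-cost either way because it reads only the current step's output distributions.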