Less Is More: Fast and Accurate Reasoning with Cross-Head Unified Sparse Attention

arXiv cs.CL / 4/29/2026

💬 Opinion · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper presents LessIsMore, a training-free sparse attention method aimed at improving reasoning models’ efficiency without sacrificing accuracy during long-horizon generation.
  • It argues that, for reasoning, token importance is globally stable and shared across attention heads, allowing a unified token selection strategy to prevent selection errors from compounding over time.
  • LessIsMore uses cross-head unified token selection and a stable recency window to preserve recent context while reusing a consistent token set across layers.
  • Experiments across multiple model families and reasoning benchmarks show matching or improved accuracy while attending to substantially fewer tokens.
  • With kernel-level optimizations, the method achieves up to a 1.6× end-to-end decoding speedup and up to 1.72× faster sparse attention computation; the code is publicly released for adoption.

Abstract

Large reasoning models achieve strong performance through test-time scaling, but this incurs substantial computational overhead due to long decoding from short prompts. While sparse attention can reduce latency and memory usage, existing methods often degrade reasoning accuracy because selection errors accumulate over long generation horizons, or require costly retraining. We introduce LessIsMore, a training-free sparse attention mechanism for long-horizon reasoning. Our key insight is that token importance in reasoning is global and stable: critical tokens are largely shared across attention heads and remain stable over decoding steps. Guided by this structure, LessIsMore enforces cross-head unified token selection and preserves recent context via a stable recency window, yielding a globally consistent token set that can be reused across layers. Across multiple model families and challenging reasoning benchmarks, LessIsMore matches or improves accuracy while attending to substantially fewer tokens. With kernel-level optimizations, LessIsMore achieves up to 1.6× end-to-end decoding speedup and up to 1.72× faster sparse attention computation, with additional long-context results demonstrating the generality of our approach. Code is available at https://github.com/DerrickYLJ/LessIsMore.
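
To make the described mechanism concrete, below is a minimal sketch of cross-head unified token selection with a stable recency window for a single decoding step, assuming a standard multi-head KV cache. The function name, the parameters `top_k` and `recency_window`, and the head-averaging heuristic are illustrative assumptions, not the paper's exact implementation (which also includes kernel-level optimizations).

```python
import torch

def unified_sparse_attention(q, k, v, top_k=64, recency_window=32):
    """Sketch of cross-head unified sparse attention for one decoding step.

    q: (num_heads, head_dim)          current query, one per head
    k: (num_heads, seq_len, head_dim) cached keys
    v: (num_heads, seq_len, head_dim) cached values

    All parameter values are illustrative, not the paper's configuration.
    """
    num_heads, seq_len, head_dim = k.shape
    scale = head_dim ** -0.5

    # Per-head attention scores for the current step: (num_heads, seq_len).
    scores = torch.einsum("hd,hsd->hs", q, k) * scale

    # Cross-head unified importance: average over heads so every head
    # shares a single ranking of candidate tokens.
    unified = scores.mean(dim=0)

    # Stable recency window: always keep the most recent tokens.
    recent_start = max(seq_len - recency_window, 0)
    recent = torch.arange(recent_start, seq_len)

    # Top-k older tokens by unified importance (the recent window is
    # already kept, so exclude it from the ranking).
    older = unified[:recent_start]
    k_sel = min(top_k, older.numel())
    top = (torch.topk(older, k_sel).indices if k_sel > 0
           else older.new_empty(0, dtype=torch.long))

    # One token set shared by all heads; it can be reused by later layers.
    selected = torch.unique(torch.cat([top, recent]))

    # Dense attention restricted to the selected tokens.
    k_sub, v_sub = k[:, selected], v[:, selected]
    attn = torch.softmax(torch.einsum("hd,hsd->hs", q, k_sub) * scale, dim=-1)
    out = torch.einsum("hs,hsd->hd", attn, v_sub)
    return out, selected
```

The design point the sketch tries to capture is that all heads attend to one shared token set instead of each head picking its own, which keeps selection consistent over long generations and allows the same set to be reused across layers, per the abstract.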