Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

arXiv cs.AI / 4/21/2026


Key Points

  • Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, MCTF) exhibit a closely matched, cliff-like accuracy collapse at high compression despite using different scoring mechanisms.
  • The paper attributes this failure to two components: a signal-agnostic error-amplifier effect inherent to layer-wise reduction (which predicts convex Pareto curves and a critical compression ratio scaling as r_crit ∝ 1/L), and degradation of pairwise similarity-ranking consistency in deeper layers.
  • It introduces two diagnostic metrics—ranking consistency (ρ_s) and off-diagonal correlation (ρ_off)—to attribute the collapse to unstable pairwise scoring signals, which face O(N_p²) joint perturbations, compared with more stable unary signals, which face only O(N_p).
  • Based on this diagnosis, the authors derive design principles and build CATIS, which uses unary signals to raise the trigger threshold and triage to suppress the gain.
  • On a ViT-Large setup achieving 63% FLOPs reduction, CATIS preserves 96.9% of vanilla accuracy (81.0% top-1 on ImageNet-1K), while earlier baselines collapse to roughly 43–65%.
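The ranking-consistency diagnostic in the key points can be sketched numerically: score the same set of tokens before and after a small perturbation, then measure how well the two rankings agree via Spearman correlation (a ρ_s-style metric). The Gaussian feature model, the noise scale, the norm-based unary score, and the max-cosine-similarity pairwise score below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (scores are continuous, so ties are ignored)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

rng = np.random.default_rng(0)
N, d, sigma = 64, 32, 0.05                 # tokens, feature dim, noise scale
X = rng.normal(size=(N, d))                # stand-in for one layer's tokens
Xp = X + sigma * rng.normal(size=(N, d))   # perturbed copy of the same tokens

# Unary signal: one score per token (here, its feature norm), so a
# perturbation touches O(N) independent terms.
unary, unary_p = np.linalg.norm(X, axis=1), np.linalg.norm(Xp, axis=1)

# Pairwise signal: each token scored by its max cosine similarity to any
# other token (a ToMe-style matching score), so a perturbation jointly
# moves O(N^2) similarity entries.
def pairwise_score(Z):
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T
    np.fill_diagonal(S, -np.inf)           # ignore self-similarity
    return S.max(axis=1)

rho_unary = spearman(unary, unary_p)
rho_pair = spearman(pairwise_score(X), pairwise_score(Xp))
print(f"rho_s (unary):    {rho_unary:.3f}")
print(f"rho_s (pairwise): {rho_pair:.3f}")
```

Sweeping sigma (or the layer depth at which features are taken) would trace out the consistency degradation the paper reports; how large the unary-vs-pairwise gap is depends on the feature distribution.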

Abstract

Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency ρ_s and off-diagonal correlation ρ_off, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and r_crit ∝ 1/L; and (2) shared reliance on pairwise similarity signals whose ranking consistency degrades from ρ_s = 0.88 to 0.27 in deep layers. Pairwise rankings are inherently unstable (O(N_p²) joint perturbations), while unary signals enjoy greater stability (O(N_p) perturbations, by the central limit theorem). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, and triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K, where all baselines collapse to 43–65%.
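The r_crit ∝ 1/L prediction can be illustrated with a toy compounding-error model: if each of L reduction layers contributes a small relative error that grows with the per-layer reduction rate r, the errors multiply across depth, and the rate at which accuracy falls past a fixed threshold shrinks roughly as 1/L. The linear error model eps(r) = k·r, the value k = 0.5, the halving threshold, and the depths 12/24 are all illustrative assumptions, not the paper's derivation or numbers.

```python
import numpy as np

# Toy compounding-error model: each of L layers reduces tokens at rate r and
# contributes relative error eps(r) = k * r; errors compound multiplicatively.
def retained_accuracy(r, L, k=0.5):
    return (1.0 - k * r) ** L

rs = np.linspace(0.0, 1.0, 1001)           # sweep per-layer reduction rates
r_crits = {}
for L in (12, 24):                         # two illustrative network depths
    acc = retained_accuracy(rs, L)
    r_crits[L] = rs[np.argmax(acc < 0.5)]  # first rate where accuracy halves
    print(f"L={L}: r_crit ~ {r_crits[L]:.3f}")
# L=12: r_crit ~ 0.113; L=24: r_crit ~ 0.057
```

Doubling the depth roughly halves the critical rate (0.113 → 0.057), matching the r_crit ∝ 1/L scaling; the accuracy-vs-rate curve (1 − k·r)^L is also convex in r, echoing the convex Pareto curves the framework predicts.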