Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

arXiv cs.AI / 4/21/2026


Key Points

  • Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, MCTF) exhibit a closely matched, cliff-like accuracy collapse at high compression despite using different scoring mechanisms.
  • The paper attributes this failure to two components: a signal-agnostic error-amplifier effect inherent to layer-wise reduction (which predicts convex Pareto curves and a critical compression ratio scaling as r_crit ∝ 1/L), and degradation of pairwise similarity-ranking consistency in deeper layers.
  • It introduces two diagnostic metrics—ranking consistency (ρ_s) and off-diagonal correlation (ρ_off)—to attribute the collapse to unstable pairwise scoring signals, which face O(N_p²) joint perturbations, compared with more stable unary signals, which face only O(N_p).
  • Based on this diagnosis, the authors derive design principles and build CATIS, which uses unary signals to raise the trigger threshold and triage to suppress the gain.
  • On a ViT-Large setup achieving 63% FLOPs reduction, CATIS preserves 96.9% of vanilla accuracy (81.0% top-1 on ImageNet-1K), while earlier baselines collapse to roughly 43–65%.
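The ranking-consistency diagnostic in the key points can be sketched numerically: score the same set of tokens before and after a small perturbation, then measure how well the two rankings agree via Spearman correlation (a ρ_s-style metric). The Gaussian feature model, the noise scale, the norm-based unary score, and the max-cosine-similarity pairwise score below are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (scores are continuous, so ties are ignored)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

rng = np.random.default_rng(0)
N, d, sigma = 64, 32, 0.05                 # tokens, feature dim, noise scale
X = rng.normal(size=(N, d))                # stand-in for one layer's tokens
Xp = X + sigma * rng.normal(size=(N, d))   # perturbed copy of the same tokens

# Unary signal: one score per token (here, its feature norm), so a
# perturbation touches O(N) independent terms.
unary, unary_p = np.linalg.norm(X, axis=1), np.linalg.norm(Xp, axis=1)

# Pairwise signal: each token scored by its max cosine similarity to any
# other token (a ToMe-style matching score), so a perturbation jointly
# moves O(N^2) similarity entries.
def pairwise_score(Z):
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T
    np.fill_diagonal(S, -np.inf)           # ignore self-similarity
    return S.max(axis=1)

rho_unary = spearman(unary, unary_p)
rho_pair = spearman(pairwise_score(X), pairwise_score(Xp))
print(f"rho_s (unary):    {rho_unary:.3f}")
print(f"rho_s (pairwise): {rho_pair:.3f}")
```

Sweeping sigma (or the layer depth at which features are taken) would trace out the consistency degradation the paper reports; how large the unary-vs-pairwise gap is depends on the feature distribution.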

Abstract

Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency ρ_s and off-diagonal correlation ρ_off, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and r_crit ∝ 1/L; and (2) shared reliance on pairwise similarity signals whose ranking consistency degrades from ρ_s = 0.88 to 0.27 in deep layers. Pairwise rankings are inherently unstable (O(N_p²) joint perturbations), while unary signals enjoy greater stability (O(N_p) perturbations, by the central limit theorem). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, and triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K, where all baselines collapse to 43–65%.
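The r_crit ∝ 1/L prediction can be illustrated with a toy compounding-error model: if each of L reduction layers contributes a small relative error that grows with the per-layer reduction rate r, the errors multiply across depth, and the rate at which accuracy falls past a fixed threshold shrinks roughly as 1/L. The linear error model eps(r) = k·r, the value k = 0.5, the halving threshold, and the depths 12/24 are all illustrative assumptions, not the paper's derivation or numbers.

```python
import numpy as np

# Toy compounding-error model: each of L layers reduces tokens at rate r and
# contributes relative error eps(r) = k * r; errors compound multiplicatively.
def retained_accuracy(r, L, k=0.5):
    return (1.0 - k * r) ** L

rs = np.linspace(0.0, 1.0, 1001)           # sweep per-layer reduction rates
r_crits = {}
for L in (12, 24):                         # two illustrative network depths
    acc = retained_accuracy(rs, L)
    r_crits[L] = rs[np.argmax(acc < 0.5)]  # first rate where accuracy halves
    print(f"L={L}: r_crit ~ {r_crits[L]:.3f}")
# L=12: r_crit ~ 0.113; L=24: r_crit ~ 0.057
```

Doubling the depth roughly halves the critical rate (0.113 → 0.057), matching the r_crit ∝ 1/L scaling; the accuracy-vs-rate curve (1 − k·r)^L is also convex in r, echoing the convex Pareto curves the framework predicts.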