Abstract
Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they all exhibit a closely matched, cliff-like collapse at high compression. This paper explains \emph{why}. We develop a diagnostic framework with two tools, ranking consistency \rho_s and off-diagonal correlation \rho_{\text{off}}, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, which predicts convex Pareto curves and r_{\text{crit}} \propto 1/L; and (2) a shared reliance on \emph{pairwise} similarity signals, whose ranking consistency degrades from \rho_s{=}0.88 to 0.27 in deep layers. Pairwise rankings are inherently unstable (O(N_p^2) joint perturbations), whereas unary signals are more stable (O(N_p) perturbations, CLT). From this diagnosis we derive three design principles and build CATIS as a constructive validation: unary signals raise the trigger threshold, and triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K, where all baselines collapse to 43--65%.