Rethink MAE with Linear Time-Invariant Dynamics

arXiv cs.CV / 5/5/2026


Key Points

  • The paper argues that common probing methods for frozen vision models (e.g., GAP and CLS token approaches) incorrectly treat patch tokens as an orderless bag-of-features, even though token order is a crucial and exploitable signal.
  • It introduces SSMProbe, a probing framework based on State Space Models (SSMs) modeled as discrete Linear Time-Invariant (LTI) dynamical systems, in which memory decay makes the final state depend on sequence order (see the sketch after this list).
  • By casting token ordering as an information-scheduling problem, the authors compare fixed scan heuristics with a differentiable Sinkhorn-based soft permutation learned from downstream supervision (sketched in code after the abstract).
  • Experiments across major frozen vision backbones (MAE, BEiT, DINOv2, and supervised ViT as a CLS-dominated extreme) show a large “order gap”: fixed-order token scans fail on highly localized patch features, while the learned soft permutation recovers strong performance.
  • The study concludes that pretraining objectives shape token structure in an order-dependent way, and the SSM probe provides a new diagnostic lens for understanding and exploiting this heterogeneity in visual representations.
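To make the order-sensitivity claim concrete, here is a minimal NumPy sketch, not the paper's implementation: all shapes, the A/B/C matrices, and the decay parameterization are illustrative assumptions. A diagonal discrete LTI recurrence h_t = A h_{t-1} + B x_t with decay factors below one forgets early inputs, so the readout C h_T from the final state changes when the same tokens arrive in a different order, whereas Global Average Pooling (GAP) is permutation-invariant by construction.

```python
# Minimal sketch (illustrative shapes and parameters, not the paper's code):
# an order-sensitive discrete LTI probe vs. permutation-invariant GAP.
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_state = 16, 8, 32            # tokens, token dim, SSM state dim

A = np.diag(np.exp(-rng.uniform(0.05, 1.0, d_state)))  # stable decay, eigenvalues in (0, 1)
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal((4, d_state)) * 0.1            # small readout head

def ssm_final_state(tokens):
    """Run h_t = A h_{t-1} + B x_t over the sequence; return C @ h_T."""
    h = np.zeros(d_state)
    for x in tokens:                     # order matters: earlier tokens decay more
        h = A @ h + B @ x
    return C @ h

tokens = rng.standard_normal((T, d_in))  # stand-in for frozen patch tokens
perm = rng.permutation(T)

gap = tokens.mean(axis=0)                # GAP: identical under any ordering
assert np.allclose(gap, tokens[perm].mean(axis=0))

y_raster = ssm_final_state(tokens)       # a "fixed scan" order
y_shuffled = ssm_final_state(tokens[perm])  # same tokens, different schedule
print("SSM output change under permutation:", np.linalg.norm(y_raster - y_shuffled))
```

Running this prints a nonzero gap for the SSM readout while the GAP assertion passes, which is exactly the property the paper exploits: the probe can reward placing informative tokens late, where they decay least.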

Abstract

Standard representation probing for visual models relies on mathematically permutation-invariant operations such as Global Average Pooling (GAP) or CLS tokens, treating patch representations as an unstructured bag-of-words. We challenge this paradigm by demonstrating that token order is a critical, exploitable dimension in frozen visual representations (e.g., MAE, BEiT, DINOv2, and supervised ViT as a CLS-dominated extreme). We propose SSMProbe, a probing framework driven by a State Space Model (SSM). Operating as discrete Linear Time-Invariant (LTI) dynamical systems, SSMs act as permutation-sensitive probes in which sequence order strictly dictates the final state due to inherent memory decay. Formulating token ordering as an information-scheduling problem, we compare fixed scan heuristics against a differentiable, Sinkhorn-based soft permutation learned from downstream supervision. Evaluations on standard and fine-grained classification benchmarks reveal a striking order gap: while fixed scans fail dramatically on highly localized patch features, our learned soft permutation extracts competitive performance from those same patch sequences. We find that pre-training objectives fundamentally shape token structure: DINOv2 concentrates global semantics in its optimized CLS token, leaving patches hyperspecialized; pure MAE preserves distributed representations with heterogeneous patch informativeness; supervised ViT represents a CLS-dominated extreme; and BEiT occupies a middle ground. This heterogeneity is order-dependent (the SSM probe's performance depends critically on which tokens occupy which temporal positions) and is not merely a topological property of the spatial grid. SSMProbe's learned routing discovers and exploits this heterogeneity, offering a powerful new diagnostic lens for visual representation analysis.
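The Sinkhorn-based soft permutation can be sketched as follows. This is a generic log-space Sinkhorn operator feeding a diagonal LTI scan, written under assumptions: the module and parameter names (SoftPermutationSSMProbe, perm_logits, the decay parameterization) are hypothetical, not the authors' API. The key idea it illustrates is that the learnable logits are normalized into a doubly-stochastic matrix P, tokens are softly rescheduled as P @ X before the scan, and the downstream loss backpropagates through P into the logits.

```python
# Minimal PyTorch sketch of a learned soft permutation + LTI scan probe.
# Hypothetical names/shapes; a generic Sinkhorn operator, not the paper's exact code.
import torch
import torch.nn as nn

def sinkhorn(logits, n_iters=20, tau=0.1):
    """Alternating row/column normalization in log space -> doubly-stochastic matrix."""
    log_p = logits / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # rows sum to 1
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)  # cols sum to 1
    return log_p.exp()

class SoftPermutationSSMProbe(nn.Module):
    def __init__(self, n_tokens, d_in, d_state, n_classes):
        super().__init__()
        self.perm_logits = nn.Parameter(torch.zeros(n_tokens, n_tokens))
        self.log_decay = nn.Parameter(torch.rand(d_state))   # diagonal LTI dynamics
        self.B = nn.Linear(d_in, d_state, bias=False)
        self.C = nn.Linear(d_state, n_classes, bias=False)

    def forward(self, x):                       # x: (batch, n_tokens, d_in)
        P = sinkhorn(self.perm_logits)          # (n_tokens, n_tokens) soft permutation
        x = torch.einsum("ij,bjd->bid", P, x)   # softly reschedule the tokens
        A = torch.exp(-self.log_decay.exp())    # per-channel decay in (0, 1)
        h = x.new_zeros(x.shape[0], A.shape[0])
        for t in range(x.shape[1]):             # LTI scan over the reordered sequence
            h = A * h + self.B(x[:, t])
        return self.C(h)                        # class logits from the final state

# Usage: only the probe trains; the backbone tokens stay frozen.
probe = SoftPermutationSSMProbe(n_tokens=196, d_in=768, d_state=64, n_classes=1000)
tokens = torch.randn(4, 196, 768)               # stand-in for frozen ViT patch tokens
loss = nn.functional.cross_entropy(probe(tokens), torch.randint(1000, (4,)))
loss.backward()                                 # gradients reach perm_logits via Sinkhorn
```

A fixed-scan baseline in this setup would simply freeze P at the identity (raster order); the "order gap" the paper reports is the performance difference between that baseline and the probe with learned perm_logits.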