Rethink MAE with Linear Time-Invariant Dynamics
arXiv cs.CV / 5/5/2026
Key Points
- The paper argues that common probing methods for frozen vision models (e.g., GAP and CLS token approaches) incorrectly treat patch tokens as an orderless bag-of-features, even though token order is a crucial and exploitable signal.
- It introduces SSMProbe, a probing framework based on State Space Models (SSMs) modeled as discrete Linear Time-Invariant (LTI) dynamical systems, where sequence order determines the final state through memory decay.
- By casting token ordering as an information-scheduling problem, the authors compare fixed scan heuristics with a differentiable Sinkhorn-based soft permutation learned from downstream supervision.
- Experiments across major frozen vision backbones (MAE, BEiT, DINOv2, and ViT in CLS-ablation extremes) show a large “order gap”: fixed token order scans fail for localized patch features, while the learned soft permutation recovers strong performance.
- The study concludes that pretraining objectives shape token structure in an order-dependent way, and the SSM probe provides a new diagnostic lens for understanding and exploiting this heterogeneity in visual representations.
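To make the order-sensitivity claim concrete, here is a minimal sketch of a discrete LTI scan over patch tokens. This is not the paper's SSMProbe implementation; it assumes the simplest possible diagonal state-space recurrence (`h_t = decay * h_{t-1} + x_t`) to illustrate why the final state depends on token order, whereas global average pooling (GAP) does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def lti_scan(tokens, decay=0.9):
    """Run a scalar-diagonal discrete LTI recurrence over a token sequence.

    h_t = decay * h_{t-1} + x_t. With decay < 1, earlier tokens are
    exponentially forgotten, so the final state encodes token order.
    """
    h = np.zeros(tokens.shape[1])
    for x in tokens:
        h = decay * h + x
    return h

tokens = rng.normal(size=(16, 8))       # 16 patch tokens, dim 8 (toy data)
forward = lti_scan(tokens)
backward = lti_scan(tokens[::-1])

# The scan is order-sensitive; GAP is orderless by construction.
print(np.allclose(forward, backward))                     # False
print(np.allclose(tokens.mean(axis=0),
                  tokens[::-1].mean(axis=0)))             # True
```

Reversing the token sequence changes the scan's final state but leaves the GAP feature untouched, which is exactly the signal the bag-of-features probes discard.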
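The learned ordering rests on Sinkhorn normalization, which relaxes a hard permutation into a doubly stochastic matrix that gradients can flow through. The sketch below shows the core iteration only; the paper's actual parameterization, temperature, and training loop are not reproduced here:

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Sinkhorn normalization: alternately normalize the rows and columns
    of exp(logits). The result approaches a doubly stochastic matrix,
    a differentiable relaxation of a hard permutation matrix."""
    P = np.exp(logits)
    for _ in range(n_iters):
        P = P / P.sum(axis=1, keepdims=True)  # row normalization
        P = P / P.sum(axis=0, keepdims=True)  # column normalization
    return P

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 6))   # learnable ordering scores (toy values)
P = sinkhorn(logits)

tokens = rng.normal(size=(6, 4))
reordered = P @ tokens             # soft permutation of the token sequence

print(np.allclose(P.sum(axis=0), 1.0, atol=1e-3))  # columns sum to 1
print(np.allclose(P.sum(axis=1), 1.0, atol=1e-3))  # rows sum to 1
```

Feeding `reordered` into an order-sensitive scan lets downstream supervision adjust the ordering logits end to end, which is the information-scheduling framing described above.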