Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

arXiv cs.CV / 5/5/2026


Key Points

  • The paper targets the challenge of quantizing vision Transformers to low bit-widths, where activation outliers can degrade accuracy in fully quantized deployment.
  • Instead of simply suppressing outliers or relying on post-training quantization, it proposes Colinearity-Decay (CD), a training-time structural regularizer that penalizes harmful cross-matrix alignment inside Transformer blocks (a hedged sketch of one plausible penalty follows this list).
  • CD is designed to be non-invasive: it does not change the model architecture or task loss, and it adds minimal training overhead when applied as a decoupled update.
  • Experiments across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning show consistent improvements in quantized accuracy while preserving (or improving) full-precision performance.
  • The authors conclude that structural regularization can effectively “prepare” vision Transformers for low-bit deployment with zero additional inference-time cost.
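
To make “cross-matrix alignment” concrete, below is a minimal sketch of one plausible penalty for an ordered weight pair. This is an illustrative assumption, not the paper’s published formula: the function name `colinearity_penalty` and the normalized-Frobenius scoring are ours. The intuition is that the score grows when the output directions of the first matrix line up with the high-gain input directions of the second, which is exactly the amplification pattern the paper blames for harmful activation outliers.

```python
import torch

def colinearity_penalty(W1: torch.Tensor, W2: torch.Tensor) -> torch.Tensor:
    """Hypothetical alignment score for an ordered pair (W1, W2) whose
    composition acts on activations as x @ W1 @ W2.

    The ratio ||W1 @ W2||_F^2 / (||W1||_F^2 * ||W2||_F^2) lies in [0, 1]:
    it is large when the top singular directions of the two matrices
    align (so the pair can jointly amplify a few activation directions
    into outliers) and small when their spectra are mutually spread out.
    """
    prod = W1 @ W2                                   # composed linear map
    num = prod.pow(2).sum()                          # ||W1 @ W2||_F^2
    den = W1.pow(2).sum() * W2.pow(2).sum() + 1e-12  # scale normalization
    return num / den
```

In a ViT block, natural candidates for such ordered pairs would be the attention value/output projections or the two MLP projections (ignoring the nonlinearity between the latter); which pairs the paper actually regularizes is not spelled out in this summary.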

Abstract

Low-bit quantization is a practical route for efficiently deploying vision Transformers, yet activation outliers complicate fully quantized deployment. Existing methods either handle quantization post-training or suppress large activations during training; however, aggressively restricting outliers in vision models can lead to a poorer trade-off between full-precision and quantized accuracy. We argue that rather than simply suppressing outliers, the training objective should control the structural amplification that makes them harmful. To this end, we introduce Colinearity-Decay (CD), a structural regularizer for ordered matrix pairs within Transformer blocks. CD penalizes detrimental cross-matrix alignment and mitigates extreme activations without altering the architecture or task loss. Applied as a decoupled update, CD is non-invasive and introduces minimal training overhead. Across ImageNet-1K pre-training, COCO detection, and downstream fine-tuning, CD consistently boosts quantized accuracy across multiple pipelines while preserving, or even improving, full-precision performance. Ultimately, our results demonstrate that structural regularization effectively prepares vision Transformers for low-bit deployment with zero inference-time overhead.
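
The “decoupled update” can be read by analogy with AdamW’s decoupled weight decay: the penalty gradient is applied directly to the weights, separately from the task-loss optimizer step, so it never contaminates the optimizer’s moment estimates. The sketch below illustrates that pattern using the same assumed alignment proxy as above; the pair selection, the coefficient `lam`, and the exact update rule are our assumptions, not the paper’s specification.

```python
import torch

def colinearity_decay_step(pairs, lr: float, lam: float) -> None:
    """Decoupled regularizer step, run after the task-loss optimizer
    step. `pairs` is a list of (W1, W2) parameter tensors forming
    ordered compositions x @ W1 @ W2; the penalty is the illustrative
    alignment proxy sketched earlier, not the paper's exact formula."""
    for W1, W2 in pairs:
        with torch.enable_grad():
            prod = W1 @ W2
            pen = prod.pow(2).sum() / (
                W1.pow(2).sum() * W2.pow(2).sum() + 1e-12
            )
            # Gradient of the penalty alone, kept out of the .grad
            # buffers so it never mixes with the task-loss gradients.
            g1, g2 = torch.autograd.grad(pen, [W1, W2])
        with torch.no_grad():        # plain descent on the penalty,
            W1 -= lr * lam * g1      # outside the main optimizer, in the
            W2 -= lr * lam * g2      # spirit of AdamW's decoupled decay

# Hypothetical placement in a training loop (names are ours):
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   colinearity_decay_step(cd_pairs, lr=current_lr, lam=0.1)
```

Because such a step touches only a few weight matrices per block and runs only during training, it is consistent with the claims of minimal training overhead and zero inference-time cost.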