Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers
arXiv cs.CV / 5/4/2026
💬 Opinion · Models & Research
Key Points
- The paper proposes using Discrete Cosine Transform (DCT) structure to improve Vision Transformers, addressing the fact that query/key/value projection weights are difficult and costly to train from random initialization.
- It introduces a DCT-coefficient-based initialization method for self-attention that preserves structure and yields consistent classification accuracy improvements on CIFAR-10 and ImageNet-1K.
- The authors also present a DCT-based attention compression approach that truncates high-frequency DCT components of input patches to exploit frequency-domain decorrelation and reduce projection dimensionality.
- Experiments on Swin Transformer show that the compression significantly cuts computational overhead while keeping performance comparable to baselines.
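The two ideas above can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation: the function names, shapes, and the choice to seed all three projections with the same orthonormal DCT-II basis are assumptions made here for clarity.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of shape (n, n); row k is the
    k-th cosine basis vector, lowest frequency first."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    M[0] *= np.sqrt(1.0 / n)
    M[1:] *= np.sqrt(2.0 / n)
    return M

def dct_init_qkv(dim):
    # Hypothetical structure-preserving initialization: start the
    # query/key/value projection weights from the orthonormal DCT basis
    # instead of random Gaussian matrices.
    D = dct_matrix(dim)
    return D.copy(), D.copy(), D.copy()

def dct_compress(patches, keep):
    # Project patch embeddings onto the DCT basis and keep only the
    # `keep` lowest-frequency coefficients, shrinking the dimensionality
    # seen by the attention projections (frequency-domain decorrelation
    # concentrates energy in the low-frequency coefficients).
    D = dct_matrix(patches.shape[-1])
    coeffs = patches @ D.T
    return coeffs[:, :keep]

rng = np.random.default_rng(0)
x = rng.standard_normal((196, 64))  # e.g. 14x14 patches, 64-dim embeddings
Wq, Wk, Wv = dct_init_qkv(64)
z = dct_compress(x, keep=16)
print(Wq.shape, z.shape)  # (64, 64) (196, 16)
```

Because the DCT basis is orthonormal, the initialization preserves norms exactly, and truncation keeps the best rank-`keep` approximation of each patch in that basis.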