FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers
arXiv cs.CV / 4/28/2026
Key Points
- FreqFormer addresses the quadratic runtime and memory cost of long-sequence video diffusion transformers by replacing uniform attention with frequency-aware heterogeneous attention.
- It splits token features into spectral bands and applies a different attention operator per band: dense attention for low frequencies, block-sparse attention for mid frequencies, and sliding-window local attention for high frequencies (see the band-split and per-band attention sketches after this list).
- A lightweight spectral routing network dynamically assigns attention heads across bands based on layer statistics and the diffusion timestep, shifting compute from global structure toward fine detail as denoising progresses (a minimal router is sketched below).
- The work includes a fused GPU execution plan and an analytical complexity model, with simulations at sequence lengths from 64K to 1M tokens showing substantial reductions in estimated attention FLOPs and KV memory traffic compared with dense attention (a toy version of the cost model appears below).
- The paper also provides theoretical interpretations (including an orthonormal-decomposition view, summarized below) and systems analyses (throughput, arithmetic intensity, memory traffic, and scaling), arguing that frequency-aware heterogeneous attention is a practical direction for long-video diffusion transformers.
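The summary does not spell out the paper's exact banding scheme. The sketch below assumes the simplest reading of "splits token features into spectral bands": an FFT over the token axis with fixed cutoff fractions. The `low_frac` and `mid_frac` cutoffs are illustrative placeholders, not values from the paper.

```python
import torch

def split_spectral_bands(x: torch.Tensor,
                         low_frac: float = 0.1,
                         mid_frac: float = 0.4):
    """Split (batch, seq, dim) features into low/mid/high frequency bands
    along the token axis. The three bands sum back to the input because
    the binary frequency masks partition the spectrum."""
    n = x.shape[1]
    xf = torch.fft.rfft(x, dim=1)               # (batch, n//2 + 1, dim)
    n_bins = xf.shape[1]
    lo = max(1, int(low_frac * n_bins))
    mid = min(max(lo + 1, int(mid_frac * n_bins)), n_bins)

    def band(a: int, b: int) -> torch.Tensor:
        mask = torch.zeros(n_bins, device=x.device)
        mask[a:b] = 1.0
        # Zero out all bins outside [a, b) and transform back to tokens.
        return torch.fft.irfft(xf * mask[None, :, None], n=n, dim=1)

    return band(0, lo), band(lo, mid), band(mid, n_bins)
```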
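Given the three bands, each can be fed through its assigned operator. Here is a minimal PyTorch rendering using `torch.nn.functional.scaled_dot_product_attention` with boolean masks; materializing the masks is only for clarity and is exactly the O(n²) cost the paper's fused GPU execution plan would avoid. The `block` and `window` sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def band_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   kind: str, block: int = 64, window: int = 128):
    """q, k, v: (batch, heads, seq, head_dim); returns the same shape.
    `kind` selects the per-band operator described in the paper."""
    if kind == "dense":                       # low band: full global attention
        return F.scaled_dot_product_attention(q, k, v)
    n = q.shape[-2]
    idx = torch.arange(n, device=q.device)
    if kind == "block_sparse":                # mid band: diagonal-block pattern
        mask = (idx[:, None] // block) == (idx[None, :] // block)
    elif kind == "sliding_window":            # high band: local window
        mask = (idx[:, None] - idx[None, :]).abs() <= window
    else:
        raise ValueError(f"unknown band kind: {kind}")
    # True entries participate in attention; every row keeps its diagonal,
    # so the softmax is always well defined.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```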
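The routing network is described only as lightweight and conditioned on layer statistics and the diffusion timestep. One plausible minimal form is sketched below; the two-layer MLP, the pooled mean/std statistics, and the sinusoidal timestep embedding are all assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpectralRouter(nn.Module):
    """Maps pooled layer statistics plus a timestep embedding to a
    per-head distribution over the three spectral bands."""

    def __init__(self, dim: int, n_heads: int, n_bands: int = 3, t_dim: int = 32):
        super().__init__()
        self.n_heads, self.n_bands, self.t_dim = n_heads, n_bands, t_dim
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim + t_dim, 128), nn.GELU(),
            nn.Linear(128, n_heads * n_bands),
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        """x: (batch, seq, dim) layer input; t: (batch,) diffusion timestep.
        Returns (batch, heads, bands) routing weights summing to 1 per head."""
        stats = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
        # Sinusoidal timestep embedding, as in standard diffusion models.
        half = self.t_dim // 2
        freqs = torch.exp(-torch.arange(half, device=x.device,
                                        dtype=torch.float32) / half)
        angles = t[:, None].float() * freqs[None, :]
        t_emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
        logits = self.mlp(torch.cat([stats, t_emb], dim=-1))
        return logits.view(-1, self.n_heads, self.n_bands).softmax(dim=-1)
```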
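For the complexity model, a back-of-the-envelope version counts attended query-key pairs per operator, which is enough to see why the heterogeneous plan scales better. The head-allocation fractions (25% dense, 50% block-sparse, 25% sliding-window) and the sizes are invented for illustration; they are not the paper's numbers.

```python
def attn_flops(n: int, d: int, kind: str,
               block: int = 64, window: int = 128) -> float:
    """Approximate FLOPs for one attention head over n tokens of head dim d:
    ~2 FLOPs per multiply-accumulate for QK^T and again for attn @ V."""
    if kind == "dense":
        pairs = n * n
    elif kind == "block_sparse":
        pairs = n * block             # each query attends within its block
    elif kind == "sliding_window":
        pairs = n * (2 * window + 1)  # each query sees a local window
    else:
        raise ValueError(kind)
    return 4.0 * pairs * d

for n in (64_000, 256_000, 1_000_000):
    dense = attn_flops(n, 128, "dense")
    hetero = (0.25 * attn_flops(n, 128, "dense")
              + 0.50 * attn_flops(n, 128, "block_sparse")
              + 0.25 * attn_flops(n, 128, "sliding_window"))
    print(f"n={n:>9,}: dense {dense:.2e} FLOPs, "
          f"heterogeneous {hetero:.2e} ({dense / hetero:.1f}x less)")
```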
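The orthonormal-decomposition view is not detailed in this summary. One natural reading, assuming the band masks partition the Fourier basis, is that the band filters form orthogonal projectors that reconstruct the input exactly, so the per-band attention operators act on non-overlapping subspaces:

```latex
% F is the unitary DFT over the token axis; M_b is the 0/1 diagonal mask
% for band b. Because the masks partition the frequency bins, the band
% projectors are mutually orthogonal and sum to the identity.
P_b = F^{*} M_b F, \qquad
\sum_{b \in \{\text{low},\,\text{mid},\,\text{high}\}} P_b = I, \qquad
P_b P_{b'} = \delta_{bb'}\, P_b, \qquad
x = \sum_b P_b x .
```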