Dispatch-Aware Ragged Attention for Pruned Vision Transformers
arXiv cs.AI · April 20, 2026
Key Points
- The paper analyzes why token pruning for Vision Transformers (ViTs) does not proportionally reduce real-world attention latency when using variable-length attention APIs like FlashAttention-2 varlen and PyTorch NestedTensor SDPA.
- It identifies a dispatch-overhead bottleneck: for typical post-pruning token counts (≤197), matrix computation finishes in single-digit microseconds while host-side dispatch takes 60–90 microseconds.
- The authors propose a lightweight bidirectional Triton attention kernel designed to lower the dispatch floor (to ~40 microseconds), making pruning’s wall-clock benefits more apparent.
- Implemented in a full pack–attend–unpack pipeline, the approach delivers up to 2.24× end-to-end throughput versus padded PyTorch SDPA across four pruning methods and multiple DeiT model sizes, while preserving classification behavior (maximum absolute logit difference below 0.007).
- Overall, the work reframes pruning performance as not only a FLOP-reduction problem but also a kernel/dispatch overhead optimization problem for short sequences common in ViTs.
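The dispatch-floor claim can be checked with rough arithmetic. The sketch below uses the microsecond figures quoted above; the per-head dimension, head count, and sustained-throughput figure are illustrative assumptions, not values from the paper.

```python
# Back-of-envelope model of why host-side dispatch dominates attention
# latency for short post-pruning sequences. The 100 TFLOP/s sustained
# throughput, dim=64, and heads=12 are assumptions for illustration;
# the dispatch figures come from the summary above.

def attention_flops(n_tokens: int, dim: int = 64, heads: int = 12) -> int:
    """Approximate FLOPs for one layer's QK^T and PV batched matmuls."""
    # Two (n x n x d) matmuls per head, 2 FLOPs per multiply-accumulate.
    return heads * 2 * (2 * n_tokens * n_tokens * dim)

def compute_time_us(n_tokens: int, tflops: float = 100.0) -> float:
    """Idealized matmul time in microseconds at a sustained TFLOP/s rate."""
    return attention_flops(n_tokens) / (tflops * 1e12) * 1e6

DISPATCH_US = 75.0       # mid-range of the 60-90 us host-side dispatch above
KERNEL_FLOOR_US = 40.0   # the proposed kernel's lower dispatch floor

for n in (197, 128, 64):
    t = compute_time_us(n)
    print(f"n={n:4d}: compute ~{t:.2f} us vs dispatch {DISPATCH_US:.0f} us "
          f"(compute share: {t / (t + DISPATCH_US):.0%})")
```

Under these assumptions the matmul work at n=197 lands in single-digit microseconds, so shaving the dispatch floor from ~75 µs toward ~40 µs moves total latency far more than any further FLOP reduction from pruning.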
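The pack–attend–unpack pipeline mentioned above hinges on simple ragged-buffer bookkeeping: surviving tokens from each image are gathered into one contiguous buffer, with cumulative offsets telling the varlen kernel where each sequence starts. A minimal NumPy sketch of that bookkeeping, assuming the `cu_seqlens` offset convention used by FlashAttention-2 varlen (the function names here are illustrative, not the paper's API):

```python
import numpy as np

# Illustrative sketch of the pack -> attend -> unpack bookkeeping around
# a varlen attention call. cu_seqlens follows the FlashAttention-2 varlen
# convention; pack_tokens/unpack_tokens are hypothetical helper names.

def pack_tokens(tokens, keep_masks):
    """Gather surviving tokens from each image into one ragged buffer.

    tokens:     (batch, seq, dim) array
    keep_masks: (batch, seq) boolean array produced by the pruning method
    Returns a (total_kept, dim) packed buffer and cu_seqlens offsets.
    """
    kept = [img[m] for img, m in zip(tokens, keep_masks)]
    lens = np.array([k.shape[0] for k in kept])
    # cu_seqlens[i] is the start offset of sequence i in the packed buffer.
    cu_seqlens = np.concatenate([[0], np.cumsum(lens)])
    return np.concatenate(kept, axis=0), cu_seqlens

def unpack_tokens(packed, cu_seqlens, keep_masks, seq, dim):
    """Scatter attended tokens back into a padded (batch, seq, dim) layout."""
    out = np.zeros((len(keep_masks), seq, dim), dtype=packed.dtype)
    for i, m in enumerate(keep_masks):
        out[i, m] = packed[cu_seqlens[i]:cu_seqlens[i + 1]]
    return out

# Example: batch of 2 images, 5 tokens each, different tokens pruned.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 4))
masks = np.array([[1, 1, 0, 1, 0], [1, 0, 1, 1, 1]], dtype=bool)
packed, cu = pack_tokens(x, masks)
# A varlen attention kernel would consume (packed, cu) at this point.
y = unpack_tokens(packed, cu, masks, seq=5, dim=4)
assert np.allclose(y[masks], x[masks])  # kept tokens round-trip exactly
```

Each pack and unpack is a cheap gather/scatter, which is why the remaining cost at these sequence lengths is dominated by how quickly the host can launch the attention kernel itself.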