Dispatch-Aware Ragged Attention for Pruned Vision Transformers

arXiv cs.AI / 4/20/2026


Key Points

  • The paper analyzes why token pruning for Vision Transformers (ViTs) does not proportionally reduce real-world attention latency when using variable-length attention APIs like FlashAttention-2 varlen and PyTorch NestedTensor SDPA.
  • It identifies a dispatch-overhead bottleneck: for typical post-pruning token counts (≤197), matrix computation finishes in single-digit microseconds while host-side dispatch takes 60–90 microseconds.
  • The authors propose a lightweight bidirectional Triton attention kernel designed to lower the dispatch floor (to ~40 microseconds), making pruning’s wall-clock benefits more apparent.
  • Implemented in a full pack–attend–unpack pipeline, the approach delivers up to 2.24× end-to-end throughput versus padded PyTorch SDPA across four pruning methods and multiple DeiT model sizes, while preserving identical classification predictions (maximum absolute logit difference below 0.007).
  • Overall, the work reframes pruning performance as not only a FLOP-reduction problem but also a kernel/dispatch overhead optimization problem for short sequences common in ViTs.
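The dispatch-overhead argument in the points above can be checked with back-of-envelope arithmetic. The sketch below uses DeiT-S dimensions and an assumed sustained GPU throughput; the throughput figure and the FLOP formula are illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope: why attention arithmetic is not the bottleneck at ViT scale.
# DeiT-S dims are real; the 50 TFLOP/s sustained throughput is an assumption.

N = 197        # token count upper bound for a ViT (CLS + 14x14 patches)
d = 384        # DeiT-S embedding dimension
flops = 4 * N * N * d   # QK^T plus AV, roughly 2*N^2*d each, all heads combined
throughput = 50e12      # assumed sustained GPU throughput, FLOP/s
compute_us = flops / throughput * 1e6
dispatch_us = 75        # midpoint of the 60-90 microsecond dispatch range

print(f"compute ~{compute_us:.1f} us vs dispatch ~{dispatch_us} us")
```

Even with generous assumptions, the matrix math lands in single-digit microseconds, so the 60–90 microsecond host-side dispatch path dominates the attention latency at these sequence lengths.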

Abstract

Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet when pruned sequences are executed with state-of-the-art variable-length attention APIs, including FlashAttention-2's varlen and PyTorch's NestedTensor SDPA, the wall-clock attention latency does not scale accordingly. We trace this to a dispatch-overhead bottleneck: at the short, post-pruning sequence lengths typical of ViTs (≤197 tokens), the actual matrix arithmetic completes in single-digit microseconds while the host-side dispatch path consumes 60–90 µs. We present a lightweight, bidirectional Triton attention kernel whose dispatch floor is roughly 40 µs, about 1.5× lower than FlashAttention-2 varlen, allowing pruning savings to become more visible in wall-clock time. Integrated into a complete pack–attend–unpack pipeline, our system achieves up to 2.24× end-to-end throughput over padded PyTorch SDPA consistently across four pruning algorithms (Threshold-L2, DynamicViT, EViT, ATS), scales across DeiT-T/S/B, and maintains bit-exact classification predictions with a maximum absolute logit difference below 0.007.
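The pack–attend–unpack pipeline named in the abstract can be sketched in a few lines. The version below is a minimal NumPy illustration of the varlen-style convention (concatenating variable-length sequences into one buffer indexed by cumulative sequence lengths, `cu_seqlens`), with naive per-sequence attention standing in for the Triton kernel; function names here are my own, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pack(seqs):
    """Concatenate variable-length [n_i, d] sequences into one buffer.

    Returns the packed buffer and cu_seqlens, the cumulative sequence
    lengths used by varlen attention APIs to find sequence boundaries.
    """
    cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])
    return np.concatenate(seqs, axis=0), cu_seqlens

def attend_packed(q, k, v, cu_seqlens):
    """Naive attention over a packed buffer (stand-in for a fused kernel).

    Each sequence attends only within its own [start, end) slice, so no
    padding tokens are ever computed on.
    """
    out = np.empty_like(q)
    d = q.shape[-1]
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        scores = q[s:e] @ k[s:e].T / np.sqrt(d)
        out[s:e] = softmax(scores) @ v[s:e]
    return out

def unpack(buf, cu_seqlens):
    """Split a packed buffer back into per-image sequences."""
    return [buf[cu_seqlens[i]:cu_seqlens[i + 1]]
            for i in range(len(cu_seqlens) - 1)]
```

The point of packing is that after pruning, each image in a batch keeps a different number of tokens; padding them to a common length (as padded SDPA does) wastes compute, while the packed layout touches only surviving tokens.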