Dispatch-Aware Ragged Attention for Pruned Vision Transformers

arXiv cs.AI / 4/20/2026


Key Points

  • The paper analyzes why token pruning for Vision Transformers (ViTs) does not proportionally reduce real-world attention latency when using variable-length attention APIs like FlashAttention-2 varlen and PyTorch NestedTensor SDPA.
  • It identifies a dispatch-overhead bottleneck: for typical post-pruning token counts (≤197), matrix computation finishes in single-digit microseconds while host-side dispatch takes 60–90 microseconds.
  • The authors propose a lightweight bidirectional Triton attention kernel designed to lower the dispatch floor (to ~40 microseconds), making pruning’s wall-clock benefits more apparent.
  • Implemented in a full pack–attend–unpack pipeline, the approach delivers up to 2.24× end-to-end throughput versus padded PyTorch SDPA across four pruning methods and multiple DeiT model sizes, while preserving identical classification predictions (maximum absolute logit difference below 0.007).
  • Overall, the work reframes pruning performance as not only a FLOP-reduction problem but also a kernel/dispatch overhead optimization problem for short sequences common in ViTs.
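The dispatch-overhead argument in the points above can be checked with back-of-envelope arithmetic. The sketch below uses DeiT-S dimensions and an assumed sustained GPU throughput; the throughput figure and the FLOP formula are illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope: why attention arithmetic is not the bottleneck at ViT scale.
# DeiT-S dims are real; the 50 TFLOP/s sustained throughput is an assumption.

N = 197        # token count upper bound for a ViT (CLS + 14x14 patches)
d = 384        # DeiT-S embedding dimension
flops = 4 * N * N * d   # QK^T plus AV, roughly 2*N^2*d each, all heads combined
throughput = 50e12      # assumed sustained GPU throughput, FLOP/s
compute_us = flops / throughput * 1e6
dispatch_us = 75        # midpoint of the 60-90 microsecond dispatch range

print(f"compute ~{compute_us:.1f} us vs dispatch ~{dispatch_us} us")
```

Even with generous assumptions, the matrix math lands in single-digit microseconds, so the 60–90 microsecond host-side dispatch path dominates the attention latency at these sequence lengths.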

Abstract

Token pruning methods for Vision Transformers (ViTs) promise quadratic reductions in attention FLOPs by dropping uninformative patches. Yet when pruned sequences are executed with state-of-the-art variable-length attention APIs, including FlashAttention-2's varlen and PyTorch's NestedTensor SDPA, the wall-clock attention latency does not scale accordingly. We trace this to a dispatch-overhead bottleneck: at the short, post-pruning sequence lengths typical of ViTs (≤197 tokens), the actual matrix arithmetic completes in single-digit microseconds while the host-side dispatch path consumes 60–90 µs. We present a lightweight, bidirectional Triton attention kernel whose dispatch floor is roughly 40 µs, about 1.5× lower than FlashAttention-2 varlen, allowing pruning savings to become more visible in wall-clock time. Integrated into a complete pack–attend–unpack pipeline, our system achieves up to 2.24× end-to-end throughput over padded PyTorch SDPA consistently across four pruning algorithms (Threshold-L2, DynamicViT, EViT, ATS), scales across DeiT-T/S/B, and maintains bit-exact classification predictions with a maximum absolute logit difference below 0.007.
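The pack–attend–unpack pipeline named in the abstract can be sketched in a few lines. The version below is a minimal NumPy illustration of the varlen-style convention (concatenating variable-length sequences into one buffer indexed by cumulative sequence lengths, `cu_seqlens`), with naive per-sequence attention standing in for the Triton kernel; function names here are my own, not the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pack(seqs):
    """Concatenate variable-length [n_i, d] sequences into one buffer.

    Returns the packed buffer and cu_seqlens, the cumulative sequence
    lengths used by varlen attention APIs to find sequence boundaries.
    """
    cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])
    return np.concatenate(seqs, axis=0), cu_seqlens

def attend_packed(q, k, v, cu_seqlens):
    """Naive attention over a packed buffer (stand-in for a fused kernel).

    Each sequence attends only within its own [start, end) slice, so no
    padding tokens are ever computed on.
    """
    out = np.empty_like(q)
    d = q.shape[-1]
    for i in range(len(cu_seqlens) - 1):
        s, e = cu_seqlens[i], cu_seqlens[i + 1]
        scores = q[s:e] @ k[s:e].T / np.sqrt(d)
        out[s:e] = softmax(scores) @ v[s:e]
    return out

def unpack(buf, cu_seqlens):
    """Split a packed buffer back into per-image sequences."""
    return [buf[cu_seqlens[i]:cu_seqlens[i + 1]]
            for i in range(len(cu_seqlens) - 1)]
```

The point of packing is that after pruning, each image in a batch keeps a different number of tokens; padding them to a common length (as padded SDPA does) wastes compute, while the packed layout touches only surviving tokens.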