Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU
arXiv cs.AI / 4/20/2026
Key Points
- The paper introduces Ragged Paged Attention (RPA), a TPU-focused attention kernel designed to keep LLM inference efficient when serving workloads are dynamic and “ragged” (i.e., the requests in a batch have widely varying sequence lengths).
- RPA improves performance and flexibility using fine-grained tiling for efficient dynamic slicing, a fused pipeline that combines KV-cache updates with attention computation, and a compilation strategy that generates specialized kernels for decode, prefill, and mixed workloads.
- Experiments on Llama 3 8B running on TPU7x show strong utilization metrics, reaching up to 86% memory bandwidth utilization during decode and 73% model FLOPs utilization during prefill.
- The work is implemented with Pallas and Mosaic and has been integrated as the primary TPU backend in vLLM and SGLang, aiming to provide a production-ready foundation for TPU inference kernel design.
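To make the core idea concrete: in paged attention, each sequence's KV cache lives in fixed-size, possibly non-contiguous pages, and a per-sequence page table plus a ragged length array tell the kernel where to gather keys and values. The sketch below is a toy NumPy illustration of that decode-time gather-and-attend pattern, not the paper's Pallas/Mosaic kernel; all names, shapes, and the page layout are illustrative assumptions.

```python
import numpy as np

def paged_decode_attention(q, kv_pages, page_table, seq_lens, page_size):
    """Toy paged attention for decode: one new query token per sequence.

    q          : [num_seqs, head_dim]                 query per sequence
    kv_pages   : [num_pages, page_size, 2, head_dim]  pooled K/V pages
    page_table : [num_seqs, max_pages]                page ids per sequence
    seq_lens   : [num_seqs]                           ragged context lengths

    (Illustrative layout only; the real RPA kernel tiles and fuses this
    on TPU rather than looping per sequence.)
    """
    num_seqs, head_dim = q.shape
    out = np.zeros_like(q)
    for s in range(num_seqs):
        n = int(seq_lens[s])
        n_pages = -(-n // page_size)  # ceil division
        # Gather this sequence's K/V from its (non-contiguous) pages.
        pages = kv_pages[page_table[s, :n_pages]]   # [n_pages, page_size, 2, d]
        kv = pages.reshape(-1, 2, head_dim)[:n]     # flatten, trim ragged tail
        k, v = kv[:, 0], kv[:, 1]
        # Standard softmax attention over the gathered context.
        scores = (k @ q[s]) / np.sqrt(head_dim)     # [n]
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[s] = w @ v
    return out
```

The per-sequence Python loop is exactly the kind of dynamic, ragged control flow that is cheap on CPU but expensive on TPU, which is why RPA instead relies on fine-grained tiling and compile-time specialization to express the same gather-and-attend computation efficiently.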



