Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

arXiv cs.LG / 4/28/2026


Key Points

  • NVIDIA's CUDA Tile (CuTile) is described as a new approach that aims to simplify GPU kernel development through a tile-centric Python abstraction while retaining Tensor Core and TMA efficiency.
  • The paper presents an independent evaluation of CuTile against cuBLAS, Triton, WMMA, and raw SIMT on Hopper/Blackwell-generation GPUs (H100 NVL, B200, RTX PRO 6000 Blackwell Server Edition), benchmarking GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16.
  • On the datacenter-class Blackwell GPU (B200), CuTile reaches up to 1007 TFLOP/s for fused attention, beating FlashAttention-2 by 2.5x, yet its GEMM stays at 52-79% of cuBLAS: attractive in lines of code, but not yet a substitute for vendor-optimized libraries.
  • In contrast, the same attention kernel delivers only 53% of FlashAttention-2 throughput on the RTX PRO 6000 (sm_120), pointing to a large cross-architecture optimization gap.
  • Triton is found to sustain 62-101% of cuBLAS performance without additional architecture-specific tuning, showing stronger portability than CuTile (a minimal tile-style kernel sketch follows this list).
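
The article contains no code, but the tile-centric programming model shared by Triton and CuTile is easiest to see in a kernel: one program instance owns a BLOCK_M x BLOCK_N output tile, loads tiles of A and B, and accumulates them with a block-level matrix multiply that the compiler maps onto Tensor Cores. The sketch below is a generic Triton GEMM written for illustration only; the block sizes, FP16 inputs, and the `matmul` wrapper are assumptions, not the kernels benchmarked in the paper.

```python
# Generic tile-based GEMM in Triton -- illustrative sketch, not the paper's kernel.
import torch
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)                      # which output tile row
    pid_n = tl.program_id(1)                      # which output tile column
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)                       # block-level MMA, lowered to Tensor Cores
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)


def matmul(a, b, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32):
    """Launch one program per output tile; a and b are FP16 CUDA tensors."""
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    return c
```

The point of the abstraction is that the kernel never mentions warps, shared-memory staging, or TMA descriptors; whether a compiler can recover vendor-library performance from such a description is exactly what the paper measures.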

Abstract

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition. We benchmark representative AI workloads, including GEMM, fused multi-head attention, and end-to-end LLM inference in BF16/FP16 precision, to assess both performance and portability. Our results show that CuTile effectiveness is strongly workload- and architecture-dependent. On datacenter-class Blackwell (B200), CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x while requiring only 60 lines of Python kernel code. For GEMM, CuTile reaches 52-79% of cuBLAS performance in 22 lines of code (versus 123 for WMMA), making it a practical replacement for hand-written CUDA kernels but not yet for vendor-optimized libraries. However, the same CuTile attention kernel achieves only 53% of FlashAttention-2 throughput on RTX PRO 6000 (sm_120), exposing significant cross-architecture optimization gaps. In contrast, Triton sustains 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning, demonstrating substantially stronger portability.
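
The throughput figures quoted above follow the usual convention: a GEMM of size M x N x K is counted as 2·M·N·K floating-point operations, and achieved TFLOP/s is that count divided by measured kernel time, with cuBLAS timed the same way to provide the percentage baseline. A minimal sketch of that bookkeeping (the 8192-cubed problem size and CUDA-event harness are illustrative assumptions, not the paper's setup):

```python
# Illustrative GEMM throughput bookkeeping; sizes and harness are assumptions.
import torch

def gemm_tflops(M, N, K, seconds):
    """A GEMM performs 2*M*N*K FLOPs (one multiply and one add per term)."""
    return 2.0 * M * N * K / seconds / 1e12

M = N = K = 8192
a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)

# Time a cuBLAS-backed matmul with CUDA events to obtain the reference number.
torch.matmul(a, b)                                   # warm-up
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
torch.matmul(a, b)
end.record()
torch.cuda.synchronize()
cublas_tflops = gemm_tflops(M, N, K, start.elapsed_time(end) / 1e3)  # ms -> s
print(f"cuBLAS reference: {cublas_tflops:.1f} TFLOP/s")

# A candidate kernel timed the same way is then reported as a fraction of this
# value, e.g. the 52-79% of cuBLAS quoted for CuTile GEMM on B200.
```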