Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

arXiv cs.LG / 4/6/2026

Key Points

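A study posted to arXiv systematically measures how "dispatch" overhead (the cost of issuing each small GPU operation), a consequence of WebGPU's API and security design, can become a bottleneck for LLM inference. It covers four GPU vendors, three backends, three browsers, and two models.
It shows that naive single-operation benchmarks overestimate WebGPU dispatch cost by about 20×, and that dispatch API overhead alone is 24-36 µs on Vulkan and 32-71 µs on Metal.
Once the cost of the whole implementation, including Python, is counted, per-operation overhead is about 95 µs. The authors conclude that separating the components of dispatch overhead is important for optimization.
On Vulkan, kernel fusion improves throughput by 53%, whereas CUDA fusion provides no benefit, supporting the view that per-operation dispatch frequency and overhead dominate rather than execution efficiency.
The authors also built torch-webgpu (a PrivateUse1 PyTorch backend plus an FX-to-WebGPU compiler), which reaches 11-12% of CUDA performance on the reference platform. The RTX PRO 2000 delivers about 1.4× WebGPU's throughput despite having less compute, showing that backend choice strongly affects performance. Code, benchmarks, and data are open source.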
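
To see why per-operation overhead matters at batch size 1, here is a rough back-of-the-envelope sketch in Python. It assumes that token latency is bounded below by the number of dispatches times the per-operation overhead. The dispatch count per token (`DISPATCHES_PER_TOKEN`) is a hypothetical placeholder, not a figure from the paper. Only the 24 µs, 71 µs, and 95 µs overheads come from the summary above.

```python
def overhead_bound_tokens_per_s(dispatches_per_token: int, overhead_us: float) -> float:
    """Upper bound on tokens/s if a token cost nothing but dispatch overhead."""
    per_token_s = dispatches_per_token * overhead_us * 1e-6
    return 1.0 / per_token_s


# Hypothetical dispatch count per generated token (an assumption for illustration).
DISPATCHES_PER_TOKEN = 1000

# Per-operation overheads quoted in the summary, in microseconds.
OVERHEADS_US = {
    "API only, Vulkan (low end)": 24.0,
    "API only, Metal (high end)": 71.0,
    "total including Python": 95.0,
}

for label, us in OVERHEADS_US.items():
    bound = overhead_bound_tokens_per_s(DISPATCHES_PER_TOKEN, us)
    print(f"{label}: at most {bound:.1f} tokens/s")

# Fusing kernels so that half as many dispatches are issued doubles the bound.
# Real fusion gains are smaller, because kernel time itself does not vanish.
unfused = overhead_bound_tokens_per_s(DISPATCHES_PER_TOKEN, 95.0)
fused = overhead_bound_tokens_per_s(DISPATCHES_PER_TOKEN // 2, 95.0)
print(f"bound with half the dispatches: {fused / unfused:.1f}x")
```

Under this simplified model the bound scales inversely with dispatch count, which is consistent with fusion helping where per-dispatch overhead is high and doing little where it is already low.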

Abstract

WebGPU's security-focused design imposes per-operation validation that compounds across the many small dispatches in neural network inference, yet the true cost of this overhead is poorly characterized. We present a systematic characterization of WebGPU dispatch overhead for LLM inference at batch size 1, spanning four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native) and three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B and 1.5B). Our primary contribution is a sequential-dispatch methodology that reveals naive single-operation benchmarks overestimate dispatch cost by ${\sim}20\times$. The true per-dispatch cost of WebGPU API overhead alone is 24-36 $\mu$s on Vulkan and 32-71 $\mu$s on Metal, while the total per-operation overhead including Python cost is ${\sim}95$ $\mu$s, which turns out to be a distinction critical for optimization. On Vulkan, kernel fusion improves throughput by 53%, while CUDA fusion provides no benefit, confirming that per-operation overhead is a primary differentiator. LLM inference was tested across three major operating systems (Linux, Windows, macOS). We built torch-webgpu, a PrivateUse1-based out-of-tree PyTorch backend and an FX-to-WebGPU compiler, which on our reference platform achieves 11-12% of CUDA performance. At dtype-matched float32, RTX PRO 2000 achieves $1.4\times$ WebGPU's throughput despite ${\sim}6\times$ less compute than RTX 5090. For dispatch overhead, backend choice is the dominant factor, although implementation choice also matters substantially within a backend ($2.2\times$ for Metal). In terms of dispatch vs kernel compute efficiency, we conclude that at batch=1 with the current dispatch-heavy pipeline, per-operation overhead dominates regardless of kernel quality. All code, benchmarks, and raw data are open source.
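
The abstract does not spell out the sequential-dispatch methodology, so the following Python sketch only illustrates the general idea under stated assumptions. It assumes the naive overestimate arises because a single-operation benchmark charges a fixed synchronization cost to one dispatch, whereas timing a long run of dispatches with one final synchronization amortizes it. `FakeDevice`, its virtual clock, and the 30 µs and 570 µs costs are all hypothetical. They were chosen only to make the gap visible and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class FakeDevice:
    """Toy device with a virtual clock; every cost here is hypothetical."""
    dispatch_us: float = 30.0   # assumed CPU-side cost of issuing one dispatch
    sync_us: float = 570.0      # assumed fixed cost of waiting for the GPU
    clock_us: float = 0.0       # virtual time, advanced by the calls below

    def dispatch(self) -> None:
        self.clock_us += self.dispatch_us

    def sync(self) -> None:
        self.clock_us += self.sync_us


def naive_cost_us(dev: FakeDevice) -> float:
    """Single-operation benchmark: one dispatch followed by one sync."""
    start = dev.clock_us
    dev.dispatch()
    dev.sync()
    return dev.clock_us - start


def sequential_cost_us(dev: FakeDevice, n: int) -> float:
    """Sequential benchmark: n dispatches, one sync, averaged per dispatch."""
    start = dev.clock_us
    for _ in range(n):
        dev.dispatch()
    dev.sync()
    return (dev.clock_us - start) / n


dev = FakeDevice()
naive = naive_cost_us(dev)
sequential = sequential_cost_us(dev, 1000)
print(f"naive estimate:      {naive:.2f} us per dispatch")
print(f"sequential estimate: {sequential:.2f} us per dispatch")
print(f"overestimate factor: {naive / sequential:.1f}x")
```

In this toy model the sequential estimate converges to the assumed per-dispatch cost as `n` grows, while the naive estimate is dominated by the synchronization term.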