Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

arXiv cs.AI / 4/1/2026


Key Points

  • The paper argues that GPU kernel optimization with LLM agents is costly because each trial requires generating, compiling, validating, and profiling candidates across a large design space.
  • It proposes two efficiency principles: using a compact, higher-level domain-specific language (DSL) to reduce low-impact reasoning while preserving key optimization levers, and using Speed-of-Light (SOL) guidance to estimate performance headroom and stop wasting search near diminishing returns.
  • The implementation, μCUTLASS, provides a DSL plus a compiler for CUTLASS-backed GPU kernels covering configuration, epilogue fusion, and multi-stage pipelines.
  • Experiments on 59 KernelBench problems show that moving from low-level code generation to DSL-based generation (GPT-5-mini) changes performance from a 0.40× regression to a 1.27× speedup over PyTorch, with SOL guidance increasing it further to 1.56×.
  • SOL-guided budgeting reduces LLM token usage by 19–43% while preserving at least 95% of the geomean speedup, and it can flag potential benchmark-gaming kernels that look fast but fail the intended computation.
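The Speed-of-Light idea above is in the spirit of a roofline bound: a kernel can run no faster than the time its FLOPs take at peak throughput or the time its data movement takes at peak bandwidth. The sketch below is a minimal illustration of that estimate, not the paper's implementation; all function names and the hardware numbers are hypothetical.

```python
# Minimal roofline-style Speed-of-Light (SOL) estimate (illustrative sketch,
# not the paper's actual analysis). The SOL time is the larger of the
# compute-bound and memory-bound lower bounds; "headroom" says how far a
# measured kernel still is from that bound.

def sol_time_s(flops: float, bytes_moved: float,
               peak_flops: float, peak_bw: float) -> float:
    """First-principles lower bound on kernel runtime, in seconds."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def headroom(measured_s: float, sol_s: float) -> float:
    """Ratio of measured time to SOL time; near 1.0 means little left to gain."""
    return measured_s / sol_s

if __name__ == "__main__":
    # Hypothetical example: an n^3 GEMM in fp16 on a GPU with
    # 300 TFLOP/s peak compute and 2 TB/s peak memory bandwidth.
    n = 4096
    flops = 2 * n**3                # multiply-accumulate count
    bytes_moved = 3 * n * n * 2     # read A and B, write C, 2 bytes/element
    t_sol = sol_time_s(flops, bytes_moved, 300e12, 2e12)
    print(f"SOL time: {t_sol * 1e3:.3f} ms")
    print(f"Headroom at 0.6 ms measured: {headroom(0.6e-3, t_sol):.2f}x")
```

A kernel with headroom near 1.0x is a candidate for deprioritization (little is left to gain), while a suspiciously sub-1.0x headroom signals a kernel that may be gaming the benchmark rather than doing the intended computation.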

Abstract

Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer and budget search. We implement these principles in μCUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low-level code to DSL code using GPT-5-mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL-guided steering raises this to 1.56x. Across model tiers, μCUTLASS + SOL guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19–43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark-gaming cases, where kernels may appear fast while failing to perform the intended computation.
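The budgeting behavior described in the abstract can be pictured as a search loop that stops once the best candidate is close to its SOL bound. The following is a schematic policy under assumed names and thresholds, not μCUTLASS's actual controller.

```python
# Schematic SOL-guided search budget (assumed policy, illustrative only):
# keep evaluating candidate kernels while the best one is still far from
# its Speed-of-Light bound, and stop early once headroom drops below a
# threshold, i.e. at the point of diminishing returns.

from typing import Callable, Iterable, Tuple

def sol_guided_search(candidates: Iterable,
                      measure: Callable[[object], float],
                      sol_s: float,
                      headroom_stop: float = 1.1,
                      max_trials: int = 10) -> Tuple[float, int]:
    """Return (best runtime in seconds, trials spent).

    candidates:    kernel variants proposed by the agent
    measure:       compiles/validates/profiles one candidate -> runtime (s)
    sol_s:         first-principles lower bound on runtime (s)
    headroom_stop: stop once best/sol_s falls at or below this ratio
    """
    best_s = float("inf")
    trials = 0
    for cand in candidates:
        if trials >= max_trials:
            break                      # out of iteration budget
        t = measure(cand)
        trials += 1
        best_s = min(best_s, t)
        if best_s / sol_s <= headroom_stop:
            break                      # near SOL: further search is low value
    return best_s, trials
```

With a policy like this, problems already near SOL consume few trials, freeing budget (and tokens) for problems with real headroom, which is the mechanism behind the 19–43% token savings reported above.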