CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

arXiv cs.LG / 4/3/2026


Key Points

  • CuTeGen is an agentic LLM-based framework that automates the generate–test–refine cycle for producing and improving high-performance GPU kernels with correctness preserved via execution-based validation.
  • Instead of one-shot kernel generation or brute-force search, it progressively refines a single evolving kernel, using structured debugging and staged optimization guided by the CuTe abstraction.
  • CuTeGen generates kernels in the CuTe abstraction layer to expose key performance structures (e.g., tiling and data movement) in a representation that is more stable for iterative modification.
  • The framework delays the integration of profiling feedback until later optimization stages and uses workload-aware optimization prompts to steer improvements toward competitive performance.
  • Experiments on matrix multiplication and activation workloads show CuTeGen can generate functionally correct kernels and reach performance competitive with optimized library baselines.
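The generate–test–refine cycle described above can be sketched as a simple control loop. This is a hedged illustration only: the function names (`generate_kernel`, `run_and_validate`, `refine`) are hypothetical placeholders, not CuTeGen's actual API, and the stubs stand in for LLM calls and kernel execution.

```python
# Minimal sketch of a generate-test-refine loop over a single evolving kernel,
# as CuTeGen's key points describe. All names below are hypothetical stand-ins.

def generate_kernel(task):
    # Stand-in for an LLM call that emits an initial CuTe kernel as source text.
    return f"// initial kernel for {task}"

def run_and_validate(kernel_src):
    # Stand-in for compiling and executing the kernel, then comparing its
    # output against a reference implementation (execution-based validation).
    return "initial" not in kernel_src  # pretend the first draft fails

def refine(kernel_src, stage):
    # Stand-in for a structured-debugging / staged-optimization LLM call that
    # rewrites the one evolving kernel instead of sampling fresh candidates.
    return kernel_src.replace("initial", f"stage-{stage}")

def generate_test_refine(task, max_iters=4):
    kernel = generate_kernel(task)
    for stage in range(max_iters):
        if run_and_validate(kernel):
            return kernel  # correct kernel: hand off to optimization stages
        kernel = refine(kernel, stage)
    raise RuntimeError("no correct kernel within the iteration budget")

print(generate_test_refine("gemm"))
```

The key structural point is that refinement mutates one kernel in place across iterations, rather than maintaining a population of candidates as search-based approaches do.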

Abstract

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate–test–refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.
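Execution-based validation, as the abstract describes it, amounts to running the candidate kernel and a trusted reference on the same inputs and accepting the candidate only if the outputs agree within a numeric tolerance. The sketch below illustrates that idea for a matrix-multiplication workload; both "kernels" are plain-Python stand-ins, whereas a real harness would launch compiled CUDA/CuTe code and compare device outputs.

```python
# Illustrative execution-based validation for a matmul workload.
# candidate_matmul is a stand-in for a generated kernel under test.

def reference_matmul(A, B):
    # Trusted reference: naive triple loop over row-major nested lists.
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def candidate_matmul(A, B):
    # Placeholder for the generated kernel; here it just calls the reference.
    return reference_matmul(A, B)

def validate(candidate, reference, inputs, atol=1e-6):
    # Accept the candidate only if every output element is within atol
    # of the reference result on the same inputs.
    got, want = candidate(*inputs), reference(*inputs)
    return all(abs(g - w) <= atol
               for got_row, want_row in zip(got, want)
               for g, w in zip(got_row, want_row))

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(validate(candidate_matmul, reference_matmul, (A, B)))  # → True
```

Using an element-wise tolerance rather than exact equality matters for GPU kernels, where reordered floating-point accumulation legitimately perturbs low-order bits.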