Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

arXiv cs.CL / 3/31/2026

Key Points

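Kernel-Smith is a framework for generating high-performance GPU kernels and operators. It combines an evaluation-driven evolutionary agent with an evolution-oriented post-training recipe, aiming to improve both search stability and final performance.
On the agent side, it maintains a population of executable candidates and iteratively improves them using an archive of top-performing, diverse programs together with structured feedback on compilation, correctness, and speedup.
For reliability, the authors build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs.
On the training (post-training) side, long-horizon evolution trajectories are converted into step-centric supervision and reinforcement learning signals, so that the model is optimized to act as a strong local improver inside the evolutionary loop.
On KernelBench with the Triton backend, Kernel-Smith-235B-RL surpasses frontier proprietary models (Gemini-3.0-pro, Claude-4.6-opus) in average speedup ratio. On Maca, Kernel-Smith-MACA-30B outperforms larger prior models (DeepSeek-V3.2-think, Qwen3-235B-2507-think). The authors also report production-oriented contributions to SGLang and LMDeploy.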
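
The agent loop summarized in the key points above (a population of executable candidates, an archive of top-performing and diverse programs, and structured feedback on compilation, correctness, and speedup) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Feedback`, `Candidate`, `select_archive`, and `evolve` are hypothetical, the diversity criterion is a crude "distinct source text" proxy, and the toy proposer and evaluator stand in for an LLM and a real backend evaluation service. It is not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Feedback:
    """Structured execution feedback for one candidate (illustrative fields)."""
    compiled: bool
    correct: bool
    speedup: float  # assumed to mean baseline_time / candidate_time


@dataclass(frozen=True)
class Candidate:
    source: str
    feedback: Feedback


def select_archive(pool: List[Candidate], size: int) -> List[Candidate]:
    """Keep compiled, correct candidates, fastest first, one per distinct source."""
    valid = [c for c in pool if c.feedback.compiled and c.feedback.correct]
    valid.sort(key=lambda c: c.feedback.speedup, reverse=True)
    archive: List[Candidate] = []
    seen = set()
    for cand in valid:
        if cand.source in seen:  # crude diversity proxy (assumption)
            continue
        archive.append(cand)
        seen.add(cand.source)
        if len(archive) == size:
            break
    return archive


def evolve(
    seed: str,
    propose: Callable[[Candidate], str],
    evaluate: Callable[[str], Feedback],
    steps: int,
    archive_size: int = 4,
) -> List[Candidate]:
    """Repeatedly revise archived candidates and re-select the archive."""
    archive = [Candidate(seed, evaluate(seed))]
    for _ in range(steps):
        children = [
            Candidate(src, evaluate(src))
            for src in (propose(parent) for parent in archive)
        ]
        # Fall back to the previous archive if nothing valid survives.
        archive = select_archive(archive + children, archive_size) or archive
    return archive


# Toy stand-ins: each "+" appended to the source counts as one optimization.
def toy_propose(parent: Candidate) -> str:
    return parent.source + "+"


def toy_evaluate(source: str) -> Feedback:
    return Feedback(compiled=True, correct=True,
                    speedup=1.0 + 0.5 * source.count("+"))


final = evolve("k", toy_propose, toy_evaluate, steps=3)
print(final[0].source, final[0].feedback.speedup)
```

In a real setting, `propose` would prompt the model with the parent program plus its feedback, and `evaluate` would call the backend-specific evaluation service; the selection step is where a more careful diversity measure would go.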

Abstract

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
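
The training-side idea of "retaining correctness-preserving, high-gain revisions" can be pictured as a filter over individual revision steps of an evolution trajectory. The sketch below is a guess at one possible form, not the authors' procedure: the `Step` record, the definition of gain as `child_speedup / parent_speedup`, and the `min_gain` threshold of 1.1 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One revision in a trajectory: parent program -> child program."""
    parent_source: str
    child_source: str
    parent_speedup: float
    child_speedup: float
    child_correct: bool


def retain_steps(trajectory: List[Step], min_gain: float = 1.1) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs whose child is correct and clearly faster.

    Gain is taken to be child_speedup / parent_speedup (an assumption).
    """
    kept = []
    for step in trajectory:
        if not step.child_correct or step.parent_speedup <= 0:
            continue  # drop incorrect revisions and unusable parents
        if step.child_speedup / step.parent_speedup >= min_gain:
            kept.append((step.parent_source, step.child_source))
    return kept


trajectory = [
    Step("v0", "v1", 1.0, 1.5, True),    # correct, gain 1.5: kept
    Step("v1", "v2", 2.0, 2.1, True),    # correct, gain 1.05: too small
    Step("v1", "v3", 1.5, 3.0, False),   # fast but incorrect: dropped
    Step("v1", "v4", 1.5, 2.25, True),   # correct, gain 1.5: kept
]
pairs = retain_steps(trajectory)
print(pairs)
```

Each retained pair could then serve as one step-level training example (input: parent program and its feedback; target: the improved child), which is what makes the supervision "step-centric" rather than tied to whole trajectories.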
