KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

arXiv cs.LG / 3/11/2026

Signals & Early Trends | Tools & Practical Usage | Models & Research


Key Points

KernelCraft is a new benchmark designed to evaluate large language model (LLM) agents' ability to generate and optimize low-level kernels for emerging AI accelerators with novel ISAs, addressing the challenges of manual kernel development.
The benchmark uses a function-calling, feedback-driven workflow where the agent iteratively refines kernels using automated feedback from compilation, simulation, and correctness validation.
Experiments on three emerging accelerator platforms across over 20 machine learning tasks show that top LLM agents can produce valid and optimized kernels within a few refinement steps, sometimes outperforming traditional template-based compiler methods.
The authors argue that these results show the potential of agentic workflows to reduce the cost of kernel development for accelerator designers and kernel developers working with new hardware.


arXiv:2603.08721 (cs)

[Submitted on 10 Feb 2026]

Title: KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

Authors: Jiayi Nie, Haoran Wu, Yao Lai, Zeyu Cao, Cheng Zhang, Binglei Lou, Erwei Wang, Jianyi Cheng, Timothy M. Jones, Robert Mullins, Rika Antonova, Yiren Zhao


Abstract: New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels -- a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark to evaluate an LLM agent's ability to generate and optimize low-level kernels for customized accelerators via a function-calling, feedback-driven workflow. Within KernelCraft, the agent refines kernels under ISA and hardware constraints using automated feedback derived from compilation checks, simulation, and correctness validation against ground truth. In our experiments, we assess agent performance across three emerging accelerator platforms on more than 20 ML tasks, each with 5 diverse task configurations, with special evaluation of task configuration complexity. Across four leading reasoning models, top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, with optimized kernels that match or outperform template-based compiler baselines. With that, we demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers.
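The feedback-driven workflow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two-opcode accumulator "ISA" (`ADDI`, `MULI`), the helper names (`compile_check`, `simulate`, `evaluate`, `refine`), and the scripted agent that stands in for an LLM. It shows only the general shape of the loop: propose a kernel, run a compile check, simulate, validate against a ground-truth reference, and return the feedback to the agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical toy ISA (not from the paper): one accumulator, two opcodes.
OPS = {
    "ADDI": lambda acc, n: acc + n,
    "MULI": lambda acc, n: acc * n,
}

@dataclass
class Feedback:
    stage: str                    # "compile", "validate", or "ok"
    message: str
    cycles: Optional[int] = None  # set only when every check passes

def compile_check(kernel: str) -> Optional[str]:
    """Return an error message if any line fails to assemble, else None."""
    for lineno, line in enumerate(kernel.splitlines(), 1):
        parts = line.split()
        if len(parts) != 2 or parts[0] not in OPS:
            return f"line {lineno}: cannot assemble {line!r}"
        try:
            int(parts[1])
        except ValueError:
            return f"line {lineno}: bad immediate {parts[1]!r}"
    return None

def simulate(kernel: str, x: int) -> Tuple[int, int]:
    """Run the kernel with x in the accumulator; return (result, cycles)."""
    acc, cycles = x, 0
    for line in kernel.splitlines():
        op, arg = line.split()
        acc = OPS[op](acc, int(arg))
        cycles += 1  # toy cost model: one cycle per instruction
    return acc, cycles

def evaluate(kernel: str, reference: Callable[[int], int],
             inputs: List[int]) -> Feedback:
    """Compile check, then simulate and compare against the reference."""
    error = compile_check(kernel)
    if error:
        return Feedback("compile", error)
    cycles = 0
    for x in inputs:
        got, cycles = simulate(kernel, x)
        if got != reference(x):
            return Feedback(
                "validate", f"input {x}: expected {reference(x)}, got {got}")
    return Feedback("ok", "all checks passed", cycles)

def refine(agent: Callable[[List[Feedback]], str],
           reference: Callable[[int], int], inputs: List[int],
           max_steps: int = 5) -> Tuple[Optional[str], List[Feedback]]:
    """Ask the agent for kernels until one passes or the budget runs out."""
    history: List[Feedback] = []
    for _ in range(max_steps):
        kernel = agent(history)  # the agent sees all feedback so far
        feedback = evaluate(kernel, reference, inputs)
        history.append(feedback)
        if feedback.stage == "ok":
            return kernel, history
    return None, history

def scripted_agent(history: List[Feedback]) -> str:
    """Stand-in for an LLM: replays fixed candidates, one per step."""
    candidates = ["DBL 0\nADDI 1", "MULI 2\nADDI 2", "MULI 2\nADDI 1"]
    return candidates[min(len(history), len(candidates) - 1)]

# Target computation: y = 2*x + 1.
kernel, history = refine(scripted_agent, lambda x: 2 * x + 1, inputs=[0, 1, 5])
print([fb.stage for fb in history])  # ['compile', 'validate', 'ok']
```

In this toy run the first candidate uses an unknown opcode, the second assembles but computes the wrong value, and the third passes. A real harness would replace the scripted agent with model calls and the toy functions with the target platform's compiler and simulator.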

Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as: arXiv:2603.08721 [cs.AR] (or arXiv:2603.08721v1 [cs.AR] for this version)
DOI: https://doi.org/10.48550/arXiv.2603.08721 (arXiv-issued DOI via DataCite)

Submission history

From: Jiayi Nie
[v1] Tue, 10 Feb 2026 14:52:02 UTC (1,343 KB)



