
KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

arXiv cs.LG / 3/11/2026

Signals & Early Trends | Tools & Practical Usage | Models & Research

Key Points

  • KernelCraft is a benchmark that evaluates how well large language model (LLM) agents generate and optimize low-level kernels for emerging AI accelerators with novel instruction set architectures (ISAs), where kernels are otherwise written by hand.
  • The benchmark uses a function-calling, feedback-driven workflow: the agent iteratively refines a kernel under ISA and hardware constraints, using automated feedback from compilation checks, simulation, and correctness validation against ground truth.
  • Experiments cover three emerging accelerator platforms and more than 20 machine learning tasks, each with 5 task configurations. Across four leading reasoning models, the top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, and their optimized kernels match or outperform template-based compiler baselines.
  • The authors present these results as evidence that agentic kernel generation could reduce the cost of kernel development for accelerator designers and kernel developers, easing a bottleneck that slows new hardware's path to market.
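
The feedback-driven workflow in the key points can be pictured as a loop that proposes a kernel, runs it through compilation, simulation, and validation, and hands the result back to the agent. The following is a minimal sketch under stated assumptions, not KernelCraft's actual harness: the three-opcode toy ISA, the one-cycle-per-instruction cost model, the `Feedback` record, and the scripted agent that stands in for an LLM are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    stage: str                    # stage that produced the verdict
    ok: bool                      # True only if every stage passed
    message: str                  # text handed back to the agent
    cycles: Optional[int] = None  # simulated cost, when simulation ran


def toy_compile(kernel: str) -> Optional[str]:
    """Hypothetical ISA check: return an error string, or None on success."""
    for line in kernel.splitlines():
        if not line.strip():
            continue
        opcode = line.split()[0]
        if opcode not in {"LOAD", "MAC", "STORE"}:
            return f"unknown opcode: {opcode}"
    return None


def toy_simulate(kernel: str) -> int:
    """Hypothetical cost model: one cycle per instruction."""
    return len([line for line in kernel.splitlines() if line.strip()])


def toy_validate(kernel: str) -> bool:
    """Hypothetical correctness check: the result must be stored."""
    return "STORE" in kernel


def run_checks(kernel: str) -> Feedback:
    """Run compile, simulate, and validate in order; stop at the first failure."""
    error = toy_compile(kernel)
    if error is not None:
        return Feedback("compile", False, error)
    cycles = toy_simulate(kernel)
    if not toy_validate(kernel):
        return Feedback("validate", False, "output mismatch vs. reference", cycles)
    return Feedback("validate", True, "ok", cycles)


History = List[Tuple[str, Feedback]]


def refine(propose: Callable[[History], str], max_steps: int = 4):
    """Ask the agent for a kernel, check it, and keep the fastest valid one."""
    history: History = []
    best: Optional[Tuple[str, Feedback]] = None
    for _ in range(max_steps):
        kernel = propose(history)
        fb = run_checks(kernel)
        history.append((kernel, fb))
        if fb.ok and (best is None or fb.cycles < best[1].cycles):
            best = (kernel, fb)
    return best, history


# Scripted stand-in for an LLM agent: each step returns the next candidate.
CANDIDATES = [
    "LOAD a\nFMA a b",                    # rejected: opcode not in the toy ISA
    "LOAD a\nLOAD b\nMAC a b",            # compiles but never stores a result
    "LOAD a\nLOAD b\nMAC a b\nSTORE c",   # valid
    "LOAD a\nMAC a b\nSTORE c",           # valid and shorter
]


def scripted_agent(history: History) -> str:
    return CANDIDATES[min(len(history), len(CANDIDATES) - 1)]


best, history = refine(scripted_agent)
```

In a real setup the `propose` callable would wrap a model call that receives the accumulated feedback messages, and the three toy checks would be replaced by the target platform's compiler, simulator, and reference comparison.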


arXiv:2603.08721 (cs)
[Submitted on 10 Feb 2026]

Title: KernelCraft: Benchmarking for Agentic Close-to-Metal Kernel Generation on Emerging Hardware

Authors: Jiayi Nie and 11 other authors
Abstract: New AI accelerators with novel instruction set architectures (ISAs) often require developers to manually craft low-level kernels -- a time-consuming, laborious, and error-prone process that cannot scale across diverse hardware targets. This prevents emerging hardware platforms from reaching the market efficiently. While prior LLM-based code generation has shown promise in mature GPU ecosystems, it remains unclear whether agentic LLM systems can quickly produce valid and efficient kernels for emerging hardware with new ISAs. We present KernelCraft: the first benchmark to evaluate an LLM agent's ability to generate and optimize low-level kernels for customized accelerators via a function-calling, feedback-driven workflow. Within KernelCraft, the agent refines kernels under ISA and hardware constraints using automated feedback derived from compilation checks, simulation, and correctness validation against ground truth. In our experiments, we assess agent performance across three emerging accelerator platforms on more than 20 ML tasks, each with 5 diverse task configurations, with special evaluation of task configuration complexity. Across four leading reasoning models, top agents produce functionally valid kernels for previously unseen ISAs within a few refinement steps, with optimized kernels that match or outperform template-based compiler baselines. With that, we demonstrate the potential for reducing the cost of kernel development for accelerator designers and kernel developers.
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as: arXiv:2603.08721 [cs.AR]
  (or arXiv:2603.08721v1 [cs.AR] for this version)
  https://doi.org/10.48550/arXiv.2603.08721
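
The abstract describes an evaluation grid of tasks and task configurations, with agent kernels compared against template-based compiler baselines. One way such a grid could be summarized is sketched below. The metric names (`valid_rate`, `geomean_speedup`, `match_or_beat_rate`), the record layout, and the task names are assumptions made for illustration, and the numbers are placeholders, not results from the paper.

```python
import math
from typing import Dict, List


def summarize(results: List[Dict]) -> Dict[str, float]:
    """Aggregate per-configuration records into three summary metrics.

    Each record holds baseline_cycles and agent_cycles; agent_cycles is None
    when the agent never produced a functionally valid kernel.
    """
    valid = [r for r in results if r["agent_cycles"] is not None]
    # Speedup > 1.0 means the agent kernel is faster than the baseline.
    speedups = [r["baseline_cycles"] / r["agent_cycles"] for r in valid]
    if speedups:
        geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    else:
        geomean = float("nan")
    return {
        "valid_rate": len(valid) / len(results),
        "geomean_speedup": geomean,
        "match_or_beat_rate": sum(s >= 1.0 for s in speedups) / len(results),
    }


# Placeholder records (hypothetical tasks and cycle counts).
results = [
    {"task": "gemm", "config": 0, "baseline_cycles": 400, "agent_cycles": 200},
    {"task": "gemm", "config": 1, "baseline_cycles": 100, "agent_cycles": 200},
    {"task": "softmax", "config": 0, "baseline_cycles": 300, "agent_cycles": None},
    {"task": "softmax", "config": 1, "baseline_cycles": 300, "agent_cycles": 300},
]

summary = summarize(results)
```

A geometric mean is used here because speedups are ratios; an arithmetic mean would let a single large speedup mask several slowdowns.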

Submission history

From: Jiayi Nie
[v1] Tue, 10 Feb 2026 14:52:02 UTC (1,343 KB)