HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

arXiv cs.AI · April 17, 2026

📰 News · Models & Research

Key Points

  • The paper introduces HWE-Bench, a large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks, addressing a gap in prior component-only hardware benchmarks.
  • HWE-Bench contains 417 task instances drawn from historical bug-fix pull requests across six major open-source projects (Verilog/SystemVerilog and Chisel), including RISC-V cores, SoCs, and security roots-of-trust.
  • Each task runs in a fully containerized environment and requires agents to resolve an actual bug report, with correctness verified via the project’s native simulation and regression workflows.
  • In evaluations of seven LLMs using four agent frameworks, the top agent solves 70.7% of tasks overall, but performance declines to below 65% on complex SoC-level projects despite exceeding 90% on smaller cores.
  • Failure analysis traces most agent shortcomings to three debugging stages: fault localization, hardware-semantic reasoning, and coordination across multiple artifacts spanning RTL, configuration, and verification. The authors suggest concrete directions for improving hardware-aware agents.
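
The per-task protocol described above (an agent proposes a fix inside a containerized repo snapshot, and the project's own regression flow decides correctness) can be sketched in outline. Everything below is illustrative: the class fields, function names, and stub callables are assumptions for exposition, not the benchmark's actual harness or schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskInstance:
    """One repository-level repair task: a repo snapshot plus a bug report.
    Fields are hypothetical, chosen only to mirror the description above."""
    repo: str
    bug_report: str

def evaluate(task: TaskInstance,
             agent: Callable[[TaskInstance], str],
             run_regression: Callable[[str, str], bool]) -> bool:
    """Ask the agent for a patch, then validate it with the project's
    native simulation/regression flow (stubbed here as a callable).
    The agent never sees the oracle; only the regression verdict counts."""
    patch = agent(task)
    return run_regression(task.repo, patch)

# Toy usage with stand-in stubs (a real harness would run the container's
# simulation and regression suites instead of a string check).
task = TaskInstance(repo="example/rv-core",
                    bug_report="ALU overflow flag mishandled")
stub_agent = lambda t: "patch: fix overflow flag update in alu.sv"
stub_regression = lambda repo, patch: "overflow" in patch
resolved = evaluate(task, stub_agent, stub_regression)
```

A resolve rate over the benchmark would then be the fraction of the 417 task instances for which `evaluate` returns `True`.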

Abstract

Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.