HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

arXiv cs.AI · April 17, 2026

📰 News · Models & Research

Key Points

  • The paper introduces HWE-Bench, a large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks, addressing a gap in prior component-only hardware benchmarks.
  • HWE-Bench contains 417 task instances drawn from historical bug-fix pull requests across six major open-source projects (Verilog/SystemVerilog and Chisel), including RISC-V cores, SoCs, and security roots-of-trust.
  • Each task runs in a fully containerized environment and requires agents to resolve an actual bug report, with correctness verified via the project’s native simulation and regression workflows.
  • In evaluations of seven LLMs using four agent frameworks, the top agent solves 70.7% of tasks overall, but performance declines to below 65% on complex SoC-level projects despite exceeding 90% on smaller cores.
  • Failure analysis traces most agent shortcomings to three debugging stages: fault localization, hardware-semantic reasoning, and coordination across multiple artifacts spanning RTL, configuration, and verification. The authors suggest concrete directions for improving hardware-aware agents.
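
The per-task protocol described above (an agent proposes a fix inside a containerized repo snapshot, and the project's own regression flow decides correctness) can be sketched in outline. Everything below is illustrative: the class fields, function names, and stub callables are assumptions for exposition, not the benchmark's actual harness or schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskInstance:
    """One repository-level repair task: a repo snapshot plus a bug report.
    Fields are hypothetical, chosen only to mirror the description above."""
    repo: str
    bug_report: str

def evaluate(task: TaskInstance,
             agent: Callable[[TaskInstance], str],
             run_regression: Callable[[str, str], bool]) -> bool:
    """Ask the agent for a patch, then validate it with the project's
    native simulation/regression flow (stubbed here as a callable).
    The agent never sees the oracle; only the regression verdict counts."""
    patch = agent(task)
    return run_regression(task.repo, patch)

# Toy usage with stand-in stubs (a real harness would run the container's
# simulation and regression suites instead of a string check).
task = TaskInstance(repo="example/rv-core",
                    bug_report="ALU overflow flag mishandled")
stub_agent = lambda t: "patch: fix overflow flag update in alu.sv"
stub_regression = lambda repo, patch: "overflow" in patch
resolved = evaluate(task, stub_agent, stub_regression)
```

A resolve rate over the benchmark would then be the fraction of the 417 task instances for which `evaluate` returns `True`.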

Abstract

Existing benchmarks for hardware design primarily evaluate Large Language Models (LLMs) on isolated, component-level tasks such as generating HDL modules from specifications, leaving repository-scale evaluation unaddressed. We introduce HWE-Bench, the first large-scale, repository-level benchmark for evaluating LLM agents on real-world hardware bug repair tasks. HWE-Bench comprises 417 task instances derived from real historical bug-fix pull requests across six major open-source projects spanning both Verilog/SystemVerilog and Chisel, covering RISC-V cores, SoCs, and security roots-of-trust. Each task is grounded in a fully containerized environment where the agent must resolve a real bug report, with correctness validated through the project's native simulation and regression flows. The benchmark is built through a largely automated pipeline that enables efficient expansion to new repositories. We evaluate seven LLMs with four agent frameworks and find that the best agent resolves 70.7% of tasks overall, with performance exceeding 90% on smaller cores but dropping below 65% on complex SoC-level projects. We observe larger performance gaps across models than commonly reported on software benchmarks, and difficulty is driven by project scope and bug-type distribution rather than code size alone. Our failure analysis traces agent failures to three stages of the debugging process: fault localization, hardware-semantic reasoning, and cross-artifact coordination across RTL, configuration, and verification components, providing concrete directions for developing more capable hardware-aware agents.