ChipCraftBrain: Validation-First RTL Generation via Multi-Agent Orchestration

arXiv cs.AI · April 23, 2026


Key Points

  • ChipCraftBrain is a new framework for generating RTL from natural-language specs that targets the low functional correctness of prior single-shot LLM approaches (roughly 60–65% on standard benchmarks).
  • It uses adaptive multi-agent orchestration with six specialized agents controlled by a PPO policy (168-dimensional state), alongside an evaluated alternative MPC-style planner.
  • The system combines hybrid symbolic-neural reasoning—solving K-map and truth-table tasks algorithmically—with agent-based handling of waveform timing and general RTL generation.
  • Knowledge-augmented generation is driven by a pattern base (321 patterns) plus focus-aware retrieval from 971 open-source RTL implementations, and specifications are decomposed hierarchically into dependency-ordered sub-modules with interface synchronization.
  • Results show strong benchmark gains: 97.2% mean pass@1 on VerilogEval-Human, 94.7% mean pass@1 on a CVDP non-agentic subset, and an 8/8 lint-passing RISC-V SoC case study validated on FPGA where monolithic generation failed.
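The hybrid symbolic path described above can be made concrete with a minimal sketch: a truth-table specification is solvable by direct construction rather than by an LLM. The function below (names and structure are illustrative, not the paper's actual solver) emits an unminimized sum-of-products Verilog module from a list of minterms; a real symbolic solver would additionally minimize the expression, e.g. via K-map grouping or Quine-McCluskey.

```python
def truth_table_to_verilog(module, inputs, output, minterms):
    """Emit a sum-of-products Verilog module for a truth table.

    `minterms` lists the input rows (as integers, MSB-first) where
    `output` is 1. No minimization is attempted here.
    """
    n = len(inputs)
    terms = []
    for m in minterms:
        bits = format(m, f"0{n}b")               # binary row of the table
        lits = [name if b == "1" else f"~{name}"
                for name, b in zip(inputs, bits)]
        terms.append("(" + " & ".join(lits) + ")")
    expr = " | ".join(terms) if terms else "1'b0"
    ports = ", ".join(f"input {name}" for name in inputs)
    return (f"module {module}({ports}, output {output});\n"
            f"  assign {output} = {expr};\n"
            f"endmodule\n")

# XOR of two inputs: output is 1 on rows 01 and 10
print(truth_table_to_verilog("xor2", ["a", "b"], "y", [1, 2]))
# assign line: assign y = (~a & b) | (a & ~b);
```

Because the construction is deterministic, this class of problem contributes zero LLM calls and zero hallucination risk, which is presumably why the framework routes K-map and truth-table tasks away from the neural agents.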

Abstract

Large Language Models (LLMs) show promise for generating Register-Transfer Level (RTL) code from natural language specifications, but single-shot generation achieves only 60–65% functional correctness on standard benchmarks. Multi-agent approaches such as MAGE reach 95.9% on VerilogEval yet remain untested on harder industrial benchmarks such as NVIDIA's CVDP, lack synthesis awareness, and incur high API costs. We present ChipCraftBrain, a framework combining symbolic-neural reasoning with adaptive multi-agent orchestration for automated RTL generation. Four innovations drive the system: (1) adaptive orchestration over six specialized agents via a PPO policy over a 168-dimensional state (an alternative world-model MPC planner is also evaluated); (2) a hybrid symbolic-neural architecture that solves K-map and truth-table problems algorithmically while specialized agents handle waveform timing and general RTL; (3) knowledge-augmented generation from a 321-pattern base plus 971 open-source reference implementations with focus-aware retrieval; and (4) hierarchical specification decomposition into dependency-ordered sub-modules with interface synchronization. On VerilogEval-Human, ChipCraftBrain achieves 97.2% mean pass@1 (range 96.15–98.72% across 7 runs, best 154/156), on par with ChipAgents (97.4%, self-reported) and ahead of MAGE (95.9%). On a 302-problem non-agentic subset of CVDP spanning five task categories, we reach 94.7% mean pass@1 (286/302, averaged over 3 runs), a 36–60 percentage-point lift per category over the published single-shot baseline; we additionally lead three of four categories shared with NVIDIA's ACE-RTL despite using roughly 30x fewer per-problem attempts. A RISC-V SoC case study demonstrates hierarchical decomposition generating 8/8 lint-passing modules (689 LOC) validated on FPGA, where monolithic generation fails entirely.
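Innovation (4), dependency-ordered decomposition, reduces at its core to a topological sort over the sub-module dependency graph: each module is generated only after the modules it instantiates, so its interfaces can be synchronized against already-fixed ports. A minimal sketch using Python's standard library (the module names below are hypothetical; the real decomposition is derived from the spec, not hand-coded):

```python
from graphlib import TopologicalSorter

# Hypothetical SoC sub-modules mapped to the sub-modules they depend on.
deps = {
    "alu":      set(),
    "regfile":  set(),
    "decoder":  set(),
    "datapath": {"alu", "regfile"},
    "control":  {"decoder"},
    "core":     {"datapath", "control"},
    "soc_top":  {"core"},
}

# Generation order: every module appears after all of its dependencies,
# so each generation step sees only already-finalized interfaces.
order = list(TopologicalSorter(deps).static_order())
print(order)  # leaf modules first, "soc_top" last
```

`TopologicalSorter` also raises `CycleError` on circular dependencies, which is a useful early sanity check on a machine-produced decomposition before any RTL is generated.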