QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

arXiv cs.CL / 3/17/2026

Key Points

  • The paper proposes a data synthesis framework to address the scarcity of high-quality real-world SVA corpora by using large-scale open-source RTLs to guide LLMs in generating real-world SVAs.
  • It uses bidirectional translation as a data selection method, addressing the lack of reliable ways to determine NL-SVA semantic equivalence.
  • The authors train CodeV-SVA, a series of SVA generation models, on the synthesized data. CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs such as GPT-5 and DeepSeek-R1.
  • The work demonstrates the viability of RTL-grounded, domain-specific LLMs for hardware verification tasks and could influence future verification tooling and methodology.
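
The bidirectional-translation idea in the key points can be illustrated with a minimal Python sketch. This is not the paper's pipeline: the direction of the round trip (SVA to natural language and back), the function names `select_consistent_pairs`, `sva_to_nl`, and `nl_to_sva`, and the whitespace-normalized string comparison are all assumptions made for illustration. A real system would presumably call LLMs for the two translations and use a much stronger equivalence check than string matching.

```python
from typing import Callable, Iterable, List, Tuple


def normalize(sva: str) -> str:
    """Collapse runs of whitespace so trivially different strings compare equal."""
    return " ".join(sva.split())


def select_consistent_pairs(
    svas: Iterable[str],
    sva_to_nl: Callable[[str], str],   # hypothetical translator: SVA -> description
    nl_to_sva: Callable[[str], str],   # hypothetical translator: description -> SVA
    equivalent: Callable[[str, str], bool] = lambda a, b: normalize(a) == normalize(b),
) -> List[Tuple[str, str]]:
    """Keep (description, SVA) pairs whose round trip reproduces the SVA.

    The default equivalence test is a placeholder (an assumption of this
    sketch); pass a stronger checker via `equivalent` if one is available.
    """
    kept = []
    for sva in svas:
        description = sva_to_nl(sva)
        round_trip = nl_to_sva(description)
        if equivalent(sva, round_trip):
            kept.append((description, sva))
    return kept


# Toy stand-ins for the two translation directions (hypothetical data).
good = "assert property (@(posedge clk) req |-> ##1 ack);"
vague = "assert property (@(posedge clk) a |-> b);"

forward = {
    good: "ack follows req one cycle later",
    vague: "a and b are related",  # too vague to recover the original
}
backward = {
    "ack follows req one cycle later": "assert property (@(posedge clk)  req |-> ##1 ack);",
    "a and b are related": "assert property (@(posedge clk) b |-> a);",
}

pairs = select_consistent_pairs(
    [good, vague],
    sva_to_nl=lambda s: forward[s],
    nl_to_sva=lambda d: backward[d],
)
```

In this toy run only the first pair survives: its description is precise enough that the back-translation matches the original assertion, while the vague description yields a different implication and is filtered out.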

Abstract

SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.
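
The abstract reports scores in Func.@1 without defining the metric here. Assuming it follows the usual pass@k convention from code-generation evaluation, with "functionally equivalent to the reference SVA" as the pass criterion, the computation could be sketched as below. The names `pass_at_k` and `func_at_k` and the input layout are hypothetical; the formula is the standard unbiased pass@k estimator, not something taken from this paper.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def func_at_k(results: List[List[bool]], k: int = 1) -> float:
    """Average the per-problem estimate over a benchmark.

    results[i][j] is True when sample j for problem i was judged functionally
    equivalent to the reference (the judging step itself is outside this sketch).
    """
    scores = [pass_at_k(len(r), sum(r), k) for r in results]
    return sum(scores) / len(scores)


# Hypothetical judgments for two problems with two samples each.
score = func_at_k([[True, False], [True, True]], k=1)
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems, so the toy input above gives (0.5 + 1.0) / 2.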