QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

arXiv cs.AI / 4/13/2026


Key Points

  • The paper introduces QuanBench+, a unified benchmark for LLM-based quantum code generation that aligns tasks across Qiskit, PennyLane, and Cirq to reduce confounding from framework-specific knowledge.
  • It includes 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation, and evaluates models using executable functional tests plus metrics such as Pass@1/Pass@5 and KL-divergence-based acceptance for probabilistic outputs.
  • The study measures not only one-shot performance but also “feedback-based repair,” where models revise code after runtime errors or incorrect answers, leading to substantial gains in the best scores across all three frameworks.
  • Reported best one-shot Pass@1 results are 59.5% (Qiskit), 54.8% (Cirq), and 42.9% (PennyLane), while feedback-based repair boosts them to 83.3%, 76.2%, and 66.7% respectively.
  • Overall, the findings suggest meaningful progress but indicate that reliable multi-framework quantum code generation is still largely unresolved and remains strongly dependent on framework-specific knowledge.
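For the probabilistic tasks above, the benchmark accepts an output when its measured outcome distribution is close to the ideal one under KL divergence. A minimal sketch of such a check is shown below; the `kl_divergence` and `accepts` helpers and the 0.1 threshold are illustrative assumptions, not the paper's exact procedure:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) over a shared set of measurement outcomes.

    p: reference (ideal) distribution; q: empirical distribution from
    running the generated circuit. eps guards against log(0).
    """
    keys = set(p) | set(q)
    return sum(
        p.get(k, 0.0) * math.log((p.get(k, 0.0) + eps) / (q.get(k, 0.0) + eps))
        for k in keys
        if p.get(k, 0.0) > 0.0
    )

def accepts(reference, measured, threshold=0.1):
    """Accept a probabilistic output if its divergence from the
    reference distribution falls below a (hypothetical) threshold."""
    return kl_divergence(reference, measured) < threshold

# Example: a near-ideal Bell-state histogram passes; a skewed one fails.
ideal = {"00": 0.5, "11": 0.5}
good = {"00": 0.49, "11": 0.51}
bad = {"00": 0.9, "11": 0.1}
print(accepts(ideal, good))  # True
print(accepts(ideal, bad))   # False
```

Comparing distributions rather than exact bitstrings matters here because quantum circuits are sampled, so even a correct program never reproduces the ideal histogram exactly.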

Abstract

Large Language Models (LLMs) are increasingly used for code generation, yet quantum code generation is still evaluated mostly within single frameworks, making it difficult to separate quantum reasoning from framework familiarity. We introduce QuanBench+, a unified benchmark spanning Qiskit, PennyLane, and Cirq, with 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation. We evaluate models with executable functional tests, report Pass@1 and Pass@5, and use KL-divergence-based acceptance for probabilistic outputs. We additionally study Pass@1 after feedback-based repair, where a model may revise code after a runtime error or wrong answer. Across frameworks, the strongest one-shot scores reach 59.5% in Qiskit, 54.8% in Cirq, and 42.9% in PennyLane; with feedback-based repair, the best scores rise to 83.3%, 76.2%, and 66.7%, respectively. These results show clear progress, but also that reliable multi-framework quantum code generation remains unsolved and still depends strongly on framework-specific knowledge.
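The feedback-based repair protocol in the abstract can be sketched as a simple loop: generate once, execute, and if the run fails, hand the error or wrong-answer report back to the model for a revision. The `generate` and `run` callables below are hypothetical stand-ins for an LLM call and a sandboxed execution, not the benchmark's actual interface:

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str

def solve_with_repair(task, generate, run, max_repairs=1):
    """Sketch of feedback-based repair: one initial attempt, then up to
    `max_repairs` revisions driven by execution feedback.

    generate(prompt) -> code string (stand-in for an LLM call)
    run(code) -> (passed, feedback) (stand-in for sandboxed tests)
    """
    code = generate(task.prompt)
    for _ in range(max_repairs + 1):
        passed, feedback = run(code)
        if passed:
            return True, code
        # Append the runtime error / wrong-answer report and retry.
        code = generate(task.prompt + "\n\nPrevious attempt failed:\n" + feedback)
    return False, code

# Toy demo: a "model" that only succeeds after seeing feedback once.
def fake_generate(prompt):
    return "fixed" if "failed" in prompt else "buggy"

def fake_run(code):
    return (code == "fixed", "RuntimeError: wrong state amplitudes")

ok, final = solve_with_repair(
    Task("Prepare a Bell state in Qiskit"), fake_generate, fake_run
)
print(ok)  # True
```

Because acceptance is decided by executable tests, this loop needs no ground-truth code; the same pass/fail signal that scores Pass@1 also supplies the repair feedback.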