QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

arXiv cs.CV / 4/29/2026


Key Points

  • The paper introduces QCalEval, the first benchmark for evaluating how well vision-language models (VLMs) understand quantum calibration plots using 243 samples across 87 scenario types and 22 experimental families.
  • It covers superconducting qubits and neutral atoms, and tests six question types under both zero-shot and in-context learning settings.
  • Results show that the best general-purpose zero-shot model achieves a mean score of 72.3, while many open-weight models perform worse when given multi-image in-context learning prompts.
  • Frontier closed models improve much more in the multi-image in-context learning setting, indicating a meaningful capability gap versus many open-weight systems.
  • A supervised fine-tuning (SFT) ablation at the 9B-parameter scale improves zero-shot performance but does not fully close the multimodal in-context learning gap; the authors also release an open-weight reference model, NVIDIA Ising Calibration 1, which reaches a 74.7 zero-shot average score.

Abstract

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task; yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches a 74.7 zero-shot average score.
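To make the reported numbers concrete, the sketch below shows one plausible way a benchmark like this could aggregate per-sample scores into the per-setting mean scores quoted above (e.g. 72.3 zero-shot). This is a hypothetical illustration: the function name `aggregate_scores`, the tuple schema, the [0, 100] score range, and the macro-average over question types are all assumptions, not QCalEval's actual scoring protocol, which the summary does not specify.

```python
from statistics import mean

def aggregate_scores(results):
    """Hypothetical QCalEval-style aggregation (not the paper's method).

    results: list of (question_type, setting, score) tuples,
             where setting is e.g. "zero_shot" or "icl" and
             score is assumed to lie in [0, 100].
    Returns {setting: {question_type: mean_score, ..., "mean": macro_avg}}.
    """
    by_setting = {}
    for qtype, setting, score in results:
        by_setting.setdefault(setting, {}).setdefault(qtype, []).append(score)

    summary = {}
    for setting, per_type in by_setting.items():
        # Average within each question type first...
        type_means = {q: mean(scores) for q, scores in per_type.items()}
        # ...then macro-average across question types (an assumption;
        # the benchmark might instead micro-average over all samples).
        summary[setting] = {**type_means, "mean": mean(type_means.values())}
    return summary

# Toy example with two of the (hypothetical) six question types:
results = [
    ("anomaly_detection", "zero_shot", 70.0),
    ("anomaly_detection", "icl", 80.0),
    ("parameter_readout", "zero_shot", 74.0),
    ("parameter_readout", "icl", 78.0),
]
summary = aggregate_scores(results)
# summary["zero_shot"]["mean"] → 72.0; summary["icl"]["mean"] → 79.0
```

Comparing the two settings' means in this way is what would reveal the pattern the paper reports: some models' ICL mean falling below their zero-shot mean, others' rising above it.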