QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

arXiv cs.CV / 4/29/2026


Key Points

  • The paper introduces QCalEval, the first benchmark for evaluating how well vision-language models (VLMs) understand quantum calibration plots using 243 samples across 87 scenario types and 22 experimental families.
  • It covers superconducting qubits and neutral atoms, and tests six question types under both zero-shot and in-context learning settings.
  • Results show that the best general-purpose zero-shot model achieves a mean score of 72.3, while many open-weight models perform worse when given multi-image in-context learning prompts.
  • Frontier closed models improve much more in the multi-image in-context learning setting, indicating a meaningful capability gap versus many open-weight systems.
  • A supervised fine-tuning (SFT) ablation at the 9B-parameter scale improves zero-shot performance but does not fully close the multimodal in-context learning gap; the authors also release an open-weight reference model, NVIDIA Ising Calibration 1, which reaches a 74.7 zero-shot average score.

Abstract

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task; yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches a 74.7 zero-shot average score.
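To make the reported numbers concrete, the sketch below shows one plausible way a benchmark like this could aggregate per-sample scores into the per-setting mean scores quoted above (e.g. 72.3 zero-shot). This is a hypothetical illustration: the function name `aggregate_scores`, the tuple schema, the [0, 100] score range, and the macro-average over question types are all assumptions, not QCalEval's actual scoring protocol, which the summary does not specify.

```python
from statistics import mean

def aggregate_scores(results):
    """Hypothetical QCalEval-style aggregation (not the paper's method).

    results: list of (question_type, setting, score) tuples,
             where setting is e.g. "zero_shot" or "icl" and
             score is assumed to lie in [0, 100].
    Returns {setting: {question_type: mean_score, ..., "mean": macro_avg}}.
    """
    by_setting = {}
    for qtype, setting, score in results:
        by_setting.setdefault(setting, {}).setdefault(qtype, []).append(score)

    summary = {}
    for setting, per_type in by_setting.items():
        # Average within each question type first...
        type_means = {q: mean(scores) for q, scores in per_type.items()}
        # ...then macro-average across question types (an assumption;
        # the benchmark might instead micro-average over all samples).
        summary[setting] = {**type_means, "mean": mean(type_means.values())}
    return summary

# Toy example with two of the (hypothetical) six question types:
results = [
    ("anomaly_detection", "zero_shot", 70.0),
    ("anomaly_detection", "icl", 80.0),
    ("parameter_readout", "zero_shot", 74.0),
    ("parameter_readout", "icl", 78.0),
]
summary = aggregate_scores(results)
# summary["zero_shot"]["mean"] → 72.0; summary["icl"]["mean"] → 79.0
```

Comparing the two settings' means in this way is what would reveal the pattern the paper reports: some models' ICL mean falling below their zero-shot mean, others' rising above it.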