QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding
arXiv cs.CV / 4/29/2026
Key Points
- The paper introduces QCalEval, the first benchmark for evaluating how well vision-language models (VLMs) understand quantum calibration plots, comprising 243 samples across 87 scenario types and 22 experimental families.
- It covers superconducting qubits and neutral atoms and tests six question types under both zero-shot and in-context learning settings (a prompt-construction sketch follows the key points).
- Results show that the best general-purpose model reaches a mean zero-shot score of 72.3, while many open-weight models score lower with multi-image in-context learning prompts than they do zero-shot.
- Frontier closed models improve much more in the multi-image in-context learning setting, indicating a meaningful capability gap versus many open-weight systems.
- Supervised fine-tuning (SFT) at the 9B-parameter scale improves zero-shot performance but does not fully close the multimodal in-context learning gap; the authors also release an open-weight reference model, NVIDIA Ising Calibration 1, which reaches a 74.7 zero-shot average score.
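
To make the benchmark structure and the two evaluation settings concrete, here is a minimal sketch of how a QCalEval-style item and the zero-shot versus multi-image in-context prompts could be represented. The field names, the sample schema, and the prompt wording are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical record for one benchmark item; the real QCalEval schema may differ.
@dataclass
class CalibrationSample:
    plot_path: str          # path to the calibration plot image
    platform: str           # e.g. "superconducting" or "neutral_atom"
    scenario_type: str      # one of the 87 scenario types
    experiment_family: str  # one of the 22 experimental families
    question_type: str      # one of the six question types
    question: str
    reference_answer: str

def build_zero_shot_prompt(sample: CalibrationSample) -> Dict:
    """Zero-shot setting: the model sees only the query plot and its question."""
    return {
        "images": [sample.plot_path],
        "text": f"{sample.question}\nAnswer concisely.",
    }

def build_icl_prompt(sample: CalibrationSample,
                     exemplars: List[CalibrationSample]) -> Dict:
    """Multi-image in-context setting: worked exemplars (plot + question + answer)
    precede the query, so the model must reason over several plots in one context."""
    images, lines = [], []
    for ex in exemplars:
        images.append(ex.plot_path)
        lines.append(f"Example question: {ex.question}\n"
                     f"Example answer: {ex.reference_answer}")
    images.append(sample.plot_path)
    lines.append(f"Now answer: {sample.question}")
    return {"images": images, "text": "\n\n".join(lines)}
```

The observed gap between the two settings would then come down to how well a model handles the longer, multi-image prompt returned by `build_icl_prompt` relative to the single-image zero-shot prompt.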