SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

arXiv cs.AI / 5/1/2026


Key Points

  • SpecVQA is a new scientific-image benchmark designed to evaluate multimodal large language models’ (MLLMs) spectral understanding using expert-annotated visual question-answer pairs.
  • The benchmark covers seven representative spectrum types and includes 620 figures with 3,100 curated QA pairs drawn from peer-reviewed literature, supporting both information extraction and domain-specific reasoning.
  • The authors propose a spectral data sampling and interpolation reconstruction method to reduce token length while preserving critical curve characteristics, and ablation studies show performance gains.
  • The paper evaluates several leading MLLMs on SpecVQA and provides a leaderboard to compare capabilities in scientific spectral QA.
  • Overall, the work aims to advance spectral understanding in multimodal large models and points to directions for extending vision-language models to broader scientific research and data analysis.

Abstract

Spectra are a prevalent yet highly information-dense form of scientific imagery, posing substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-specific characteristics. Here we introduce SpecVQA, a professional scientific-image benchmark for evaluating multimodal models on spectral understanding, covering seven representative spectrum types with expert-annotated question-answer pairs. The evaluation targets two aspects: scientific QA on spectra and the corresponding underlying tasks. SpecVQA contains 620 figures and 3,100 QA pairs curated from peer-reviewed literature, covering both direct information extraction and domain-specific reasoning. To reduce token length while preserving essential curve characteristics, we propose a spectral data sampling and interpolation reconstruction approach; ablation studies confirm that it yields substantial performance improvements on the benchmark. We evaluate prominent MLLMs on scientific spectral understanding using our benchmark and present a leaderboard. This work is an essential step toward enhancing spectral understanding in multimodal large models and suggests promising directions for extending vision-language models to broader scientific research and data analysis.
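
To make the sampling-and-reconstruction idea concrete, here is a minimal Python sketch of the general technique the abstract describes: downsample a dense spectral curve to a small set of points (the values that would actually be serialized into the model's context) and reconstruct the full curve by interpolation to check fidelity. The function name `sample_and_reconstruct`, the uniform sampling strategy, and the linear interpolant are illustrative assumptions on my part; the paper's exact method may differ (e.g., curvature-aware sampling that concentrates points around peaks).

```python
import numpy as np

def sample_and_reconstruct(x, y, n_samples=64):
    """Downsample a dense spectral curve to n_samples points, then
    reconstruct it on the original grid by linear interpolation.

    Returns the sparse samples (what would be serialized for the
    model) and the mean absolute reconstruction error.
    """
    # Uniformly spaced sample indices over the dense curve
    # (hypothetical choice; the paper may sample adaptively).
    idx = np.linspace(0, len(x) - 1, n_samples).round().astype(int)
    xs, ys = x[idx], y[idx]

    # Rebuild the full-resolution curve from the sparse samples.
    y_rec = np.interp(x, xs, ys)

    # Simple fidelity check: how much curve shape was lost.
    mae = np.mean(np.abs(y - y_rec))
    return xs, ys, mae

# Toy example: a synthetic spectrum with two Gaussian peaks.
x = np.linspace(400, 4000, 3600)          # e.g. IR wavenumbers (cm^-1)
y = (np.exp(-((x - 1700) / 30) ** 2)      # carbonyl-like peak
     + 0.6 * np.exp(-((x - 2900) / 50) ** 2))
xs, ys, mae = sample_and_reconstruct(x, y, n_samples=64)
print(f"kept {len(xs)} of {len(x)} points, MAE = {mae:.4f}")
```

A low reconstruction error at a small `n_samples` suggests the curve's salient features survive the token reduction, which is the trade-off the paper's ablation studies probe.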