HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing

arXiv cs.CV / 4/13/2026


Key Points

  • The paper introduces HM-Bench, described as the first benchmark tailored specifically to evaluate multimodal large language models (MLLMs) on hyperspectral remote sensing tasks.
  • HM-Bench contains 19,337 question–answer pairs across 13 categories, spanning from basic perception to more complex spectral reasoning.
  • Because many existing MLLMs cannot ingest raw hyperspectral cubes directly, the authors propose a dual-modality evaluation framework using PCA-based composite images and structured textual reports derived from the HSI.
  • Experiments across 18 representative MLLMs show that models struggle substantially with complex spatial–spectral reasoning, indicating current models remain weak in this specialized domain.
  • Results also show visual inputs generally outperform textual inputs, emphasizing the need for grounding in spectral–spatial evidence for effective HSI understanding.
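The PCA-based composite representation mentioned above can be sketched as follows. This is a minimal illustration (not HM-Bench's actual preprocessing, which may differ): it flattens a hyperspectral cube of shape H×W×B into a pixel matrix, projects it onto the top three principal components, and rescales the result into a displayable 3-channel image.

```python
import numpy as np

def pca_composite(cube: np.ndarray) -> np.ndarray:
    """Reduce a hyperspectral cube (H, W, B) to a 3-channel PCA composite.

    Illustrative sketch only; the benchmark's exact pipeline is not specified here.
    """
    h, w, b = cube.shape
    x = cube.reshape(-1, b).astype(np.float64)
    x -= x.mean(axis=0)                          # center each spectral band
    # Principal axes via SVD of the centered pixel-by-band matrix
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    comp = x @ vt[:3].T                          # project onto top 3 components
    # Min-max scale each channel to [0, 1] for display
    lo, hi = comp.min(axis=0), comp.max(axis=0)
    comp = (comp - lo) / (hi - lo + 1e-12)
    return comp.reshape(h, w, 3)

# Toy example: random 8x8 cube with 32 spectral bands
rgb = pca_composite(np.random.rand(8, 8, 32))
print(rgb.shape)  # (8, 8, 3)
```

Such a composite lets an RGB-only MLLM "see" the dominant spectral variation, at the cost of discarding the finer per-band detail that the paper's textual reports aim to retain.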

Abstract

While multimodal large language models (MLLMs) have made significant strides in natural image understanding, their ability to perceive and reason over hyperspectral imagery (HSI), a vital modality in remote sensing, remains underexplored. The high dimensionality and intricate spectral-spatial properties of HSI pose unique challenges for models primarily trained on RGB data. To address this gap, we introduce the Hyperspectral Multimodal Benchmark (HM-Bench), the first benchmark designed specifically to evaluate MLLMs on HSI understanding. We curate a large-scale dataset of 19,337 question-answer pairs across 13 task categories, ranging from basic perception to spectral reasoning. Given that existing MLLMs are not equipped to process raw hyperspectral cubes natively, we propose a dual-modality evaluation framework that transforms HSI data into two complementary representations: PCA-based composite images and structured textual reports. This approach facilitates a systematic comparison of how different representations affect model performance. Extensive evaluations of 18 representative MLLMs reveal significant difficulties in handling complex spatial-spectral reasoning tasks. Furthermore, our results demonstrate that visual inputs generally outperform textual inputs, highlighting the importance of grounding in spectral-spatial evidence for effective HSI understanding. Dataset and appendix can be accessed at https://github.com/HuoRiLi-Yu/HM-Bench.