K-MetBench: A Multi-Dimensional Benchmark for Fine-Grained Evaluation of Expert Reasoning, Locality, and Multimodality in Meteorology

arXiv cs.CL / April 28, 2026

📰 News · Models & Research

Key Points

  • K-MetBench is introduced as a multidimensional, expert-level benchmark for evaluating multimodal large language model assistants used by Korean meteorology forecasters.
  • The benchmark is grounded in authoritative materials (national qualification exams) and assesses four dimensions: chart visual reasoning, logical validity with expert-verified rationales, Korean geo-cultural understanding, and fine-grained domain analysis.
  • Testing 55 models finds two major weaknesses: a modality gap in interpreting specialized meteorological diagrams and a reasoning gap where models can predict correctly while still hallucinating or producing illogical explanations.
  • Results show that Korean models outperform significantly larger global models in local contexts, indicating that parameter scaling alone does not fix cultural or locality-dependent understanding.
  • The authors provide the dataset publicly on Hugging Face and position K-MetBench as a guide for building reliable, culturally aware expert AI agents in meteorology.

Abstract

The development of practical (multimodal) large language model assistants for Korean weather forecasters is hindered by the absence of a multidimensional, expert-level evaluation framework grounded in authoritative sources. To address this, we introduce K-MetBench, a diagnostic benchmark grounded in national qualification exams. It exposes critical gaps across four dimensions: expert visual reasoning over charts, logical validity via expert-verified rationales, Korean-specific geo-cultural comprehension, and fine-grained domain analysis. Our evaluation of 55 models reveals a profound modality gap in interpreting specialized diagrams and a reasoning gap in which models hallucinate logic despite making correct predictions. Crucially, Korean models outperform significantly larger global models in local contexts, demonstrating that parameter scaling alone cannot resolve cultural dependencies. K-MetBench serves as a roadmap for developing reliable, culturally aware expert AI agents. The dataset is available at https://huggingface.co/datasets/soyeonbot/K-MetBench.