KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

arXiv cs.CL · April 16, 2026


Key Points

  • The paper introduces KMMMU, a native Korean multimodal benchmark designed to evaluate understanding under Korean cultural, institutional, and discipline-specific visual conventions rather than English- or translation-based settings.
  • KMMMU includes 3,466 exam-style Korean questions across nine disciplines and nine visual modality categories, plus a Korean-specific 300-item subset and a 627-question hard subset.
  • Experimental results show that the best open-source model achieves only 42.05% accuracy on the full set, while the top proprietary model reaches 52.42% on the hard subset.
  • Performance is uneven across disciplines, with Korean-specific questions exhibiting accuracy gaps of up to 13.43%, indicating persistent weaknesses in understanding localized conventions and standards.
  • Error analysis suggests failures relate to convention-to-label mapping, limited few-shot symbolic induction, localized knowledge recall, and domain-standard comprehension more than to insufficient reasoning depth.

Abstract

We introduce KMMMU, a native Korean benchmark for evaluating multimodal understanding in Korean cultural and institutional settings. KMMMU contains 3,466 questions from exams natively written in Korean, covering nine disciplines and nine visual modality categories, along with a 300-item Korean-specific subset and a hard subset of 627 questions. Unlike translated or English-centric benchmarks, KMMMU targets information-dense problems shaped by local conventions, official standards, and discipline-specific visual formats. Experiments show that the strongest open-source model reaches only 42.05% accuracy on the full set, while the best proprietary model achieves 52.42% on the hard subset. Performance varies across disciplines, with some disciplines emerging as bottlenecks, and Korean-specific questions showing gaps of up to 13.43%. Error analysis suggests that these failures stem less from insufficient reasoning depth than from weak convention-to-label mapping, few-shot symbolic induction, localized knowledge recall, and domain-specific standards understanding. KMMMU provides a testbed for multimodal evaluation beyond English-centric benchmarks and for developing more reliable systems for expert real-world tasks.
