EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

arXiv cs.CV / 3/31/2026

Key Points

  • The paper introduces EuraGovExam, a new multilingual, multimodal benchmark built from real civil service examinations from five Eurasian regions (South Korea, Japan, Taiwan, India, and the European Union).
  • The dataset contains 8,000+ high-resolution scanned multiple-choice questions across 17 domains, with all text and visual elements embedded into single images to test layout-aware reasoning.
  • EuraGovExam differs from prior benchmarks by requiring models to perform cross-lingual, layout-aware reasoning directly from the image input, rather than relying on separately supplied OCR output or text fields.
  • The authors report that even state-of-the-art vision-language models (VLMs) reach only 86% accuracy, which they present as evidence of current limitations in handling culturally realistic and visually complex exam documents.
  • The benchmark is positioned to support development and evaluation for e-governance and public-sector document analysis, as well as more equitable multilingual exam preparation.

Abstract

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content--including problem statements, answer choices, and visual elements--within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.
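
The abstract describes an image-only multiple-choice setup scored by accuracy. As a rough illustration of how such scoring could work, here is a minimal Python sketch. The record fields (`region`, `gold`, `response`), the "Answer: <label>" output format, and the helper names are all hypothetical, chosen for the example; the source does not specify the benchmark's actual instruction text or evaluation code.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional

# Assumption: the model is instructed to finish with "Answer: <label>",
# where the label is a letter A-E or a digit 1-5. This format is hypothetical.
ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-E1-5])\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last choice label found in the response, or None."""
    matches = ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def accuracy_by_region(records: Iterable[dict]) -> Dict[str, float]:
    """Compute per-region accuracy.

    Each record is a dict with hypothetical keys:
    'region' (str), 'gold' (correct label), 'response' (raw model output).
    Responses with no parsable answer count as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["region"]] += 1
        if extract_choice(rec["response"]) == rec["gold"].upper():
            correct[rec["region"]] += 1
    return {region: correct[region] / total[region] for region in total}


# Toy records, invented purely to exercise the functions above.
sample = [
    {"region": "Japan", "gold": "B", "response": "The table shows ... Answer: B"},
    {"region": "Japan", "gold": "C", "response": "Answer: (A)"},
    {"region": "India", "gold": "3", "response": "Maybe 2. answer: 3"},
    {"region": "India", "gold": "1", "response": "no idea"},
]
scores = accuracy_by_region(sample)
```

Breaking accuracy down by region, rather than reporting one pooled number, is one way to surface the cross-lingual gaps that the benchmark is meant to diagnose. Counting unparsable responses as wrong is a design choice of this sketch, not something stated in the source.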
