EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

arXiv cs.CV / 3/31/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces EuraGovExam, a new multilingual, multimodal benchmark built from real civil service examinations from five Eurasian regions (South Korea, Japan, Taiwan, India, and the European Union).
The dataset contains 8,000+ high-resolution scanned multiple-choice questions across 17 domains, with all text and visual elements embedded into single images to test layout-aware reasoning.
EuraGovExam differs from prior benchmarks by requiring models to perform cross-lingual, visual-layout reasoning directly from image input rather than relying on separated OCR/text fields.
Evaluation results report that even state-of-the-art vision-language models reach only 86% accuracy, highlighting current limitations in handling culturally realistic and visually complex exam documents.
The benchmark is positioned to support development and evaluation for e-governance and public-sector document analysis, as well as more equitable multilingual exam preparation.

Abstract

We present EuraGovExam, a multilingual and multimodal benchmark sourced from real-world civil service examinations across five representative Eurasian regions: South Korea, Japan, Taiwan, India, and the European Union. Designed to reflect the authentic complexity of public-sector assessments, the dataset contains over 8,000 high-resolution scanned multiple-choice questions covering 17 diverse academic and administrative domains. Unlike existing benchmarks, EuraGovExam embeds all question content--including problem statements, answer choices, and visual elements--within a single image, providing only a minimal standardized instruction for answer formatting. This design demands that models perform layout-aware, cross-lingual reasoning directly from visual input. All items are drawn from real exam documents, preserving rich visual structures such as tables, multilingual typography, and form-like layouts. Evaluation results show that even state-of-the-art vision-language models (VLMs) achieve only 86% accuracy, underscoring the benchmark's difficulty and its power to diagnose the limitations of current models. By emphasizing cultural realism, visual complexity, and linguistic diversity, EuraGovExam establishes a new standard for evaluating VLMs in high-stakes, multilingual, image-grounded settings. It also supports practical applications in e-governance, public-sector document analysis, and equitable exam preparation.

Black Hat Asia

AI Business

[D] How does distributed proof of work computing handle the coordination needs of neural network training?

Reddit r/MachineLearning

Claude Code's Entire Source Code Was Just Leaked via npm Source Maps — Here's What's Inside

Dev.to

BYOK is not just a pricing model: why it changes AI product trust

Dev.to

AI Citation Registries and Identity Persistence Across Records

Dev.to

EuraGovExam: A Multilingual Multimodal Benchmark from Real-World Civil Service Exams

Key Points

Abstract

Related Articles

Black Hat Asia

[D] How does distributed proof of work computing handle the coordination needs of neural network training?

Claude Code's Entire Source Code Was Just Leaked via npm Source Maps — Here's What's Inside

BYOK is not just a pricing model: why it changes AI product trust

AI Citation Registries and Identity Persistence Across Records

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer