SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
arXiv cs.CL / 3/17/2026
Key Points
- SEA-Vision is introduced as a new multilingual benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages.
- It includes 15,234 document parsing pages across nine representative document types and 7,496 TEC-VQA question-answer pairs to probe recognition, calculation, reasoning, and spatial understanding.
- The authors use a hybrid labeling pipeline that combines automated filtering and MLLM-assisted labeling with lightweight native-speaker verification to reduce manual labeling while maintaining quality.
- The study highlights pronounced performance degradation on low-resource Southeast Asian languages, underscoring substantial gaps in multilingual document and scene text understanding.
- The authors position SEA-Vision as a challenging benchmark meant to guide future model development and advance document and scene text understanding beyond high-resource languages.
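
The hybrid labeling pipeline described in the key points (automated filtering, then MLLM-assisted labeling, then lightweight native-speaker verification) can be pictured roughly as the three-stage loop below. This is only an illustrative sketch, not the authors' implementation: the `Sample` class, the `hybrid_label` function, and the three callables `keep`, `propose_label`, and `native_check` are hypothetical names introduced here, and the summary does not say how rejected samples are handled (this sketch simply drops them).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Sample:
    """Hypothetical record for one image awaiting annotation."""
    image_id: str
    language: str
    label: Optional[str] = None  # filled in by the model-assisted step
    verified: bool = False       # set by the native-speaker check


def hybrid_label(
    samples: Iterable[Sample],
    keep: Callable[[Sample], bool],
    propose_label: Callable[[Sample], str],
    native_check: Callable[[Sample], bool],
) -> List[Sample]:
    """Filter, auto-label, then verify; samples that fail a stage are dropped.

    All three callables are placeholders (assumptions): `keep` stands in for
    automated filtering, `propose_label` for an MLLM call, and `native_check`
    for a quick accept/reject decision by a native speaker.
    """
    accepted: List[Sample] = []
    for sample in samples:
        if not keep(sample):                     # stage 1: automated filtering
            continue
        sample.label = propose_label(sample)     # stage 2: model-proposed label
        sample.verified = native_check(sample)   # stage 3: human verification
        if sample.verified:
            accepted.append(sample)
    return accepted
```

The point of this arrangement is that the human reviewer only accepts or rejects a proposed label instead of writing one from scratch, which is one plausible way manual effort is reduced while quality is still checked.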




