
SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

arXiv cs.CL / 3/17/2026


Key Points

  • SEA-Vision is introduced as a multilingual benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages.
  • It includes 15,234 document parsing pages across nine representative document types and 7,496 TEC-VQA question-answer pairs to probe recognition, calculation, reasoning, and spatial understanding.
  • The authors use a hybrid labeling pipeline that combines automated filtering, MLLM-assisted labeling, and lightweight native-speaker verification, reducing manual effort while maintaining quality (see the sketch after this list).
  • The study highlights pronounced performance degradation on low-resource Southeast Asian languages, underscoring substantial gaps in multilingual document and scene text understanding.
  • SEA-Vision is intended to drive global progress in document and scene text understanding by providing a challenging benchmark and guiding future model development.
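
The key points describe the hybrid pipeline only at a high level. The sketch below is a minimal illustration of a pipeline with that shape, assuming hypothetical helpers (`quality_score`, `mllm_label`) and an arbitrary sampling rule for the native-speaker pass; none of these names or thresholds come from the paper.

```python
# A minimal sketch of a hybrid labeling pipeline of the shape described
# above: automated filtering, MLLM-assisted label drafting, and a light
# native-speaker verification pass. All helpers, field names, and
# thresholds here are hypothetical, not taken from SEA-Vision.
from dataclasses import dataclass, field


@dataclass
class Page:
    image_path: str
    language: str
    labels: dict = field(default_factory=dict)


def quality_score(page: Page) -> float:
    """Hypothetical automated scorer (e.g., OCR confidence, blur checks).
    A real implementation would inspect the image; this stub passes all."""
    return 1.0


def mllm_label(page: Page) -> dict:
    """Hypothetical MLLM call that drafts page-, block-, and line-level
    labels. Stubbed out here with an empty annotation."""
    return {"page": {}, "blocks": [], "lines": []}


def label_dataset(pages, score_threshold=0.7, review_fraction=0.1):
    """Filter cheaply, label with an MLLM, and sample a small slice
    for human verification so manual effort stays light."""
    labeled = []
    for page in pages:
        # Step 1: automated filtering drops low-quality pages early.
        if quality_score(page) < score_threshold:
            continue
        # Step 2: the MLLM drafts the hierarchical annotations.
        page.labels = mllm_label(page)
        labeled.append(page)
    # Step 3: route only a fraction to native speakers, enough to
    # catch systematic MLLM errors without full manual review.
    n_review = max(1, int(len(labeled) * review_fraction)) if labeled else 0
    return labeled, labeled[:n_review]
```

A real pipeline would also feed disagreements from the verification pass back into the scorer to tune `score_threshold` per language, which matters most for the low-resource languages where the MLLM is least reliable.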

Abstract

Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
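
The abstract's hierarchical page-, block-, and line-level annotation scheme suggests a nested structure like the one sketched below. The field names (`bbox`, `block_type`, `doc_type`) are illustrative assumptions, since the abstract does not specify a schema.

```python
# A minimal sketch of what hierarchical page-/block-/line-level document
# parsing annotations could look like. Field names are assumptions for
# illustration; SEA-Vision's actual schema is not given in the abstract.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Line:
    text: str                   # transcription of one text line
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Block:
    block_type: str             # e.g. "paragraph", "table", "figure"
    lines: List[Line] = field(default_factory=list)


@dataclass
class PageAnnotation:
    language: str               # one of the 11 Southeast Asian languages
    doc_type: str               # one of the nine document types
    blocks: List[Block] = field(default_factory=list)

    def full_text(self) -> str:
        """Reading-order concatenation of all line transcriptions."""
        return "\n".join(
            line.text for block in self.blocks for line in block.lines
        )
```

A nesting like this lets a benchmark score models at whichever granularity they support: full-page transcription against `full_text()`, layout analysis against block types and boxes, and line-level OCR against individual `Line` entries.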