INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

arXiv cs.AI / 4/15/2026


Key Points

  • INDOTABVQA is introduced as a new benchmark for cross-lingual table visual question answering on real Bahasa Indonesia document images, paired with QA sets in four languages (Bahasa Indonesia, English, Hindi, Arabic).
  • The dataset includes 1,593 document images spanning three visual styles and varying table complexity, enabling evaluation in both monolingual and cross-lingual VQA settings.
  • Benchmarking shows substantial performance gaps for leading VLMs (including Qwen2.5-VL, Gemma-3, LLaMA-3.2, and GPT-4o), especially on structurally complex tables and in low-resource languages.
  • Targeted fine-tuning improves accuracy by 11.6% (fine-tuning a compact 3B model) and 17.8% (LoRA fine-tuning a 7B model), indicating that domain-specific training can meaningfully boost results.
  • Adding explicit table region coordinates as extra input yields an additional 4–7% improvement, highlighting the benefit of spatial priors for structure-aware table reasoning.
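The last point, supplying table region coordinates alongside the image, amounts to serializing a spatial prior into the model's text input. The sketch below is a hypothetical illustration only; the paper's actual input format is not specified here, and the `<box>` tag notation and the `build_prompt` helper are assumptions.

```python
def build_prompt(question: str, table_boxes: list[tuple[int, int, int, int]]) -> str:
    """Serialize table bounding boxes as a spatial prior prepended to the question.

    Each box is (x1, y1, x2, y2) in image pixel coordinates. The exact format
    used by INDOTABVQA is an assumption; the idea is simply to make table
    locations explicit so the VLM can condition on them.
    """
    region_lines = [
        f"Table {i + 1}: <box>({x1},{y1}),({x2},{y2})</box>"
        for i, (x1, y1, x2, y2) in enumerate(table_boxes)
    ]
    return (
        "The document image contains the following table regions:\n"
        + "\n".join(region_lines)
        + f"\n\nQuestion: {question}"
    )


# Example: one table region plus a Bahasa Indonesia question.
prompt = build_prompt("Berapa total pendapatan tahun 2023?", [(40, 120, 980, 640)])
```

In practice this string would accompany the document image in the VLM's chat template; the coordinates let the model focus its reading on the table region rather than the whole page.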

Abstract

We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa Indonesia documents with Bahasa Indonesia questions) and cross-lingual settings (Bahasa Indonesia documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o, revealing substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and LoRA-fine-tuning a 7B model on our dataset yield accuracy improvements of 11.6% and 17.8%, respectively. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially for underrepresented regions of the world. The full dataset is available on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA
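Since the benchmark reports accuracy broken down by question language, the cross-lingual gap can be surfaced with a simple per-language scorer. The sketch below assumes a normalized exact-match metric and the field names `lang`, `prediction`, and `answer`; the paper's official scoring script may differ.

```python
from collections import defaultdict


def per_language_accuracy(examples):
    """Compute exact-match accuracy grouped by question language.

    `examples` is an iterable of dicts with keys "lang", "prediction", and
    "answer". Exact match after lowercasing and whitespace-stripping is an
    assumed metric, not necessarily the benchmark's official scorer.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["lang"]] += 1
        if ex["prediction"].strip().lower() == ex["answer"].strip().lower():
            correct[ex["lang"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy predictions: two Bahasa Indonesia questions, one Hindi question.
scores = per_language_accuracy([
    {"lang": "id", "prediction": "Rp 5.000", "answer": "rp 5.000"},
    {"lang": "id", "prediction": "tiga", "answer": "empat"},
    {"lang": "hi", "prediction": "चार", "answer": "चार"},
])
# → {"id": 0.5, "hi": 1.0}
```

Comparing the `id` entry (monolingual) against the other language entries on real model outputs gives the cross-lingual gap the paper highlights.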