Do Multilingual VLMs Reason Equally? A Cross-Lingual Visual Reasoning Audit for Indian Languages

arXiv cs.CL / 3/31/2026


Key Points

  • The paper introduces what it claims is the first cross-lingual visual reasoning audit for six Indian languages, using 980 translated questions drawn from MathVista, ScienceQA, and MMMU.
  • Using IndicTrans2 for translation and Gemini 2.0 Flash for cross-verification on 50-sample sets per language, the authors report solid inter-translator agreement (0.79–0.84) before evaluating eight vision-language models across seven languages.
  • Results show a substantial accuracy drop of 9.8–25 percentage points when moving from English to Indian languages, with Dravidian languages experiencing up to 13.2 pp more drop than Indo-Aryan languages.
  • Chain-of-thought prompting harms performance for Bengali (−14.4 pp) and Kannada (−11.4 pp) rather than improving it, suggesting that many “reasoning chains” are English-centric.
  • Even a multilingual VLM (Aya-Vision-8B) still shows a large drop (28.5 pp) on Dravidian scripts, and the authors release the benchmark plus all model outputs.

Abstract

Vision-language models score well on mathematical, scientific, and spatial reasoning benchmarks, yet these evaluations are overwhelmingly English. I present the first cross-lingual visual reasoning audit for Indian languages. A set of 980 questions from MathVista, ScienceQA, and MMMU is translated into Hindi, Tamil, Telugu, Bengali, Kannada, and Marathi using IndicTrans2, with Gemini 2.0 Flash cross-verification on 50 samples per language (inter-translator agreement 0.79–0.84). Eight VLMs, from 7B open-source models to GPT-4o, are evaluated across all seven languages, yielding 68,600 inference records that include text-only and chain-of-thought ablations. I find accuracy drops of 9.8–25 percentage points when switching from English to an Indian language, with Dravidian languages suffering up to 13.2 pp more than Indo-Aryan. Chain-of-thought prompting degrades Bengali (−14.4 pp) and Kannada (−11.4 pp) rather than helping, exposing English-centric reasoning chains. Aya-Vision-8B, built for 23 languages, still drops 28.5 pp on Dravidian scripts; multilingual pretraining alone does not transfer visual reasoning. I release the translated benchmark and all model outputs.
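The headline numbers above are per-language accuracy deltas against the English baseline, reported in percentage points. A minimal sketch of that computation, assuming a hypothetical record format with `language` and `correct` fields (the released outputs may be structured differently):

```python
# Sketch: per-language accuracy drop (in percentage points) vs. an
# English baseline. The record schema here is an assumption, not the
# paper's actual release format.
from collections import defaultdict

def accuracy_drops(records, baseline="English"):
    """records: iterable of dicts with 'language' and 'correct' (0/1) keys."""
    totals = defaultdict(lambda: [0, 0])  # language -> [correct, total]
    for r in records:
        t = totals[r["language"]]
        t[0] += r["correct"]
        t[1] += 1
    acc = {lang: c / n for lang, (c, n) in totals.items()}
    base = acc[baseline]
    # Positive value = drop relative to the English baseline, in pp.
    return {lang: round(100 * (base - a), 1)
            for lang, a in acc.items() if lang != baseline}

records = (
    [{"language": "English", "correct": c} for c in [1, 1, 1, 0]]    # 75%
    + [{"language": "Kannada", "correct": c} for c in [1, 0, 0, 0]]  # 25%
)
print(accuracy_drops(records))  # {'Kannada': 50.0}
```

The same aggregation, run once per model and prompting condition (image, text-only, chain-of-thought), would reproduce the kind of per-language and per-ablation deltas the audit reports.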