MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation
arXiv cs.CV / 3/26/2026
Key Points
- The paper introduces MMTIT-Bench, a human-verified, multilingual, multi-scenario benchmark for end-to-end text-image machine translation, covering 1,400 images in 14 non-English, non-Chinese languages.
- It targets a gap in evaluating the robustness of vision-language models, especially across diverse visual scenarios (e.g., documents, scenes, web images) and low-resource languages.
- The authors propose CPR-Trans (Cognition-Perception-Reasoning for Translation), a reasoning-oriented data paradigm that unifies scene cognition, text perception, and translation reasoning rather than relying on language-only or cascaded workflows.
- A VLLM-driven data generation pipeline is used to create structured and interpretable supervision that aligns perception signals with translation reasoning.
- Experiments on 3B and 7B VLLMs report consistent improvements in both translation accuracy and interpretability. The authors plan to release the benchmark upon acceptance.
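
To make the CPR-Trans idea of structured supervision more concrete, the following is a minimal Python sketch of what one training record could look like if scene cognition, text perception, and translation reasoning are stored as separate fields and serialized into a single target. The summary does not describe the paper's actual data format, so the class name `CPRSample`, every field name, the tag layout, and the German example are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CPRSample:
    """Hypothetical record for one CPR-style example (not the paper's schema)."""
    image_path: str
    source_lang: str
    target_lang: str
    cognition: str         # what kind of scene the image shows
    perception: List[str]  # text lines read from the image
    reasoning: str         # how the scene context shapes the translation
    translation: str       # final target-language output


def render_target(sample: CPRSample) -> str:
    """Serialize the three stages plus the answer into one training target.

    The tag names are assumed for illustration only.
    """
    perceived = "\n".join(sample.perception)
    return (
        f"<cognition>{sample.cognition}</cognition>\n"
        f"<perception>{perceived}</perception>\n"
        f"<reasoning>{sample.reasoning}</reasoning>\n"
        f"<translation>{sample.translation}</translation>"
    )


if __name__ == "__main__":
    # Illustrative example; not taken from the benchmark.
    example = CPRSample(
        image_path="sign.jpg",
        source_lang="de",
        target_lang="en",
        cognition="A warning sign mounted on a garage door.",
        perception=["Ausfahrt freihalten"],
        reasoning="On a garage door this is an instruction to drivers, "
                  "so an imperative phrasing fits.",
        translation="Keep exit clear",
    )
    print(render_target(example))
```

Keeping the stages as separate fields means each one can be inspected on its own, which is one plausible way to obtain the interpretability the summary mentions. A cascaded OCR-then-translate workflow would instead expose only the perceived text and the final output.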