Can MLLMs "Read" What is Missing?

arXiv cs.AI / 4/25/2026


Key Points

  • The paper introduces MMTR-Bench, a new benchmark for testing how well multimodal large language models (MLLMs) can reconstruct masked text from visual inputs without relying on explicit prompts.
  • Unlike typical visual question answering, the task isolates text reconstruction from instruction-following, focusing instead on layout understanding, visual grounding, and knowledge integration.
  • MMTR-Bench includes 2,771 multilingual test samples drawn from real-world domains like documents and webpages, with varying target text lengths.
  • The authors propose a level-aware evaluation protocol to fairly handle the benchmark’s diversity, and experiments show the task is particularly difficult at sentence and paragraph levels.
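The paper does not spell out the level-aware protocol in this summary, but the core idea (score each length level separately rather than pooling all samples into one number) can be sketched. The level names (`"word"`, `"sentence"`, `"paragraph"`) and the token-F1 metric below are illustrative assumptions, not the paper's actual metric:

```python
from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between a predicted and a reference string.

    Illustrative metric only; the benchmark's real scoring may differ.
    """
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def level_aware_scores(samples: list[dict]) -> dict[str, float]:
    """Average the per-sample metric within each length level.

    Each sample is a dict with hypothetical keys
    "level", "prediction", and "reference".
    """
    by_level: dict[str, list[float]] = {}
    for s in samples:
        score = token_f1(s["prediction"], s["reference"])
        by_level.setdefault(s["level"], []).append(score)
    return {lvl: sum(v) / len(v) for lvl, v in by_level.items()}

samples = [
    {"level": "word", "prediction": "benchmark",
     "reference": "benchmark"},
    {"level": "sentence", "prediction": "the model reads text",
     "reference": "the model reconstructs masked text"},
]
print(level_aware_scores(samples))
```

Reporting one score per level keeps easy word-level samples from masking failures on the harder sentence- and paragraph-level reconstructions that the experiments highlight.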

Abstract

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.