Can MLLMs "Read" What is Missing?

arXiv cs.AI / 4/25/2026


Key Points

  • The paper introduces MMTR-Bench, a new benchmark for testing how well multimodal large language models (MLLMs) can reconstruct masked text from visual inputs without relying on explicit prompts.
  • Unlike typical visual question answering, the task isolates text reconstruction from instruction-following, focusing instead on layout understanding, visual grounding, and knowledge integration.
  • MMTR-Bench includes 2,771 multilingual test samples drawn from real-world domains like documents and webpages, with varying target text lengths.
  • The authors propose a level-aware evaluation protocol to fairly handle the benchmark’s diversity, and experiments show the task is particularly difficult at sentence and paragraph levels.
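The paper does not spell out the level-aware protocol in this summary, but the core idea (score each length level separately rather than pooling all samples into one number) can be sketched. The level names (`"word"`, `"sentence"`, `"paragraph"`) and the token-F1 metric below are illustrative assumptions, not the paper's actual metric:

```python
from collections import Counter

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between a predicted and a reference string.

    Illustrative metric only; the benchmark's real scoring may differ.
    """
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())  # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def level_aware_scores(samples: list[dict]) -> dict[str, float]:
    """Average the per-sample metric within each length level.

    Each sample is a dict with hypothetical keys
    "level", "prediction", and "reference".
    """
    by_level: dict[str, list[float]] = {}
    for s in samples:
        score = token_f1(s["prediction"], s["reference"])
        by_level.setdefault(s["level"], []).append(score)
    return {lvl: sum(v) / len(v) for lvl, v in by_level.items()}

samples = [
    {"level": "word", "prediction": "benchmark",
     "reference": "benchmark"},
    {"level": "sentence", "prediction": "the model reads text",
     "reference": "the model reconstructs masked text"},
]
print(level_aware_scores(samples))
```

Reporting one score per level keeps easy word-level samples from masking failures on the harder sentence- and paragraph-level reconstructions that the experiments highlight.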

Abstract

We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.