Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding
arXiv cs.CL / 4/7/2026
Key Points
- The paper examines visual document understanding (VDU) in large vision-language models (LVLMs) and argues that benchmark evaluation via generated responses can mask whether the model truly encodes the needed information internally.
- Using linear probing across LLM layers, the authors find a measurable gap between internal representations and final generated responses, indicating incomplete or misaligned information use.
- Results suggest that task-relevant information is often more linearly decodable from intermediate layers than from the final layer, implying that earlier representations may be more directly usable than the final-layer states that drive generation.
- The study tests fine-tuning approaches that target intermediate layers and finds improvements in both linear probing accuracy and response accuracy, while reducing the internal-vs-response gap.
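The layer-wise linear-probing setup described above can be sketched in a few lines: a logistic-regression probe is trained on frozen hidden states from each layer, and its held-out accuracy indicates how linearly the answer is encoded there. The snippet below is an illustrative toy, not the paper's code; the synthetic "intermediate" and "final" features (and the assumption that the intermediate layer carries a stronger linear signal) are stand-ins for real LVLM activations.

```python
import numpy as np

def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit a binary logistic-regression probe on frozen features X via gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on weights
        b -= lr * np.mean(p - y)                 # gradient step on bias
    return w, b

def probe_accuracy(X, y, w, b):
    """Held-out accuracy of the linear probe."""
    return float(np.mean(((X @ w + b) > 0) == y))

# Synthetic stand-in for per-layer hidden states (illustrative assumption:
# the intermediate layer encodes the label more linearly than the final layer).
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=400)
signal = y[:, None] * 2.0 - 1.0                       # +/-1 label signal
mid_layer = signal * 1.5 + rng.normal(size=(400, 16))   # strong linear signal
final_layer = signal * 0.1 + rng.normal(size=(400, 16)) # weak linear signal

for name, X in [("intermediate", mid_layer), ("final", final_layer)]:
    w, b = train_linear_probe(X[:300], y[:300].astype(float))
    print(f"{name} layer probe accuracy: {probe_accuracy(X[300:], y[300:], w, b):.2f}")
```

Running this, the intermediate-layer probe should score markedly higher than the final-layer probe, mirroring (in miniature) the internal-vs-response gap the paper measures.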
Related Articles

Inside Anthropic's Project Glasswing: The AI Model That Found Zero-Days in Every Major OS
Dev.to
Gemma 4 26B fabricated an entire code audit. I have the forensic evidence from the database.
Reddit r/LocalLLaMA

How AI Humanizers Improve Sentence Structure and Style
Dev.to
Two Kinds of Agent Trust (and Why You Need Both)
Dev.to
Agent Diary: Apr 10, 2026 - The Day I Became a Workflow Ouroboros (While Run 236 Writes About Writing About Writing)
Dev.to