Multimodal QUD: Inquisitive Questions from Scientific Figures

arXiv cs.CL / 4/28/2026


Key Points

  • The paper introduces the task of generating deeper, inquisitive questions about scientific figures, grounded in both the figure and the surrounding paper context rather than in text-only cues.
  • It extends the Questions Under Discussion (QUD) framework from text-only discourse to multimodal discourse, modeling how implicit questions are raised and resolved as reading progresses.
  • The authors release MQUD, a dataset of research papers in which such implicit questions are made explicit and annotated by the original authors (see the illustrative record sketch after this list).
  • Experiments show that fine-tuning a vision-language model (VLM) on MQUD improves its ability to produce content-specific, visually grounded multimodal questions that require higher-level reasoning.
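
To make the dataset's role concrete, here is a minimal sketch of what a single MQUD-style training record could look like. The field names and values are assumptions for illustration only, not the released dataset's actual schema.

```python
# Hypothetical MQUD-style record; all field names and values are illustrative
# assumptions, not the dataset's actual schema.
example = {
    "paper_id": "example-paper-001",   # identifier for the source paper (assumed)
    "figure_image": "figure_3.png",    # the scientific figure the question is grounded in
    "figure_caption": "Accuracy vs. training set size across model scales.",
    "context": "Section 4.2 discusses how performance saturates ...",  # surrounding paper text
    "inquisitive_question": (
        "Why does the smallest model plateau earlier than the larger ones "
        "despite seeing the same data?"
    ),  # implicit question made explicit by the paper's authors
}
```

Under this framing, a fine-tuned VLM would learn to map the figure image plus the surrounding context passage to the author-annotated question.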

Abstract

Asking inquisitive questions while reading, and looking for their answers, is an important part of human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate the capabilities of Vision-Language Models (VLMs), current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, are conditioned on both the figure and the paper's context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD) from text-only to multimodal discourse, where implicit questions are raised and resolved as discourse progresses. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.
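
As a rough illustration of the conditioning setup the abstract describes (a figure plus surrounding paper text as joint input to a VLM), the sketch below prompts an off-the-shelf vision-language model to produce an inquisitive, QUD-style question. This is not the paper's fine-tuned model or prompt; the model ID, image path, and prompt wording are assumptions, and a model fine-tuned on MQUD would presumably produce far more grounded questions.

```python
# Minimal sketch: zero-shot prompting of a generic VLM to ask an inquisitive,
# context-aware question about a figure. Model ID, image path, and prompt
# wording are assumptions for illustration only.
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # any instruction-tuned VLM would do
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

figure = Image.open("figure_3.png")     # hypothetical figure from a paper
context = (
    "Surrounding paper text: Section 4.2 reports that accuracy saturates "
    "for the smallest model well before the larger ones."
)
prompt = (
    "USER: <image>\n"
    f"{context}\n"
    "Ask one deep, inquisitive question that a curious reader would raise "
    "here, requiring reasoning over both the figure and the text. "
    "ASSISTANT:"
)

inputs = processor(text=prompt, images=figure, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```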