Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

arXiv cs.CV · April 16, 2026


Key Points

  • The paper argues that existing MLLM benchmarks mostly test pre-existing knowledge or basic perception, while under-evaluating the reasoning skill needed to find decisive visual clues in real daily situations.
  • It introduces DailyClue, a new benchmark focused on visual clue-driven reasoning grounded in authentic daily activities and designed to require more than surface-level recognition.
  • DailyClue’s queries push models to actively select and use relevant visual clues for follow-up reasoning, rather than merely identifying objects or attributes.
  • The benchmark spans four major daily domains with 16 distinct subtasks, and evaluates both standard MLLMs and agentic models to highlight the difficulty of clue-based reasoning.
  • The results and analysis emphasize that accurately identifying visual clues is a key prerequisite for robust reasoning performance.

Abstract

Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs' pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.