Can Multimodal Large Language Models Truly Understand Small Objects?

arXiv cs.CV · April 28, 2026

📰 News · Models & Research

Key Points

  • The paper introduces SOUBench, the first comprehensive benchmark for evaluating the Small Object Understanding (SOU) capability of Multimodal Large Language Models (MLLMs), a capability that has gone largely unexamined so far.
  • The authors create SOU-VQA, an evaluation dataset of 18,204 visual question-answer pairs across six sub-tasks and three major scenarios (Driving, Aerial, and Underwater), enabled by an automatic visual QA generation strategy.
  • Testing 15 state-of-the-art MLLMs shows they perform weakly on small object understanding, suggesting a genuine capability gap rather than a mere lack of benchmark coverage.
  • To address this, the paper releases SOU-Train (11,226 VQA pairs) for multimodal training, and demonstrates that supervised fine-tuning with SOU-Train can improve an MLLM’s small-object understanding.
  • The work provides both benchmark and training resources (plus code) to support further research into building MLLMs with stronger small-object reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have shown promising potential in diverse understanding tasks, e.g., image and video analysis and math and physics olympiads. However, Small Object Understanding (SOU) tasks remain largely unexplored. To fill this gap, we introduce SOUBench, the first comprehensive benchmark for evaluating the small object understanding capability of existing MLLMs. Specifically, we first design an effective, automatic visual question-answer generation strategy and construct a new SOU-VQA evaluation dataset with 18,204 VQA pairs, six relevant sub-tasks, and three dominant scenarios (i.e., Driving, Aerial, and Underwater). We then conduct a comprehensive evaluation of 15 state-of-the-art MLLMs and reveal their weak capabilities in small object understanding. Furthermore, we develop SOU-Train, a multimodal training dataset with 11,226 VQA pairs, to improve the SOU capabilities of MLLMs. Through supervised fine-tuning of the latest MLLM, we demonstrate that SOU-Train effectively enhances its ability to understand small objects. Comprehensive experimental results demonstrate that the proposed SOUBench, along with the SOU-VQA and SOU-Train datasets, provides a crucial empirical foundation for the community to further develop models with enhanced small object understanding capabilities. Datasets and Code: https://github.com/Hanfj-X/SOU.
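For readers unfamiliar with what counts as a "small object," a common convention comes from the COCO detection benchmark, which labels an object small when its area is below 32×32 pixels. The sketch below illustrates that convention only; the paper may define its own criterion, and the function names here are hypothetical, not from the SOU codebase.

```python
# Illustrative sketch of the COCO small-object convention (area < 32*32 px^2).
# This is NOT the paper's definition; it is a widely used reference point.

SMALL_AREA_THRESHOLD = 32 * 32  # COCO's small-object area cutoff, in pixels^2


def is_small_object(width: float, height: float) -> bool:
    """Return True if a bounding box of the given size counts as 'small'."""
    return width * height < SMALL_AREA_THRESHOLD


def filter_small(boxes):
    """Keep only the (width, height) boxes that qualify as small objects."""
    return [box for box in boxes if is_small_object(*box)]


# Example: a 10x20 box (200 px^2) is small; a 100x50 box (5000 px^2) is not.
print(filter_small([(10, 20), (100, 50)]))  # → [(10, 20)]
```

Benchmarks like SOU-VQA matter precisely because boxes under this threshold occupy a tiny fraction of the vision encoder's input, leaving few tokens of visual evidence for the model to reason over.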