MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

arXiv cs.CV / 5/6/2026


Key Points

  • The paper introduces MHPR, a new benchmark designed to evaluate large vision-language models (LVLMs) on multidimensional, human-centric perception-and-reasoning tasks across single-person, multi-person, and human–object interaction scenarios.
  • MHPR provides a multi-stage dataset framework (C-RD, SFT-D, RL-D, and T-D) plus an automated caption/VQA generation pipeline (ACVG) that uses attribute decomposition, rewriting, and multi-model voting to produce scalable, high-quality annotations (a minimal sketch of the voting step appears after this list).
  • Experiments assess state-of-the-art LVLMs on both fine-grained attributes (e.g., appearance, clothing, pose, parts) and higher-level semantics (e.g., social/action/spatial relations and intent/functionality).
  • Results indicate that format-aligned supervised fine-tuning data improves instruction following and training stability, while reinforcement learning data focused on “bad cases” further boosts performance on difficult examples.
  • Training Qwen2.5-VL-7B with MHPR delivers substantial gains, reaching near-parity with much larger models, and the authors release ACVG and MHPR to support reproducible research.
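
The multi-model voting step in ACVG is described only at a high level, but the underlying pattern is easy to sketch: several judge models check each generated caption/QA pair, and a pair is kept only when enough of them agree. The judge interface, stand-in judges, and agreement threshold below are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of multi-model voting for generated VQA annotations.
# The judge functions, their interface, and the threshold are assumptions;
# in ACVG the judges would presumably be separate LVLM calls.
from collections import Counter
from typing import Callable, List

Judge = Callable[[str, str, str], str]  # (caption, question, answer) -> "yes" / "no"

def vote_keep(caption: str, question: str, answer: str,
              judges: List[Judge], min_agree: int = 2) -> bool:
    """Keep the QA pair only if at least `min_agree` judges accept it."""
    votes = Counter(judge(caption, question, answer) for judge in judges)
    return votes["yes"] >= min_agree

# Trivial stand-in judges, purely for demonstration.
judges = [
    lambda c, q, a: "yes" if a.lower() in c.lower() else "no",   # answer grounded in caption
    lambda c, q, a: "yes",                                        # permissive judge
    lambda c, q, a: "yes" if len(a.split()) < 10 else "no",       # concise-answer check
]

print(vote_keep("A man in a red jacket waves.", "What color is the jacket?", "red", judges))
```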

Abstract

Multidimensional human understanding is essential for real-world applications such as film analysis and virtual digital humans, yet current LVLM benchmarks largely focus on single-task settings and lack fine-grained, human-centric evaluation. In this work, we introduce MHPR, a comprehensive benchmark for joint perception-reasoning over human-centric scenes spanning individual, multi-person, and human-object interaction dimensions. MHPR comprises a multi-level data design, consisting of Captioned Raw Data (C-RD), Supervised Fine-Tuning Data (SFT-D), Reinforcement Learning Data (RL-D), and Test Data (T-D), together with an automated caption/VQA generation pipeline (ACVG) that performs category-wise attribute decomposition, attribute-specific rewriting, and multi-model voting to ensure high-quality, scalable annotations. We evaluate state-of-the-art vision-language models on fine-grained attributes (appearance, clothing, pose, parts) and high-level semantics (social relations, action semantics, spatial relations, intent and functionality). Our findings show that: 1) format-aligned SFT data substantially improves instruction following and stability; 2) challenge-focused RL data derived from bad-case analysis further enhances perception and reasoning on difficult instances; and 3) training Qwen2.5-VL-7B with MHPR yields significant gains, achieving near-parity with considerably larger models. We release ACVG and MHPR to facilitate reproducible, extensible research on human-centric perception and reasoning.
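
The first finding, that format-aligned SFT data improves instruction following and stability, essentially means every training answer follows the same output template the evaluation expects. Below is a hypothetical record in that spirit; the field names, answer tags, and file path are invented for illustration and are not the paper's actual schema.

```python
# Hypothetical "format-aligned" SFT record: every answer is wrapped in the
# same template the benchmark scores against. All field names and values
# here are placeholders, not the MHPR data format.
import json

sft_record = {
    "image": "example_000123.jpg",            # placeholder image path
    "dimension": "human-object interaction",  # one of the benchmark's scenario types
    "question": "What is the person on the left doing with the bicycle?",
    "answer": "<answer>The person is locking the bicycle to a rack.</answer>",
}

print(json.dumps(sft_record, indent=2))
```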