Reinforcing Consistency in Video MLLMs with Structured Rewards

arXiv cs.CV / 4/3/2026


Key Points

  • The paper identifies a key failure mode in video multimodal large language models (MLLMs): outputs can sound globally plausible while lacking faithful visual and temporal grounding (e.g., hallucinated objects, wrong attributes, or collapsed repeated events).
  • It introduces a compositional consistency audit that decomposes captions into factual and temporal claims to check whether correct high-level answers are backed by valid lower-level evidence, finding that even correct root relational claims often lack reliable attribute/existence support.
  • It argues that sentence-level supervision and sentence-level RL rewards are too coarse to localize specific grounding failures that matter for faithful video understanding.
  • The authors propose a structured reinforcement learning reward composed of (1) an instance-aware scene-graph factual reward, (2) a temporal reward for event ordering and repetition, and (3) a video-grounded VQA reward for hierarchical self-verification.
  • Experiments on temporal, general video understanding, and hallucination-focused benchmarks show consistent improvements across open-source MLLM backbones, supporting structured reward shaping as a practical path to more faithful video reasoning.
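The three-part objective above can be sketched as a weighted combination of per-unit scores. The function, weights, and score ranges below are illustrative assumptions for intuition, not the paper's exact formulation:

```python
# Hypothetical sketch: combine three unit-level reward streams into one
# scalar RL reward. Weights and scoring conventions are assumptions,
# not the paper's implementation.

def structured_reward(factual_scores, temporal_scores, vqa_scores,
                      w_fact=0.4, w_temp=0.3, w_vqa=0.3):
    """Combine per-unit scores (each assumed in [0, 1]) into one scalar."""
    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0
    return (w_fact * mean(factual_scores)     # scene-graph objects/attributes/relations
            + w_temp * mean(temporal_scores)  # event ordering and repetition
            + w_vqa * mean(vqa_scores))       # hierarchical self-verification QA

# Example: mostly correct facts, one temporal error, verified QA.
r = structured_reward([1, 1, 0.5], [1, 0], [1, 1])
```

Because each component averages over its own factual or temporal units, a single hallucinated attribute or collapsed repetition lowers the reward locally, rather than being washed out in one sentence-level score.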

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.
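The top-down audit described in the abstract can be pictured as a recursive check over a claim tree: a root relational claim is credited only when every lower-level existence and attribute claim supporting it also verifies. The data structures and example below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of the compositional consistency audit: a high-level
# claim passes only if it and all of its supporting sub-claims verify
# against the video. Structures are assumptions for illustration.

from dataclasses import dataclass, field

@dataclass
class Claim:
    text: str
    verified: bool                                 # did this claim check out against the video?
    children: list = field(default_factory=list)   # supporting lower-level claims

def consistently_supported(claim: Claim) -> bool:
    """A claim is audited as supported only if it and all descendants verify."""
    return claim.verified and all(consistently_supported(c) for c in claim.children)

# A root relation can be "correct" in isolation yet fail the audit when its
# attribute/existence support is hallucinated:
root = Claim("the person hands the red cup to the child", True, [
    Claim("a cup exists", True),
    Claim("the cup is red", False),   # hallucinated attribute
])
# consistently_supported(root) -> False
```

This mirrors the paper's finding: sentence-level correctness of the root claim is a weak proxy, because the audit only counts it when the evidence chain beneath it holds.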