Video-ToC: Video Tree-of-Cue Reasoning

arXiv cs.CV · April 23, 2026

📰 News · Models & Research

Key Points

  • The paper introduces Video-ToC, a new framework designed to improve video understanding by adding stronger reasoning capabilities while reducing hallucinations common in existing Video LLMs.
  • Video-ToC’s method relies on three innovations: tree-guided visual cue localization for fine-grained perception, a reasoning-demand reward mechanism to adapt RL incentives dynamically, and an automated pipeline that builds dedicated datasets for SFT and RL.
  • The authors create two datasets—Video-ToC-SFT-1k for supervised fine-tuning and Video-ToC-RL-2k for reinforcement learning—via automated annotation.
  • Experiments across six video understanding benchmarks and one hallucination benchmark show Video-ToC outperforming both baselines and recent methods.
  • The accompanying code is published on GitHub, enabling others to reproduce and build upon the framework.
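The reasoning-demand reward is the most distinctive of the three components: rather than rewarding long chains of thought uniformly, the incentive is scaled by how much reasoning the question actually requires. The paper does not publish the exact formula in this summary, so the sketch below is a hypothetical illustration of the idea, with an invented `toc_reward` function and a demand score assumed to lie in [0, 1]:

```python
def toc_reward(correct: bool, used_reasoning: bool, demand: float) -> float:
    """Hypothetical sketch of a reasoning-demand-aware RL reward.

    correct        -- whether the model's final answer matched the reference
    used_reasoning -- whether the model emitted an explicit reasoning trace
    demand         -- estimated reasoning demand of the question, in [0, 1]
                      (the estimation method itself is not shown here)
    """
    # Base reward for answer correctness.
    r = 1.0 if correct else 0.0
    # Demand-scaled bonus: reward reasoning when demand is high,
    # and reward skipping it (avoiding overthinking) when demand is low.
    bonus = 0.5 * (demand if used_reasoning else (1.0 - demand))
    return r + bonus
```

Under this toy scheme, a correct answer with reasoning on a high-demand question earns the maximum reward, while spending reasoning tokens on a trivial question earns less than answering it directly.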

Abstract

Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to reason solely from rationales inherited during pretraining, lacking perception-aware adaptation to the input video content. To address this, we propose **Video-ToC**, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at https://github.com/qizhongtan/Video-ToC.
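The "tree of cue" in the framework's name suggests a hierarchy of visual cues that progressively narrows from coarse events to fine-grained details. The abstract does not specify the data structure, so the following is a minimal, assumed sketch of what such a cue tree might look like, with invented `CueNode` and `localize` names and time spans in seconds:

```python
from dataclasses import dataclass, field

@dataclass
class CueNode:
    """One node in a hypothetical tree of visual cues."""
    cue: str                      # textual description of the visual cue
    span: tuple                   # (start_s, end_s) time window in the video
    children: list = field(default_factory=list)

def localize(node: CueNode, depth: int = 0):
    """Depth-first walk yielding (depth, cue, span) for each node,
    coarse cues first, refined sub-cues beneath them."""
    yield depth, node.cue, node.span
    for child in node.children:
        yield from localize(child, depth + 1)

# Example tree: a coarse event refined into progressively finer cues.
root = CueNode("person enters kitchen", (0.0, 30.0), [
    CueNode("opens fridge", (5.0, 12.0), [
        CueNode("takes milk carton", (8.0, 10.0)),
    ]),
    CueNode("pours a drink", (15.0, 22.0)),
])
cues = list(localize(root))
```

The point of such a structure is that each deeper node restricts attention to a narrower time window, which is one plausible way a model could gain the "fine-grained perceptual capabilities" the abstract describes.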