[R] VLMs Behavior for Long Video Understanding

Reddit r/MachineLearning / 3/31/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The post compares existing long video understanding datasets (e.g., Video-MME, MLVU, VideoBench, LongVideoBench) and notes they mainly test category-focused tasks like ordering, counting, and basic reasoning.
  • The author designs new long-video questions to emphasize multi-step reasoning and finds VLMs fail when answers require free-form generation without options.
  • However, when the questions are reframed as multiple-choice with four options, the same VLMs achieve 100% accuracy.
  • The central question raised is why VLM behavior changes so dramatically between open-ended (ground-truth-only) and multiple-choice settings for long-video understanding.

I have extensively searched long video understanding datasets such as Video-MME, MLVU, VideoBench, and LongVideoBench. What I have seen is that these datasets cover different categories such as dramas, films, TV shows, and documentaries, and focus on tasks like ordering, counting, and reasoning.

I feel that multi-step reasoning is less explored, so here is what I did: I designed questions with no options, only a ground-truth answer, and asked the VLM to answer them directly, but the VLMs were unable to give the answer. Yet when I provided four options, the VLM achieved 100% accuracy.
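The contrast above may partly come down to how the two settings are prompted and scored. A minimal sketch of the two evaluation modes, with hypothetical helper names (`open_ended_prompt`, `mcq_prompt`, and the scorers are illustrative, not from the post):

```python
# Sketch of open-ended vs. multiple-choice evaluation of a long-video question.
# The VLM call itself is omitted; only prompt construction and scoring are shown.

def open_ended_prompt(question: str) -> str:
    # Ground-truth-only setting: the model must generate the answer freely.
    return f"{question}\nAnswer with a short free-form response."

def mcq_prompt(question: str, options: list[str]) -> str:
    # Multiple-choice setting: the model only has to discriminate among options.
    letters = "ABCD"
    lines = [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    return f"{question}\n" + "\n".join(lines) + "\nAnswer with a single letter."

def score_open_ended(response: str, ground_truth: str) -> bool:
    # Substring matching is brittle: a correct but differently worded answer
    # is marked wrong, which can deflate open-ended accuracy on its own.
    return ground_truth.lower() in response.lower()

def score_mcq(response: str, correct_letter: str) -> bool:
    # Letter matching is near-unambiguous, so MCQ scores look much cleaner.
    return response.strip().upper().startswith(correct_letter.upper())

q = "After the chef burns the first dish, what does he prepare next?"
print(open_ended_prompt(q))
print(mcq_prompt(q, ["Pasta", "Soup", "Steak", "Salad"]))
```

One common explanation is that MCQ turns a hard generation problem into an easier four-way discrimination problem, and lenient letter-matching further inflates the gap relative to strict free-form scoring.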

My question is: why do VLMs behave like this?

submitted by /u/Alternative_Art2984