I have extensively surveyed long video understanding datasets such as Video-MME, MLVU, VideoBench, and LongVideoBench. From what I have seen, these datasets cover categories like dramas, films, TV shows, and documentaries, and focus on tasks such as ordering, counting, and reasoning.
I feel that multi-step reasoning is less explored, so here is what I did: I designed questions with no options, only a ground-truth answer, and asked the VLM to answer directly. The VLMs were unable to give the correct answer. But when I gave the same questions with 4 options, the VLMs achieved 100% accuracy.
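For context, the two evaluation settings I compared look roughly like this (a minimal sketch; `ask_vlm` is a stub standing in for the actual model call, and the scoring rules are simplified illustrations, not from any benchmark toolkit):

```python
# Sketch of open-ended vs. multiple-choice evaluation of a VLM.
# `ask_vlm` is a placeholder for the real model call (video + prompt -> text).

def ask_vlm(video, prompt):
    # Stub behavior mimicking what I observed: the model commits to a
    # letter when options are listed, but fails open-ended.
    if "A." in prompt:
        return "B"
    return "not sure"

def eval_open_ended(video, question, ground_truth):
    # Exact-match scoring on the free-form answer (case-insensitive).
    answer = ask_vlm(video, question)
    return answer.strip().lower() == ground_truth.strip().lower()

def eval_mcq(video, question, options, correct_letter):
    # Same question, but with four lettered choices appended to the prompt.
    prompt = question + "\n" + "\n".join(
        f"{letter}. {opt}" for letter, opt in zip("ABCD", options)
    )
    answer = ask_vlm(video, prompt)
    return answer.strip().upper().startswith(correct_letter)

q = "After the chase, which object does the thief hide first?"
opts = ["the phone", "the wallet", "the keys", "the map"]
print(eval_open_ended(None, q, "the wallet"))  # False in this stub
print(eval_mcq(None, q, opts, "B"))            # True in this stub
```

The gap between the two settings is exactly the pattern I saw: the same question scores 0% open-ended and 100% with options, which suggests the options themselves leak enough information for the model to answer by recognition or elimination rather than by genuine multi-step reasoning.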
My question is: why do VLMs behave like this?



