InstrAct: Towards Action-Centric Understanding in Instructional Videos

arXiv cs.AI / 4/13/2026


Key Points

  • The paper argues that instructional video understanding requires fine-grained action recognition and temporal relation modeling, which existing Video Foundation Models (VFMs) struggle with due to noisy web supervision and a “static bias” toward objects over motion cues.
  • It proposes InstrAction, a pretraining framework that filters noisy captions, creates action-centric hard negatives for contrastive learning, and uses an Action Perceiver to extract motion-relevant tokens from redundant video encodings.
  • InstrAction further improves temporal and cross-modal understanding via two auxiliary objectives: DTW-Align for sequential structure alignment and Masked Action Modeling (MAM) for stronger grounding between video and instructions.
  • The authors introduce the InstrAct Bench to evaluate action-centric understanding and report consistent improvements over state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.

Abstract

Understanding instructional videos requires recognizing fine-grained actions and modeling their temporal relations, which remains challenging for current Video Foundation Models (VFMs). This difficulty stems from noisy web supervision and a pervasive "static bias", where models rely on objects rather than motion cues. To address this, we propose InstrAction, a pretraining framework for learning action-centric representations of instructional videos. We first introduce a data-driven strategy that filters noisy captions and generates action-centric hard negatives to disentangle actions from objects during contrastive learning. At the visual feature level, an Action Perceiver extracts motion-relevant tokens from redundant video encodings. Beyond contrastive learning, we introduce two auxiliary objectives: Dynamic Time Warping alignment (DTW-Align) for modeling sequential temporal structure, and Masked Action Modeling (MAM) for strengthening cross-modal grounding. Finally, we introduce the InstrAct Bench to evaluate action-centric understanding, where our method consistently outperforms state-of-the-art VFMs on semantic reasoning, procedural logic, and fine-grained retrieval tasks.
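The idea behind action-centric hard negatives can be illustrated with a minimal sketch (not the paper's code): a negative caption is built by swapping only the verb, so it shares the same objects as the positive and forces a contrastive loss to attend to the action. The verb-swap table and similarity scores below are hypothetical placeholders.

```python
import math

def swap_action(caption: str, verbs: dict) -> str:
    # Replace a known verb with a different one, keeping the objects fixed,
    # which yields a hard negative that differs from the positive only in the action.
    return " ".join(verbs.get(w, w) for w in caption.split())

def info_nce(sim_pos: float, sims_neg: list, tau: float = 0.07) -> float:
    # Standard InfoNCE loss: the positive similarity competes against
    # the positive plus all negatives in a softmax (computed stably via max-shift).
    logits = [sim_pos / tau] + [s / tau for s in sims_neg]
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

# Hypothetical verb-swap table; a real pipeline would mine swaps from the caption corpus.
verbs = {"slice": "peel", "pour": "whisk"}
pos = "slice the onion on the board"
hard_neg = swap_action(pos, verbs)  # same objects, different action
```

A model biased toward objects would score the positive and this hard negative similarly, so the InfoNCE loss stays high until it learns to use motion cues: `info_nce(0.9, [0.1])` is near zero, while `info_nce(0.1, [0.9])` is large.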
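DTW-Align builds on classic dynamic time warping; a minimal sketch of the underlying recurrence (an assumed reading, not the paper's implementation) aligns a sequence of clip features to a sequence of instruction-step features while enforcing the monotonic order of procedural steps.

```python
def dtw_cost(cost: list) -> float:
    # Cumulative DTW cost over a (clips x steps) pairwise cost matrix:
    # each cell extends the cheapest of match-previous-step, stay-on-step,
    # or advance-both, so the alignment path is monotonic in time.
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]
            )
    return acc[n][m]

# Toy cost matrices (0 = good match): clips in the correct procedural order
# align cheaply, while temporally shuffled clips pay for out-of-order matches.
ordered = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
shuffled = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
```

Here `dtw_cost(ordered)` is 0.0 while `dtw_cost(shuffled)` is strictly higher, which is the signal an alignment objective of this kind can exploit to model sequential temporal structure.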
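Masked Action Modeling can likewise be sketched in miniature (again an assumed reading, not the paper's code): mask the verb in an instruction and recover it from visual evidence, so the text side cannot lean on object words alone. The verb vocabulary, toy embeddings, and dot-product scorer here are all hypothetical.

```python
def mask_verb(tokens: list, verb_vocab: set) -> list:
    # Replace any token from the action vocabulary with a [MASK] placeholder.
    return ["[MASK]" if t in verb_vocab else t for t in tokens]

def predict_masked_verb(video_feat: list, verb_embs: dict) -> str:
    # Score each candidate verb embedding against the video feature (dot product)
    # and return the best-matching action; this grounds the masked word in the video.
    return max(
        verb_embs,
        key=lambda v: sum(a * b for a, b in zip(video_feat, verb_embs[v])),
    )

verb_vocab = {"slice", "pour", "whisk"}          # hypothetical action vocabulary
tokens = "slice the onion thinly".split()
masked = mask_verb(tokens, verb_vocab)           # verb hidden, objects kept

verb_embs = {"slice": [1.0, 0.0], "pour": [0.0, 1.0]}  # toy verb embeddings
video_feat = [0.9, 0.1]                                 # toy clip feature
```

Because the objects ("the onion") survive masking, the only way to fill the blank correctly is to read the motion from the video, which is the cross-modal grounding pressure MAM is described as adding.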