Pioneering Perceptual Video Fluency Assessment: A Novel Task with Benchmark Dataset and Baseline

arXiv cs.CV / 3/30/2026


Key Points

  • The paper argues that existing Video Quality Assessment (VQA) approaches often fail to capture “video fluency” well, motivating the creation of Video Fluency Assessment (VFA) as a standalone temporal perceptual task.
  • It introduces a new fluency-focused benchmark dataset, FluVid, containing 4,606 in-the-wild videos with a balanced fluency distribution and new human study–based scoring criteria.
  • A large-scale benchmark across 23 methods is presented to evaluate progress on FluVid and to inform VFA-specific model design choices.
  • The authors propose a baseline model, FluNet, which uses temporal permuted self-attention (T-PSA) to better encode fluency-relevant cues and improve long-range frame interactions (see the sketch after this list).
  • Results indicate state-of-the-art performance on the proposed benchmark and provide a research roadmap for further exploration of VFA.
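
The summary does not spell out how T-PSA is constructed. As a rough, hedged sketch of the general idea only, the PyTorch block below runs self-attention over per-frame features in both the original and a permuted temporal order; the class name `TemporalPermutedSelfAttention`, the random-permutation scheme, and the residual fusion are illustrative assumptions, not the authors' actual design.

```python
# Hedged sketch of temporal self-attention with frame permutation.
# NOT the paper's T-PSA implementation; all design choices here are assumptions.
import torch
import torch.nn as nn


class TemporalPermutedSelfAttention(nn.Module):
    """Attend over the frame axis in both the original and a permuted order,
    so temporally distant frames interact directly (illustrative assumption)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) -- per-frame features from any backbone.
        b, t, d = frame_feats.shape
        x = self.norm(frame_feats)

        # Attention in the original temporal order.
        out_seq, _ = self.attn(x, x, x)

        # Attention in a permuted temporal order (random permutation per call here;
        # the real permutation strategy of T-PSA is unknown to this sketch).
        perm = torch.randperm(t, device=x.device)
        inv = torch.argsort(perm)              # inverse permutation
        x_perm = x[:, perm, :]
        out_perm, _ = self.attn(x_perm, x_perm, x_perm)
        out_perm = out_perm[:, inv, :]         # restore original frame order

        # Residual fusion of the two attention views.
        return frame_feats + 0.5 * (out_seq + out_perm)


# Usage: a batch of 2 clips, 8 frames each, 256-d features per frame.
feats = torch.randn(2, 8, 256)
block = TemporalPermutedSelfAttention(dim=256)
print(block(feats).shape)  # torch.Size([2, 8, 256])
```

Attending over a shuffled frame order is one plausible way to strengthen interactions between temporally distant frames, which is the property the key point above attributes to T-PSA.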

Abstract

Accurately estimating humans' subjective feedback on video fluency, e.g., motion consistency and frame continuity, is crucial for applications such as streaming and gaming. Yet it has long been overlooked: prior work has treated fluency only within the video quality assessment (VQA) task, merely as a sub-dimension of overall quality. In this work, we conduct pilot experiments and reveal that current VQA predictions largely underrepresent fluency, limiting their applicability. Motivated by this, we pioneer Video Fluency Assessment (VFA) as a standalone perceptual task focused on the temporal dimension. To advance VFA research, 1) we construct a fluency-oriented dataset, FluVid, comprising 4,606 in-the-wild videos with a balanced fluency distribution, featuring the first scoring criteria and human study dedicated to VFA; 2) we develop a large-scale benchmark of 23 methods, the most comprehensive on FluVid to date, gathering insights for VFA-tailored model designs; and 3) we propose a baseline model, FluNet, which deploys temporal permuted self-attention (T-PSA) to enrich input fluency information and enhance long-range inter-frame interactions. Our work not only achieves state-of-the-art performance but, more importantly, offers the community a roadmap for exploring solutions to VFA.
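
The abstract does not say which metrics the 23-method benchmark reports. Perceptual quality benchmarks conventionally score predictions with Spearman (SRCC) and Pearson (PLCC) correlation against human ratings, so the snippet below assumes that convention purely for illustration; the helper `fluency_correlations` and the toy scores are hypothetical, not taken from the paper.

```python
# Hedged sketch: conventional correlation metrics for a perceptual benchmark.
# SRCC/PLCC are assumed here; the paper's actual evaluation protocol may differ.
import numpy as np
from scipy.stats import spearmanr, pearsonr


def fluency_correlations(pred_scores, human_scores):
    """Rank (SRCC) and linear (PLCC) agreement between predicted and
    human-rated fluency scores over a set of videos."""
    pred = np.asarray(pred_scores, dtype=float)
    gt = np.asarray(human_scores, dtype=float)
    srcc, _ = spearmanr(pred, gt)
    plcc, _ = pearsonr(pred, gt)
    return srcc, plcc


# Toy example with made-up scores for five videos.
srcc, plcc = fluency_correlations([0.2, 0.8, 0.5, 0.9, 0.3],
                                  [0.25, 0.7, 0.55, 0.95, 0.2])
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```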