Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition
arXiv cs.CV / 3/12/2026
📰 News · Models & Research
Key Points
- The paper proposes SSL-V3: a Self-Supervised Learning-based Video Vision Transformer combined with No-reference Video Quality Assessment (VQA) for video classification to address label scarcity in VQA.
- It introduces a Combined-SSL mechanism in which predicted video quality scores directly modulate the feature maps used for video classification, so the supervised classification objective in turn tunes the VQA branch.
- The approach leverages self-supervised learning to fuse VQA with video recognition, mitigating the scarcity of labeled VQA data by using the classification task as supervision.
- It reports robust results on two datasets, including an accuracy of 94.87% on interview videos from the I-CONECT healthcare dataset, demonstrating effectiveness.
- By explicitly accounting for video quality, the joint framework improves both quality assessment and recognition performance.
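The core idea in the key points above, quality scores directly tuning classification feature maps, can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: the function `quality_modulated_features`, the sigmoid gate, and the `gamma` parameter are all assumptions about how a scalar no-reference quality score in [0, 1] might rescale a feature map before classification.

```python
import numpy as np

def quality_modulated_features(features, quality_score, gamma=4.0):
    """Hypothetical sketch: gate classification feature maps by a
    predicted no-reference quality score in [0, 1].

    A sigmoid centered at 0.5 maps the score to a multiplicative gate,
    so low-quality clips contribute weaker features to the classifier.
    """
    gate = 1.0 / (1.0 + np.exp(-gamma * (quality_score - 0.5)))
    return features * gate

# Toy (batch, dim) feature maps from a video backbone.
feats = np.ones((2, 4))
low_q = quality_modulated_features(feats, quality_score=0.1)
high_q = quality_modulated_features(feats, quality_score=0.9)
```

Under this sketch, a higher quality score yields a larger gate, so `high_q` retains more of the original feature magnitude than `low_q`; the classification loss would then backpropagate through the gate into the quality predictor, which is one plausible reading of the "supervised objective tunes VQA" linkage described above.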