Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition
arXiv cs.CV / 3/12/2026
📰 News · Models & Research
Key Points
- The paper proposes SSL-V3: a Self-Supervised Learning-based Video Vision Transformer combined with No-reference Video Quality Assessment (VQA) for video classification to address label scarcity in VQA.
- It introduces a Combined-SSL mechanism in which predicted video quality scores directly modulate the classification feature maps, so the supervised classification objective in turn trains the VQA branch.
- The approach leverages self-supervised learning to fuse VQA with video recognition, mitigating the scarcity of labeled VQA data by using the classification task as a supervisory signal.
- It reports robust results on two datasets, including an accuracy of 94.87% on interview videos from the I-CONECT healthcare dataset, demonstrating effectiveness.
- By explicitly considering video quality, the framework improves both video quality assessment and recognition performance in a joint setting.
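The quality-gating idea in the points above can be sketched in a few lines: a quality head maps features to a score in [0, 1], and that score rescales the classification features, so gradients from the classification loss flow back into the quality head. This is a minimal NumPy sketch of the mechanism only; the dimensions, the linear heads, and the function `quality_gated_logits` are illustrative assumptions, not the paper's actual transformer architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions, chosen only for illustration.
D_FEAT, N_CLASSES = 16, 4

# Hypothetical parameters: a no-reference quality head and a classifier head.
w_q = rng.normal(0.0, 0.1, size=(D_FEAT,))
W_c = rng.normal(0.0, 0.1, size=(D_FEAT, N_CLASSES))

def quality_gated_logits(feats):
    """feats: (batch, D_FEAT) video features.

    The quality head predicts a score in (0, 1); that score rescales
    the features before classification, so the classification loss
    also supervises the quality head -- no VQA labels required.
    """
    q = sigmoid(feats @ w_q)        # (batch,) predicted quality scores
    gated = feats * q[:, None]      # quality-modulated feature maps
    logits = gated @ W_c            # (batch, N_CLASSES) class scores
    return q, logits

feats = rng.normal(size=(2, D_FEAT))
q, logits = quality_gated_logits(feats)
```

Because the gate is a scalar multiplier on the features, low-quality videos contribute smaller activations, which is one simple way a joint objective can reward the quality head for scoring degraded inputs lower.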


