ViSAGE @ NTIRE 2026 Challenge on Video Saliency Prediction

arXiv cs.CV / 4/13/2026



Key Points

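ViSAGE (Video Saliency with Adaptive Gated Experts) is a multi-expert ensemble method proposed for the NTIRE 2026 Challenge on Video Saliency Prediction.
Each specialized decoder applies adaptive gating and modulation to progressively refine the video's spatio-temporal features.
Fusing the predictions of multiple experts at inference aggregates complementary inductive biases, with the aim of capturing complex saliency cues.
On the Private Test set, ViSAGE reportedly ranked first on two of the four metrics and outperformed most competing solutions on the other two, indicating strong generalization.
The implementation code is publicly available in the project's GitHub repository.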
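
To make inference-time fusion concrete, the sketch below combines per-expert saliency maps with a weighted average and renormalizes the result. The summary does not state which fusion rule ViSAGE uses, so the averaging scheme, the function name `fuse_saliency_maps`, and the `weights` parameter are all assumptions made for illustration only.

```python
def fuse_saliency_maps(expert_maps, weights=None):
    """Fuse per-expert saliency maps with a weighted average.

    Hypothetical helper: the fusion rule is an assumption, not the
    authors' documented method. Each map is a 2D list of equal size.
    """
    if not expert_maps:
        raise ValueError("need at least one expert map")
    n = len(expert_maps)
    if weights is None:
        weights = [1.0] * n  # assumption: equal trust in every expert
    if len(weights) != n:
        raise ValueError("one weight per expert is required")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    height, width = len(expert_maps[0]), len(expert_maps[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for expert_map, weight in zip(expert_maps, weights):
        for y in range(height):
            for x in range(width):
                fused[y][x] += (weight / total_weight) * expert_map[y][x]

    # Renormalize so the fused map sums to 1, like a fixation distribution.
    mass = sum(sum(row) for row in fused)
    if mass > 0:
        fused = [[value / mass for value in row] for row in fused]
    return fused


# Two toy 2x2 "experts" that each highlight a different pixel.
expert_a = [[0.0, 1.0], [0.0, 0.0]]
expert_b = [[0.0, 0.0], [0.0, 1.0]]
fused_equal = fuse_saliency_maps([expert_a, expert_b])
fused_biased = fuse_saliency_maps([expert_a, expert_b], weights=[3.0, 1.0])
```

With equal weights, the fused map splits its mass evenly between the two highlighted pixels; with `weights=[3.0, 1.0]`, the first expert's pixel receives three times the mass of the second.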

Abstract

In this report, we present our champion solution for the NTIRE 2026 Challenge on Video Saliency Prediction held in conjunction with CVPR 2026. To exploit complementary inductive biases for video saliency, we propose Video Saliency with Adaptive Gated Experts (ViSAGE), a multi-expert ensemble framework. Each specialized decoder performs adaptive gating and modulation to refine spatio-temporal features. The complementary predictions from different experts are then fused at inference. ViSAGE thereby aggregates diverse inductive biases to capture complex spatio-temporal saliency cues in videos. On the Private Test set, ViSAGE ranked first on two out of four evaluation metrics, and outperformed most competing solutions on the other two metrics, demonstrating its effectiveness and generalization ability. Our code has been released at https://github.com/iLearn-Lab/CVPRW26-ViSAGE.
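
The abstract mentions adaptive gating and modulation without giving the formulation. One common way to realize such an operation is a per-channel scale and shift whose effect is blended in through a sigmoid gate. The sketch below shows that generic pattern; the function `gated_modulation`, its parameters, and the blending formula are assumptions for illustration and should not be read as the ViSAGE decoder design.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_modulation(features, scale, shift, gate_logits):
    """Blend each feature with a scaled-and-shifted copy of itself.

    Hypothetical illustration: per channel, computes
        gate * (scale * f + shift) + (1 - gate) * f
    where gate = sigmoid(gate_logit). In a real model the scale, shift,
    and gate logits would be predicted from the input, not passed in.
    """
    if not (len(features) == len(scale) == len(shift) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    output = []
    for f, s, b, g in zip(features, scale, shift, gate_logits):
        gate = sigmoid(g)
        modulated = s * f + b
        output.append(gate * modulated + (1.0 - gate) * f)
    return output


# A gate logit of 0 gives gate = 0.5, i.e. an even blend;
# a strongly negative logit leaves the feature almost unchanged.
half_open = gated_modulation([2.0], [3.0], [1.0], [0.0])
nearly_closed = gated_modulation([2.0], [3.0], [1.0], [-50.0])
```

The gate lets the decoder decide, per channel, how much of the modulated signal to admit, which is the intuition behind "adaptive" gating in this sketch.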