Live Interactive Training for Video Segmentation

arXiv cs.CV / 3/31/2026


Key Points

  • To address the large number of user corrections required in interactive video segmentation, the paper proposes Live Interactive Training (LIT), in which the model learns online from human corrections at inference time.
  • Its primary instantiation, LIT-LoRA, continually updates a lightweight LoRA module on the fly, aiming to improve performance on subsequent frames of the same video and reduce the number of corrections.
  • On challenging benchmark cases, it reduces total corrections by an average of 18-34%, with a training overhead of roughly 0.5 s per correction.
  • The authors further demonstrate generality by adapting LIT to other segmentation models and extending it to CLIP-based fine-grained image classification.

Abstract

Interactive video segmentation often requires many user interventions for robust performance in challenging scenarios (e.g., occlusions, object separations, camouflage, etc.). Yet, even state-of-the-art models like SAM2 use corrections only for immediate fixes without learning from this feedback, leading to inefficient, repetitive user effort. To address this, we introduce Live Interactive Training (LIT), a novel framework for prompt-based visual systems where models also learn online from human corrections at inference time. Our primary instantiation, LIT-LoRA, implements this by continually updating a lightweight LoRA module on-the-fly. When a user provides a correction, this module is rapidly trained on that feedback, allowing the vision system to improve performance on subsequent frames of the same video. Leveraging the core principles of LIT, our LIT-LoRA implementation achieves an average 18-34% reduction in total corrections on challenging video segmentation benchmarks, with a negligible training overhead of ~0.5s per correction. We further demonstrate its generality by successfully adapting it to other segmentation models and extending it to CLIP-based fine-grained image classification. Our work highlights the promise of live adaptation to transform interactive tools and significantly reduce redundant human effort in complex visual tasks. Project: https://youngxinyu1802.github.io/projects/LIT/.
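The core mechanism described above, keeping the base model frozen while taking a few gradient steps on a low-rank adapter each time the user supplies a correction, can be illustrated with a minimal numpy sketch. This is a toy linear layer with a hand-derived MSE gradient, not the paper's actual LIT-LoRA (which adapts a segmentation model such as SAM2); the class and method names are hypothetical.

```python
import numpy as np

class LoRALinear:
    """Toy illustration of live LoRA adaptation: frozen base weight W
    plus a trainable low-rank update B @ A. Hypothetical sketch, not
    the paper's implementation."""

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))   # frozen base weight
        self.A = rng.normal(size=(rank, d_in))    # trainable (LoRA convention: random init)
        self.B = np.zeros((d_out, rank))          # trainable (LoRA convention: zero init)

    def forward(self, x):
        # Adapted layer: y = (W + B A) x; at init B = 0, so output equals base model.
        return (self.W + self.B @ self.A) @ x

    def live_update(self, x, target, lr=0.05, steps=10):
        """A few gradient steps of 0.5 * ||y - target||^2 on a single
        user correction (x, target). Only A and B change; W stays frozen,
        so the adaptation cost per correction is small."""
        for _ in range(steps):
            err = self.forward(x) - target            # dL/dy
            grad_B = np.outer(err, self.A @ x)        # dL/dB = err (A x)^T
            grad_A = np.outer(self.B.T @ err, x)      # dL/dA = B^T err x^T
            self.B -= lr * grad_B
            self.A -= lr * grad_A
```

After `live_update` is called on one correction, subsequent calls to `forward` on similar inputs are closer to the corrected target, which mirrors the paper's claim that feedback on one frame reduces corrections on later frames of the same video.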