Beyond Single-Sample: Reliable Multi-Sample Distillation for Video Understanding
arXiv cs.CV / 3/13/2026
Key Points
- Proposes R-MSD (Reliable Multi-Sample Distillation), a framework that models the variance across sampled teacher responses and uses a task-adaptive teacher pool to provide robust supervision for video understanding with large vision-language models (LVLMs).
- Introduces quality-aware signal matching combined with an adversarial distillation objective to filter teacher noise and maximize knowledge transfer.
- Extensive evaluations on video understanding benchmarks show R-MSD consistently outperforms single-sample distillation methods.
- With a 4B student model, R-MSD achieves gains on VideoMME (+1.5%), Video-MMMU (+3.2%), and MathVerse (+3.6%), and outperforms a 4B SFT+RL baseline under the same training budget.
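
To make the multi-sample idea concrete, here is a minimal sketch of how several sampled teacher responses could be filtered and weighted by quality before they supervise a student. It is an illustration under stated assumptions, not the paper's implementation: the `TeacherSample` structure, the `[0, 1]` quality score, the `min_quality` threshold, and the softmax `temperature` are all hypothetical, and the adversarial distillation objective and task-adaptive teacher pool are not modeled.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TeacherSample:
    """One sampled teacher response (hypothetical structure, not from the paper)."""
    text: str
    quality: float  # assumed score in [0, 1], e.g. from an answer check or verifier


def select_and_weight(samples: List[TeacherSample],
                      min_quality: float = 0.5,
                      temperature: float = 0.2) -> List[float]:
    """Return one weight per sample.

    Samples below `min_quality` get weight 0 (treated as teacher noise).
    The remaining samples get softmax weights over their quality scores.
    """
    kept = [i for i, s in enumerate(samples) if s.quality >= min_quality]
    weights = [0.0] * len(samples)
    if not kept:
        return weights  # no reliable supervision for this example

    # Subtract the max quality before exponentiating for numerical stability.
    top = max(samples[i].quality for i in kept)
    exps = {i: math.exp((samples[i].quality - top) / temperature) for i in kept}
    total = sum(exps.values())
    for i, e in exps.items():
        weights[i] = e / total
    return weights


def weighted_distill_loss(student_nll: List[float], weights: List[float]) -> float:
    """Quality-weighted sum of the student's per-sample negative log-likelihoods."""
    return sum(w * nll for w, nll in zip(weights, student_nll))


if __name__ == "__main__":
    samples = [
        TeacherSample("answer A with reasoning", quality=0.9),
        TeacherSample("answer A, different reasoning", quality=0.9),
        TeacherSample("off-topic answer", quality=0.1),
    ]
    w = select_and_weight(samples)
    # Hypothetical student NLL values, one per teacher sample.
    print(w, weighted_distill_loss([1.0, 3.0, 10.0], w))
```

In contrast, a single-sample setup would train on whichever one response happened to be drawn, including the low-quality one; weighting over several samples is one simple way to reduce that sensitivity.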




