Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning
arXiv cs.CV / 3/30/2026
Key Points
- The paper introduces SCORE (Surprise-augmented token COmpression via Reinforcement learning), a framework that learns an adaptive video-token compression policy for multimodal LLM video understanding rather than using fixed heuristic compression.
- SCORE uses a lightweight policy network with a surprise-augmented state representation that incorporates inter-frame residuals to better capture temporal dynamics and motion saliency.
- The policy is trained with group-wise reinforcement learning using a split-advantage estimator, together with a two-stage curriculum that moves from static pseudo-videos to real dynamic videos to stabilize training.
- Experiments on multiple video understanding benchmarks show SCORE outperforms existing compression baselines and can deliver a 16x prefill speedup while retaining ~99.5% performance at a 10% token retention ratio.
- The work targets two key problems in long-form video understanding: high computational cost from redundant visual tokens and performance degradation from “context rot.”
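
To make the inter-frame residual signal and the 10% retention budget concrete, here is a minimal Python sketch. It is not the paper's method: SCORE learns its compression policy with reinforcement learning, whereas this sketch simply ranks tokens by a hand-written residual score. The function names `surprise_scores` and `select_tokens`, the L2-norm scoring, and the treatment of the first frame are all illustrative assumptions.

```python
import math
from typing import List, Tuple

Token = List[float]   # one visual token embedding (hypothetical layout)
Frame = List[Token]   # all tokens of one frame, in a fixed spatial order


def surprise_scores(frames: List[Frame]) -> List[List[float]]:
    """Score each token by the L2 norm of its residual against the token
    at the same position in the previous frame.

    Assumption: the first frame has no predecessor, so its tokens are
    scored against an all-zero frame (i.e. by their own norm).
    """
    scores: List[List[float]] = []
    for t, frame in enumerate(frames):
        if t == 0:
            scores.append([math.hypot(*tok) for tok in frame])
        else:
            prev = frames[t - 1]
            scores.append([math.dist(tok, prev[i]) for i, tok in enumerate(frame)])
    return scores


def select_tokens(frames: List[Frame], retention_ratio: float = 0.10) -> List[Tuple[int, int]]:
    """Keep the highest-scoring tokens under a global budget.

    Returns (frame_index, token_index) pairs in temporal order.
    This is a fixed top-k heuristic standing in for a learned policy.
    """
    scores = surprise_scores(frames)
    flat = [(s, t, i) for t, row in enumerate(scores) for i, s in enumerate(row)]
    budget = max(1, round(retention_ratio * len(flat)))  # e.g. 10% of all tokens
    top = sorted(flat, key=lambda item: item[0], reverse=True)[:budget]
    return sorted((t, i) for _, t, i in top)


if __name__ == "__main__":
    # 5 frames x 4 tokens of a static scene, with one token changing in frame 3.
    frames = [[[1.0, 0.0] for _ in range(4)] for _ in range(5)]
    frames[3][2] = [4.0, 0.0]
    print(select_tokens(frames, retention_ratio=0.10))
```

In this toy input, the only tokens with large residuals are the one that changes and the one that changes back in the next frame, so a 10% budget over 20 tokens keeps exactly those two. A learned policy could weigh such residuals against other cues instead of ranking on them alone.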