Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism
arXiv cs.CV / 4/1/2026
Key Points
- The paper proposes FlexMem, a training-free visual memory mechanism to improve long-form video understanding for multimodal large language models (MLLMs) beyond typical input length limits.
- FlexMem treats visual KV caches as memory sources and uses a dual-pathway compression design to enable efficient memory transfer and writing as the video context grows (a rough illustrative sketch follows this list).
- It investigates multiple memory reading strategies tailored to different video understanding tasks, including streaming-style scenarios.
- Experiments with two video-MLLMs across five long-video datasets and one streaming dataset show substantial gains over existing efficient methods, including the ability to process more than 1,000 frames on a single RTX 3090 GPU.
- The approach can also strengthen base MLLMs, yielding benchmark performance comparable to or better than state-of-the-art proprietary models (e.g., GPT-4o and Gemini-1.5 Pro) on some tasks.
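The digest does not include any code from the paper, but the general pattern it describes (treating per-frame visual KV caches as memory entries, compressing them when writing, and selectively reading relevant entries back at query time) can be illustrated with a minimal sketch. All names here (`VisualMemory`, `write`, `read`, `top_k`, mean-pooling compression) are hypothetical and are not FlexMem's actual dual-pathway design.

```python
import torch


class VisualMemory:
    """Hypothetical sketch of a visual KV-cache memory.

    On write, each frame's keys/values are compressed (here: mean pooling
    over visual tokens) before being stored. On read, the entries whose
    compressed keys best match the query are returned. FlexMem's actual
    compression and reading strategies are more involved.
    """

    def __init__(self, top_k: int = 4):
        self.top_k = top_k
        self.keys = []    # one compressed key tensor per frame
        self.values = []  # one compressed value tensor per frame

    def write(self, frame_keys: torch.Tensor, frame_values: torch.Tensor) -> None:
        # frame_keys / frame_values: (num_tokens, head_dim)
        # Compress each frame's cache to a single vector by mean pooling.
        self.keys.append(frame_keys.mean(dim=0))
        self.values.append(frame_values.mean(dim=0))

    def read(self, query: torch.Tensor) -> torch.Tensor:
        # query: (head_dim,). Score every memory entry and return the top-k values.
        key_bank = torch.stack(self.keys)          # (num_frames, head_dim)
        scores = key_bank @ query                  # (num_frames,)
        k = min(self.top_k, len(self.keys))
        idx = scores.topk(k).indices
        return torch.stack([self.values[i] for i in idx])


# Toy usage: 16 "frames" of 64 visual tokens each, head_dim = 32.
memory = VisualMemory(top_k=4)
for _ in range(16):
    memory.write(torch.randn(64, 32), torch.randn(64, 32))

retrieved = memory.read(torch.randn(32))
print(retrieved.shape)  # torch.Size([4, 32])
```

This only captures the write-compressed / read-selectively pattern; the paper's contribution lies in how the compression pathways and task-specific reading strategies are designed so the mechanism stays training-free and memory-efficient.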