WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

arXiv cs.CL / 3/30/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The article introduces WorldMM, a multimodal memory agent designed to improve long video reasoning where existing video LLMs struggle with limited context and loss of visual details over long durations.
WorldMM uses three complementary memory types—episodic (multi-scale factual events), semantic (continuously updated concepts), and visual (preserves detailed scene information)—to overcome the text-only reliance of many memory-augmented approaches.
During inference, an adaptive retrieval agent iteratively chooses the most relevant memory source and dynamically varies temporal granularity based on the query until it decides enough information has been collected.
The method is reported to outperform prior state-of-the-art baselines on five long video question-answering benchmarks, achieving an average 8.4% performance gain.
The work specifically targets flexibility for events with variable durations by avoiding fixed temporal-scale retrieval strategies used in earlier memory methods.

Abstract

Recent advances in video large language models have demonstrated strong capabilities in understanding short clips. However, scaling them to hours- or days-long videos remains highly challenging due to limited context capacity and the loss of critical visual details during abstraction. Existing memory-augmented methods mitigate this by leveraging textual summaries of video segments, yet they heavily rely on text and fail to utilize visual evidence when reasoning over complex scenes. Moreover, retrieving from fixed temporal scales further limits their flexibility in capturing events that span variable durations. To address this, we introduce WorldMM, a novel multimodal memory agent that constructs and retrieves from multiple complementary memories, encompassing both textual and visual representations. WorldMM comprises three types of memory: episodic memory indexes factual events across multiple temporal scales, semantic memory continuously updates high-level conceptual knowledge, and visual memory preserves detailed information about scenes. During inference, an adaptive retrieval agent iteratively selects the most relevant memory source and leverages multiple temporal granularities based on the query, continuing until it determines that sufficient information has been gathered. WorldMM significantly outperforms existing baselines across five long video question-answering benchmarks, achieving an average 8.4% performance gain over previous state-of-the-art methods, showing its effectiveness on long video reasoning.

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 3/30DailyView insight →

Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer

Simon Willison's Blog

Beyond the Chatbot: Engineering Multi-Agent Ecosystems in 2026

Dev.to

I missed the "fun" part in software development

Dev.to

The Billion Dollar Tax on AI Agents

Dev.to

Hermes Agent: A Self-Improving AI Agent That Runs Anywhere

Dev.to

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

Key Points

Abstract

💡 Insights using this article

Related Articles

Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer

Beyond the Chatbot: Engineering Multi-Agent Ecosystems in 2026

I missed the "fun" part in software development

The Billion Dollar Tax on AI Agents

Hermes Agent: A Self-Improving AI Agent That Runs Anywhere

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer