MosaicMem: Hybrid Spatial Memory for Controllable Video World Models

arXiv cs.CV / 3/19/2026

Key Points

  • MosaicMem introduces a hybrid spatial memory that lifts patches into 3D to improve localization and targeted retrieval while preserving the model's ability to follow prompts during generation.
  • It uses a patch-and-compose interface to assemble spatially aligned patches in the queried view, preserving what should persist and allowing the model to inpaint what should evolve.
  • The approach adds PRoPE camera conditioning and two new memory-alignment methods; in the reported experiments it shows better pose adherence than implicit memory and stronger dynamic modeling than explicit baselines.
  • It enables minute-level navigation, memory-based scene editing, and autoregressive rollout, supporting long-horizon, memory-consistent video world modeling.
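
To make "lifting patches into 3D" and "targeted retrieval" in the points above concrete, here is a minimal sketch of the standard pinhole geometry such a memory could rely on. It is an illustration under assumptions, not the paper's implementation: the `Camera` class and the `lift` and `project` helpers are hypothetical names, a single focal length is assumed, and each patch is reduced to its center pixel with a known depth.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass(frozen=True)
class Camera:
    """Hypothetical pinhole camera: intrinsics plus a camera-to-world pose."""
    f: float                    # focal length in pixels (assumed equal in x and y)
    cx: float                   # principal point, x
    cy: float                   # principal point, y
    R: tuple[Vec3, Vec3, Vec3]  # camera-to-world rotation, stored as rows
    t: Vec3                     # camera center in world coordinates


def _mat_vec(R, v):
    """Return R @ v for a 3x3 matrix stored as rows."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def _mat_t_vec(R, v):
    """Return R^T @ v, the inverse rotation for an orthonormal R."""
    return tuple(sum(R[j][i] * v[j] for j in range(3)) for i in range(3))


def lift(cam: Camera, u: float, v: float, depth: float) -> Vec3:
    """Unproject a patch center (u, v) with known depth to a world-space point."""
    p_cam = ((u - cam.cx) / cam.f * depth, (v - cam.cy) / cam.f * depth, depth)
    rotated = _mat_vec(cam.R, p_cam)
    return (rotated[0] + cam.t[0], rotated[1] + cam.t[1], rotated[2] + cam.t[2])


def project(cam: Camera, p_world: Vec3):
    """Project a world point into cam; return None if it lies behind the camera."""
    offset = tuple(p_world[i] - cam.t[i] for i in range(3))
    x, y, z = _mat_t_vec(cam.R, offset)
    if z <= 0:
        return None
    return (cam.f * x / z + cam.cx, cam.f * y / z + cam.cy)


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
source_view = Camera(f=100.0, cx=50.0, cy=50.0, R=IDENTITY, t=(0.0, 0.0, 0.0))
query_view = Camera(f=100.0, cx=50.0, cy=50.0, R=IDENTITY, t=(1.0, 0.0, 0.0))

anchor = lift(source_view, 50.0, 50.0, depth=2.0)  # stored once in memory
location_in_query = project(query_view, anchor)    # looked up for a new pose
```

Under these assumptions, retrieval for a new camera pose is a projection of the stored 3D anchors followed by a bounds check, which is what makes lookup targeted rather than a search over all past frames.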

Abstract

Video diffusion models are moving beyond short, plausible clips toward world simulators that must remain consistent under camera motion, revisits, and intervention. Yet spatial memory remains a key bottleneck: explicit 3D structures can improve reprojection-based consistency but struggle to depict moving objects, while implicit memory often produces inaccurate camera motion even with correct poses. We propose Mosaic Memory (MosaicMem), a hybrid spatial memory that lifts patches into 3D for reliable localization and targeted retrieval, while exploiting the model's native conditioning to preserve prompt-following generation. MosaicMem composes spatially aligned patches in the queried view via a patch-and-compose interface, preserving what should persist while allowing the model to inpaint what should evolve. With PRoPE camera conditioning and two new memory alignment methods, experiments show improved pose adherence compared to implicit memory and stronger dynamic modeling than explicit baselines. MosaicMem further enables minute-level navigation, memory-based scene editing, and autoregressive rollout.
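
The patch-and-compose idea described in the abstract can be pictured with a toy sketch. Everything here is an assumption made for illustration: the function name `compose_patches`, the patch-grid representation, and the nearest-depth rule for resolving overlaps are not taken from the paper, which passes the composed patches to the model through its native conditioning rather than through an explicit canvas like this one.

```python
def compose_patches(projected, grid_w, grid_h):
    """Place retrieved memory patches on the query view's patch grid.

    projected: iterable of (patch_id, col, row, depth) tuples, where (col, row)
    is the grid cell a stored patch lands in after reprojection.
    Returns (canvas, inpaint_mask): canvas holds the patch id chosen for each
    cell (or None), and inpaint_mask is True where no memory patch landed.
    """
    canvas = [[None] * grid_w for _ in range(grid_h)]
    nearest = [[float("inf")] * grid_w for _ in range(grid_h)]

    for patch_id, col, row, depth in projected:
        # Skip patches that fall outside the queried view.
        if not (0 <= col < grid_w and 0 <= row < grid_h):
            continue
        # Assumed overlap rule: the patch closest to the camera wins.
        if depth < nearest[row][col]:
            nearest[row][col] = depth
            canvas[row][col] = patch_id

    inpaint_mask = [[cell is None for cell in row_cells] for row_cells in canvas]
    return canvas, inpaint_mask


retrieved = [
    ("wall", 0, 0, 2.0),
    ("chair", 0, 0, 1.0),     # in front of "wall" in the same cell
    ("window", 2, 1, 3.0),
    ("offscreen", 5, 5, 1.0), # lands outside the 3x2 grid
]
canvas, inpaint_mask = compose_patches(retrieved, grid_w=3, grid_h=2)
```

Cells filled from memory correspond to "what should persist", while cells flagged in `inpaint_mask` are left for the generative model to fill in, which is where evolving or newly revealed content would appear.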