MemCam: Memory-Augmented Camera Control for Consistent Video Generation

arXiv cs.AI / 3/30/2026


Key Points

  • MemCam is a memory-augmented framework for interactive video generation that uses previously generated frames as external memory to maintain scene consistency while the camera changes dynamically.
  • It conditions camera viewpoint control on retrieved historical frames to keep the generated scenes coherent over longer sequences, especially under large camera rotations.
  • To scale to longer contexts without excessive compute, MemCam introduces a context compression module that encodes memory frames into compact representations.
  • It further employs a co-visibility-based retrieval strategy to select the most relevant past frames, improving contextual usefulness while reducing computational overhead.
  • Experiments on interactive video generation tasks indicate MemCam substantially outperforms baseline methods and open-source state-of-the-art approaches on scene consistency in long-video scenarios.
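The retrieval step in the points above can be sketched as follows. This is a minimal illustration, not MemCam's actual method: scoring co-visibility by the cosine similarity of camera viewing directions, and the `memory` record layout, are assumptions made here for clarity.

```python
import math

def covisibility_score(dir_a, dir_b):
    # Cosine similarity between viewing directions, used here as a
    # crude proxy for co-visibility (assumption: the paper does not
    # specify its exact co-visibility criterion).
    dot = sum(a * b for a, b in zip(dir_a, dir_b))
    norm_a = math.sqrt(sum(a * a for a in dir_a))
    norm_b = math.sqrt(sum(b * b for b in dir_b))
    return dot / (norm_a * norm_b)

def retrieve_memory_frames(current_dir, memory, k=2):
    # Rank stored frames by co-visibility with the current viewpoint
    # and return the top-k most relevant ones as context.
    ranked = sorted(
        memory,
        key=lambda f: covisibility_score(current_dir, f["dir"]),
        reverse=True,
    )
    return ranked[:k]

memory = [
    {"id": 0, "dir": (1.0, 0.0, 0.0)},  # looking along +x
    {"id": 1, "dir": (0.0, 1.0, 0.0)},  # looking along +y
    {"id": 2, "dir": (0.9, 0.1, 0.0)},  # nearly along +x
]
selected = retrieve_memory_frames((1.0, 0.0, 0.0), memory, k=2)
print([f["id"] for f in selected])  # → [0, 2]
```

Frames whose views overlap most with the current camera are kept as conditioning context, while unrelated frames are skipped, which is how the retrieval reduces overhead while staying informative.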

Abstract

Interactive video generation has significant potential for scene simulation and video creation. However, existing methods often struggle with maintaining scene consistency during long video generation under dynamic camera control due to limited contextual information. To address this challenge, we propose MemCam, a memory-augmented interactive video generation approach that treats previously generated frames as external memory and leverages them as contextual conditioning to achieve controllable camera viewpoints with high scene consistency. To enable longer and more relevant context, we design a context compression module that encodes memory frames into compact representations and employs co-visibility-based selection to dynamically retrieve the most relevant historical frames, thereby reducing computational overhead while enriching contextual information. Experiments on interactive video generation tasks show that MemCam significantly outperforms existing baseline methods as well as open-source state-of-the-art approaches in terms of scene consistency, particularly in long video scenarios with large camera rotations.
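As a toy illustration of the context-compression idea from the abstract, memory frames could be encoded into shorter codes before being used as conditioning. The average pooling below is a hypothetical stand-in: MemCam's actual compression module is learned and is not specified in this summary.

```python
def compress_memory(frames, pool=4):
    # Average-pool each frame's feature vector into a shorter code.
    # Hypothetical stand-in for a learned context compression module:
    # each group of `pool` features collapses to their mean.
    compressed = []
    for feats in frames:
        code = [sum(feats[i:i + pool]) / pool
                for i in range(0, len(feats), pool)]
        compressed.append(code)
    return compressed

frames = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]]
print(compress_memory(frames))  # → [[2.5, 6.5]]
```

The point of any such scheme is the same as in the paper: shorter per-frame representations let more historical frames fit into the generator's context at the same compute budget.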