Static Scene Reconstruction from Dynamic Egocentric Videos

arXiv cs.CV / 3/25/2026


Key Points

  • The paper addresses 3D static scene reconstruction from long-form egocentric (first-person) videos, where fast camera motion and moving hands cause failures in existing static reconstruction methods like MapAnything.
  • It proposes a mask-aware reconstruction pipeline that suppresses dynamic-foreground tokens in the attention layers, preventing hand motion from contaminating the learned static map (see the attention sketch after this list).
  • The method uses chunked reconstruction combined with pose-graph stitching to maintain global consistency and reduce long-term trajectory drift (a simplified stitching sketch also follows this list).
  • Experiments on HD-EPIC and indoor drone datasets show improved absolute trajectory error and cleaner static geometry versus naive baselines, suggesting a practical extension of foundation-model-style approaches to dynamic first-person scenes.
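
The masking mechanism is easy to illustrate. The sketch below assumes the backbone uses standard scaled dot-product attention over per-patch tokens and that a boolean dynamic-foreground mask (e.g., from a hand segmenter) is available per token; the function name and tensor layout are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of mask-aware attention: dynamic-foreground tokens are
# excluded as attention *keys*, so static-scene features can no longer
# pull information from moving hands. All names here are hypothetical.
import torch

def mask_aware_attention(q, k, v, dynamic_mask):
    """q, k, v: (B, heads, N, d) token tensors.
    dynamic_mask: (B, N) bool, True where a token lies on dynamic foreground."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale       # (B, heads, N, N) logits
    # Forbid attending TO dynamic tokens (assumes each frame keeps at
    # least one static token, otherwise a row becomes all -inf).
    attn = attn.masked_fill(dynamic_mask[:, None, None, :], float("-inf"))
    return attn.softmax(dim=-1) @ v
```

Masking keys rather than dropping tokens keeps tensor shapes fixed, which makes this kind of suppression straightforward to retrofit into a pretrained reconstruction backbone.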

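The chunk stitching can be sketched in a similar spirit. The version below chains pairwise rigid alignments over frames shared by consecutive chunks; a real pose-graph stitcher would instead solve a joint nonlinear least-squares problem over all relative-pose constraints (e.g., with g2o or GTSAM) to distribute residual error globally. All function names and the overlap convention are assumptions for illustration.

```python
# Sketch: align each chunk to its predecessor via shared frames, then
# compose the alignments into one global trajectory. Hypothetical names.
import numpy as np

def rigid_align(src, dst):
    """Least-squares rotation R and translation t with dst ~ R @ src + t.
    src, dst: (N, 3) arrays of corresponding 3D points."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    U, _, Vt = np.linalg.svd((dst - mu_d).T @ (src - mu_s))
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # no reflections
    R = U @ S @ Vt
    return R, mu_d - R @ mu_s

def stitch_chunks(chunks, overlap):
    """chunks: list of (N_i, 3) camera centers, each in its own frame;
    consecutive chunks share `overlap` frames. Returns a global trajectory."""
    world, R_acc, t_acc = [chunks[0]], np.eye(3), np.zeros(3)
    for prev, cur in zip(chunks, chunks[1:]):
        # Map the current chunk into the previous chunk's frame, then
        # compose with the accumulated previous-to-world transform.
        R, t = rigid_align(cur[:overlap], prev[-overlap:])
        R_acc, t_acc = R_acc @ R, R_acc @ t + t_acc
        world.append((R_acc @ cur[overlap:].T).T + t_acc)
    return np.concatenate(world)
```
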
Abstract

Egocentric videos present unique challenges for 3D reconstruction due to rapid camera motion and frequent dynamic interactions. State-of-the-art static reconstruction systems, such as MapAnything, often degrade in these settings, suffering from catastrophic trajectory drift and "ghost" geometry caused by moving hands. We bridge this gap by proposing a robust pipeline that adapts static reconstruction backbones to long-form egocentric video. Our approach introduces a mask-aware reconstruction mechanism that explicitly suppresses dynamic foreground in the attention layers, preventing hand artifacts from contaminating the static map. Furthermore, we employ a chunked reconstruction strategy with pose-graph stitching to ensure global consistency and eliminate long-term drift. Experiments on HD-EPIC and indoor drone datasets demonstrate that our pipeline significantly improves absolute trajectory error and yields visually clean static geometry compared to naive baselines, effectively extending the capability of foundation models to dynamic first-person scenes.
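
Finally, the trajectory metric itself is standard: absolute trajectory error (ATE) is usually computed by rigidly aligning the estimated trajectory to ground truth and taking the per-frame RMSE of camera positions; monocular pipelines often use a Sim(3) alignment instead to absorb scale ambiguity. A minimal SE(3) version, reusing the rigid_align helper from the stitching sketch above, might look like this:

```python
import numpy as np

def ate_rmse(est, gt):
    """est, gt: (N, 3) time-synchronized camera centers.
    Uses the rigid_align helper defined in the stitching sketch."""
    R, t = rigid_align(est, gt)            # align estimate onto ground truth
    aligned = (R @ est.T).T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```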