SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

arXiv cs.CV / 3/25/2026


Key Points

  • The paper introduces SLARM, a feed-forward model designed to unify dynamic scene reconstruction, semantic understanding, and real-time streaming inference into a single framework.
  • SLARM captures complex, non-uniform motion with higher-order motion modeling and is trained solely through differentiable rendering, with no explicit flow supervision.
  • It distills language-aligned semantic representations from LSeg to enable semantic querying through natural language while tightly coupling semantics with geometry for improved accuracy and robustness.
  • For low-latency streaming, SLARM processes image sequences with window-based causal attention, maintaining stability without accumulating memory cost (see the sketch after this list).
  • Reported results show SLARM achieves state-of-the-art performance: a 21% improvement in motion accuracy, a 1.6 dB gain in reconstruction PSNR, and a 20% gain in segmentation mIoU over existing methods.
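To make the streaming point concrete, here is a minimal PyTorch sketch of window-based causal attention. The `windowed_causal_mask` helper, the tensor shapes, and the window size are illustrative assumptions rather than SLARM's actual implementation; the idea it demonstrates is that each frame token attends only to itself and the preceding `window - 1` positions.

```python
import torch

def windowed_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j iff i - window < j <= i."""
    idx = torch.arange(seq_len)
    rel = idx[None, :] - idx[:, None]      # rel[i, j] = j - i
    return (rel <= 0) & (rel > -window)    # causal AND within the window

def streaming_attention(q, k, v, window: int):
    # q, k, v: (batch, heads, seq, dim) frame-token projections (assumed shapes).
    mask = windowed_causal_mask(q.size(-2), window).to(q.device)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

Because tokens outside the window are never attended to, their keys and values can be discarded during streaming, which is why the memory footprint stays bounded as the sequence grows.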

Abstract

We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy by 21%, reconstruction PSNR by 1.6 dB, and segmentation mIoU by 20% over existing methods.
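As a rough illustration of what higher-order motion modeling can look like, the sketch below gives each scene point a short polynomial trajectory whose coefficients would be optimized only through a rendering loss. The `PolynomialMotion` class, its `order` parameter, and the `render`/`photometric_loss` placeholders are hypothetical; the summary does not specify SLARM's exact motion parameterization.

```python
import torch
import torch.nn as nn

class PolynomialMotion(nn.Module):
    """Hypothetical higher-order motion head: per-point polynomial trajectories.

    Each point gets coefficients c_0..c_K; its position at time t is
    sum_k c_k * t^k. order=1 is constant velocity; order>=2 adds acceleration
    and other non-uniform motion terms. Gradients reach the coefficients
    through a differentiable render loss, so no flow labels are required.
    """
    def __init__(self, num_points: int, order: int = 2):
        super().__init__()
        # coeffs[:, 0] is the base position; higher entries are motion terms.
        self.coeffs = nn.Parameter(torch.zeros(num_points, order + 1, 3))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: scalar tensor in [0, 1]; returns (num_points, 3) positions.
        powers = t ** torch.arange(self.coeffs.size(1), device=t.device, dtype=t.dtype)
        return (self.coeffs * powers[None, :, None]).sum(dim=1)

# Placeholder training signal, standing in for SLARM's actual pipeline:
# loss = photometric_loss(render(motion(t)), frame_t); loss.backward()
```

With order >= 2 the model can express acceleration, which is one way to capture the non-uniform motion that a constant-velocity (first-order) model would miss.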