RS-SSM: Refining Forgotten Specifics in State Space Model for Video Semantic Segmentation

arXiv cs.CV, March 26, 2026


Key Points

  • The paper addresses limitations of state space models in video semantic segmentation, noting that fixed-size state representations can “forget” specific spatiotemporal details needed for pixel-level accuracy and temporal consistency.
  • It introduces RS-SSM (Refining Specifics State Space Model), which adds targeted mechanisms to recover and refine the forgotten specific information during state space compression.
  • RS-SSM uses a Channel-wise Amplitude Perceptron (CwAP) to extract and align distribution characteristics of specific information in the state space.
  • It also proposes a Forgetting Gate Information Refiner (FGIR) that adaptively inverts and refines the forgetting gate matrix based on the learned specific-information distribution.
  • Experiments on four video semantic segmentation benchmarks show state-of-the-art results while retaining computational efficiency, and the authors provide public code on GitHub.
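The core intuition behind FGIR, recovering what the forgetting gate discarded, can be illustrated with a toy sketch. This is not the paper's implementation: the function name, the element-wise recurrence, and the scalar `refine_weight` are all hypothetical stand-ins for the learned refiner the paper describes.

```python
import numpy as np

def ssm_step_with_refinement(h_prev, x_t, A_t, B_t, refine_weight):
    """Toy gated SSM step with complementary refinement (hypothetical sketch)."""
    # Standard gated state update: content scaled down by A_t is "forgotten".
    h_t = A_t * h_prev + B_t * x_t
    # Inverted forgetting gate: (1 - A_t) marks exactly what compression discarded.
    forgotten = (1.0 - A_t) * h_prev
    # Complementarily re-inject the forgotten specifics (refine_weight is a
    # placeholder for the learned, distribution-aware refiner in the paper).
    return h_t + refine_weight * forgotten
```

With `refine_weight = 1` and no input contribution, the update recovers the previous state exactly, since `A_t * h + (1 - A_t) * h = h`; a learned weight below 1 would recover specifics only partially.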

Abstract

Recently, state space models have demonstrated efficient video segmentation through linear-complexity state space compression. However, Video Semantic Segmentation (VSS) requires pixel-level spatiotemporal modeling to maintain temporal consistency when segmenting semantic objects. While state space models can preserve common semantic information during state space compression, the fixed-size state space inevitably forgets specific information, which limits the models' capability for pixel-level segmentation. To tackle this issue, we propose a Refining Specifics State Space Model (RS-SSM) for video semantic segmentation, which performs complementary refinement of forgotten spatiotemporal specifics. Specifically, a Channel-wise Amplitude Perceptron (CwAP) is designed to extract and align the distribution characteristics of specific information in the state space. In addition, a Forgetting Gate Information Refiner (FGIR) is proposed to adaptively invert and refine the forgetting gate matrix in the state space model based on the specific-information distribution. Consequently, RS-SSM leverages the inverted forgetting gate to complementarily refine the specific information forgotten during state space compression, thereby enhancing the model's capability for spatiotemporal pixel-level segmentation. Extensive experiments on four VSS benchmarks demonstrate that RS-SSM achieves state-of-the-art performance while maintaining high computational efficiency. The code is available at https://github.com/zhoujiahuan1991/CVPR2026-RS-SSM.
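To make the CwAP idea concrete, the sketch below shows one plausible reading of a "channel-wise amplitude perceptron": summarize each channel by an amplitude statistic, then pass that vector through a small perceptron to produce per-channel gating weights. The function name, the mean-absolute-value statistic, and the two-layer ReLU/sigmoid shape are all assumptions for illustration; the paper does not specify this exact form here.

```python
import numpy as np

def channel_amplitude_weights(features, W1, W2):
    """Hypothetical CwAP-style gating: per-channel amplitude -> perceptron -> weights.

    features: array of shape (C, N), C channels over N spatial/temporal positions.
    W1, W2: perceptron weight matrices of shapes (H, C) and (C, H).
    Returns per-channel weights in (0, 1).
    """
    # Per-channel amplitude statistic (assumed: mean absolute activation).
    amp = np.abs(features).mean(axis=1)          # shape (C,)
    # Tiny two-layer perceptron over the amplitude vector.
    hidden = np.maximum(0.0, W1 @ amp)           # ReLU
    logits = W2 @ hidden
    # Sigmoid maps logits to channel-wise gating weights in (0, 1).
    return 1.0 / (1.0 + np.exp(-logits))
```

Weights like these could then modulate which channels of the recovered "forgotten" state are emphasized, aligning the refinement with the learned specific-information distribution.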