PanoSAM2: Lightweight Distortion- and Memory-aware Adaptations of SAM2 for 360 Video Object Segmentation

arXiv cs.CV / 4/10/2026


Key Points

  • The paper introduces PanoSAM2, a lightweight framework that adapts SAM2 to the 360 video object segmentation (360VOS) setting while keeping SAM2’s promptable VOS usability.
  • It addresses 360-specific challenges—projection distortion and left-right semantic inconsistency—using a Pano-Aware Decoder with seam-consistent receptive fields plus iterative distortion refinement across the 0/360 boundary.
  • It incorporates a Distortion-Guided Mask Loss that upweights regions and boundaries with larger distortion magnitudes to improve mask reliability under stretching artifacts.
  • To mitigate sparse object information in SAM2’s memory for 360 videos, it adds a Long-Short Memory Module that maintains a compact long-term object pointer to better re-instantiate and align short-term memories, improving temporal coherence.
  • Experiments report substantial performance gains over SAM2, including +5.6 on 360VOTS and +6.7 on PanoVOS, indicating the proposed distortion- and memory-aware adaptations are effective.
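
The seam-consistent receptive fields mentioned above can plausibly be realized with circular (wrap-around) padding along the longitude axis of the equirectangular feature map, so convolutions see the 0-degree and 360-degree edges as neighbors. The PyTorch sketch below illustrates this idea; the function name is hypothetical and the paper's actual decoder design may differ.

```python
import torch
import torch.nn.functional as F

def seam_consistent_conv(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Convolution whose receptive field wraps across the 0/360-degree seam.

    x:      (N, C, H, W) equirectangular feature map; the left and right
            image edges are physically adjacent in the panorama.
    weight: (C_out, C, kH, kW) convolution kernel.
    """
    kh, kw = weight.shape[-2:]
    # Circular padding horizontally (longitude), so the kernel sees pixels
    # from the opposite edge at the seam; zero padding vertically (latitude),
    # where no wrap-around exists.
    x = F.pad(x, (kw // 2, kw // 2, 0, 0), mode="circular")
    x = F.pad(x, (0, 0, kh // 2, kh // 2), mode="constant", value=0.0)
    return F.conv2d(x, weight)
```

A useful property of this construction is horizontal translation equivariance: rolling the panorama around the seam rolls the output identically, so predictions near the boundary stay consistent with the rest of the frame.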

Abstract

360 video object segmentation (360VOS) aims to predict temporally consistent masks in 360 videos, offering full-scene coverage that benefits applications such as VR/AR and embodied AI. Learning a 360VOS model is nontrivial due to the lack of high-quality labeled datasets. Recently, Segment Anything Models (SAMs), especially SAM2 -- with its memory-module design -- have shown strong, promptable VOS capability. However, directly applying SAM2 to 360VOS yields implausible results, as 360 videos suffer from projection distortion, semantic inconsistency between the left and right sides, and sparse object mask information in SAM2's memory. To this end, we propose PanoSAM2, a novel 360VOS framework built on lightweight distortion- and memory-aware adaptation strategies for SAM2, achieving reliable 360VOS while retaining SAM2's user-friendly prompting design. Concretely, to tackle the projection distortion and semantic inconsistency issues, we propose a Pano-Aware Decoder with seam-consistent receptive fields and iterative distortion refinement to maintain continuity across the 0/360-degree boundary. Meanwhile, a Distortion-Guided Mask Loss is introduced to weight pixels by distortion magnitude, emphasizing stretched regions and boundaries. To address the object sparsity issue, we propose a Long-Short Memory Module that maintains a compact long-term object pointer to re-instantiate and align short-term memories, thereby enhancing temporal coherence. Extensive experiments show that PanoSAM2 yields substantial gains over SAM2: +5.6 on 360VOTS and +6.7 on PanoVOS, demonstrating the effectiveness of our method.
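
For intuition on the Distortion-Guided Mask Loss: in an equirectangular projection, horizontal stretching at latitude phi grows as 1/cos(phi), so pixels near the poles are heavily distorted. One plausible instantiation of "weighting pixels by distortion magnitude" is the sketch below, which upweights per-pixel BCE by that stretch factor; the paper's exact weighting and loss terms may differ.

```python
import numpy as np

def distortion_weights(h: int, w: int) -> np.ndarray:
    """Per-pixel weights for an equirectangular frame of shape (h, w).

    Horizontal stretch at latitude phi grows as 1/cos(phi), so rows near
    the poles receive larger weights. Illustrative choice, not the paper's
    exact formulation.
    """
    # Row-centre latitudes in (-pi/2, pi/2).
    phi = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2
    row_w = 1.0 / np.clip(np.cos(phi), 1e-3, None)   # stretch magnitude
    weights = np.tile(row_w[:, None], (1, w))
    return weights / weights.mean()                   # keep the loss scale stable

def distortion_guided_bce(pred: np.ndarray, target: np.ndarray) -> float:
    """Pixel-wise binary cross-entropy upweighted by distortion magnitude."""
    eps = 1e-7
    p = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    return float((distortion_weights(*pred.shape) * bce).mean())
```

Normalizing the weight map to unit mean keeps the overall loss magnitude comparable to an unweighted BCE, so only the relative emphasis on stretched regions changes.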
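
The Long-Short Memory Module idea -- a compact long-term object pointer used to re-instantiate and align short-term memories -- can be sketched as follows. Here the pointer is an exponential moving average of per-frame object embeddings, and short-term entries are reweighted by cosine similarity to it; the class name, EMA update, and similarity-based alignment are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

class LongShortMemory:
    """Toy long-/short-term memory: a slow EMA pointer summarizes the
    object, and short-term entries are weighted by similarity to it."""

    def __init__(self, dim: int, short_capacity: int = 4, momentum: float = 0.9):
        self.pointer = np.zeros(dim)          # compact long-term object pointer
        self.short: list[np.ndarray] = []     # recent per-frame embeddings
        self.capacity = short_capacity
        self.momentum = momentum

    def update(self, frame_embed: np.ndarray) -> None:
        # Long-term pointer: a slow EMA keeps a stable object summary.
        self.pointer = self.momentum * self.pointer + (1 - self.momentum) * frame_embed
        # Short-term queue: drop the oldest entry when full.
        self.short.append(frame_embed)
        if len(self.short) > self.capacity:
            self.short.pop(0)

    def aligned_memory(self) -> np.ndarray:
        # Weight short-term entries by cosine similarity to the pointer, so
        # frames where the object is occluded or distorted contribute less.
        p = self.pointer / (np.linalg.norm(self.pointer) + 1e-8)
        sims = np.array([float(m @ p) / (np.linalg.norm(m) + 1e-8)
                         for m in self.short])
        w = np.exp(sims) / np.exp(sims).sum()
        return sum(wi * mi for wi, mi in zip(w, self.short))
```

The point of the sketch is the division of labor: the pointer is cheap to store across long horizons, while the short-term queue carries the detail needed for the current frame.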