AI Navigate

RSONet: Region-guided Selective Optimization Network for RGB-T Salient Object Detection

arXiv cs.CV / 3/16/2026

Key Points

  • The paper proposes RSONet, a region-guided selective optimization network for RGB-T salient object detection that addresses the inconsistency in salient regions between RGB and thermal images.
  • It introduces a region guidance stage with three parallel encoder–decoder branches equipped with context interaction (CI) and spatial-aware fusion (SF) modules to generate guidance maps and similarity scores.
  • In the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on similarity to mitigate cross-modality saliency distribution differences.
  • A dense detail enhancement (DDE) module refines low-level features with dense connections and visual state space blocks, while a mutual interaction semantic (MIS) module leverages high-level features for location cues via mutual fusion.
  • Experiments on RGB-T datasets show that the method achieves competitive performance against 27 state-of-the-art SOD methods.
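
The similarity-guided fusion described in the key points can be illustrated with a minimal sketch. The summary does not say how the similarity scores are computed or how the SO module weights the two modalities, so everything here is an assumption: similarity is taken to be a soft intersection-over-union between each modality's guidance map and a hypothetical reference map, and fusion is a similarity-weighted sum. The names `soft_iou` and `selective_fuse` are illustrative and do not come from the paper.

```python
from typing import List

Map2D = List[List[float]]


def soft_iou(a: Map2D, b: Map2D) -> float:
    """Soft IoU of two same-sized maps with values in [0, 1] (assumed similarity measure)."""
    inter = sum(min(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(max(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    # Two empty maps are treated as identical.
    return 1.0 if union == 0 else inter / union


def selective_fuse(feat_rgb: Map2D, feat_t: Map2D,
                   sim_rgb: float, sim_t: float) -> Map2D:
    """Blend two feature maps, weighting each modality by its similarity score."""
    total = sim_rgb + sim_t
    # Fall back to equal weights when neither modality agrees with the reference.
    w_rgb = 0.5 if total == 0 else sim_rgb / total
    w_t = 1.0 - w_rgb
    return [[w_rgb * r + w_t * t for r, t in zip(row_r, row_t)]
            for row_r, row_t in zip(feat_rgb, feat_t)]


# Hypothetical guidance maps: the RGB map matches the reference, the thermal map partly disagrees.
reference = [[0.0, 1.0], [0.0, 1.0]]
guide_rgb = [[0.0, 1.0], [0.0, 1.0]]
guide_t = [[1.0, 0.0], [0.0, 1.0]]

sim_rgb = soft_iou(guide_rgb, reference)  # 1.0
sim_t = soft_iou(guide_t, reference)      # 1/3

feat_rgb = [[4.0, 4.0], [4.0, 4.0]]
feat_t = [[0.0, 0.0], [0.0, 0.0]]
fused = selective_fuse(feat_rgb, feat_t, sim_rgb, sim_t)  # RGB weight is about 0.75
```

In this toy case the thermal guidance map agrees with the reference on only part of the salient region, so the fused features lean toward the RGB branch. A learned module would presumably operate on multi-channel feature tensors rather than on scalar weights.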

Abstract

This paper focuses on the inconsistency in salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network (RSONet) for RGB-T salient object detection, which consists of a region guidance stage and a saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and the spatial-aware fusion (SF) module, are designed to generate guidance maps, which are then used to calculate similarity scores. Then, in the saliency generation stage, the selective optimization (SO) module fuses RGB and thermal features based on these similarity scores to mitigate the impact of the inconsistent distribution of salient targets between the two modalities. Next, to generate high-quality detection results, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to the low-level features to refine detail information. In addition, the mutual interaction semantic (MIS) module is applied to the high-level features to mine location cues through a mutual fusion strategy. We conduct extensive experiments on the RGB-T dataset, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.
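
The dense connections mentioned for the DDE module follow a general pattern in which each layer receives the concatenation of the block input and every earlier layer's output. The sketch below shows only that wiring, under stated assumptions: features are flat lists of floats, and `mean_layer` is a trivial stand-in for the visual state space blocks, whose internals the abstract does not describe. The names `dense_block` and `mean_layer` are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def dense_block(x: Vector, layers: List[Callable[[Vector], Vector]]) -> Vector:
    """Apply layers with dense connectivity and return all collected features."""
    collected = [x]
    for layer in layers:
        # Each layer sees the input plus the outputs of all previous layers.
        concatenated = [v for feat in collected for v in feat]
        collected.append(layer(concatenated))
    return [v for feat in collected for v in feat]


def mean_layer(v: Vector) -> Vector:
    """Stand-in layer (assumption): emits one feature, the mean of its inputs."""
    return [sum(v) / len(v)]


features = dense_block([1.0, 3.0], [mean_layer, mean_layer])
```

Here the second layer receives three values (the two inputs plus the first layer's output), which is the property that lets later layers reuse low-level detail directly.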