Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation

arXiv cs.AI / 3/16/2026

Key Points

  • SERA introduces a Spatio-Semantic Expert Routing Architecture for referring image segmentation, featuring SERA-Adapter and SERA-Fusion to improve spatial coherence and boundary precision.
  • It employs a lightweight, expression-aware routing mechanism that adaptively weights expert contributions, together with a parameter-efficient tuning strategy that updates only normalization and bias terms (less than 1% of backbone parameters) to keep routing stable when the pretrained encoders are otherwise frozen.
  • SERA-Adapter inserts an expression-conditioned adapter into selected backbone blocks to enable expert-guided refinement and cross-modal attention, while SERA-Fusion reshapes token features into spatial grids with geometry-preserving expert transformations before multimodal interaction.
  • Experiments on standard benchmarks show that SERA consistently outperforms strong baselines, with notable gains on expressions requiring precise spatial localization and boundary delineation.
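
To make the "normalization and bias terms only" rule concrete, here is a minimal sketch of how such a parameter selection could be expressed and how small the tuned fraction is. The parameter names, the layer sizes, and the helpers `is_tunable` and `tunable_fraction` are all hypothetical. They mimic a common transformer naming style and are not taken from the SERA paper or any particular library.

```python
# Illustrative only: names and sizes describe one toy transformer block
# (hidden size 768, MLP size 3072). They are assumptions, not SERA's backbone.

def is_tunable(name: str) -> bool:
    """Return True for bias terms and for any parameter of a normalization layer."""
    parts = name.split(".")
    is_bias = parts[-1] == "bias"
    is_norm = any(p.startswith("norm") or p.startswith("ln") for p in parts)
    return is_bias or is_norm


def tunable_fraction(param_sizes: dict) -> float:
    """Fraction of scalar parameters that the selection rule would update."""
    total = sum(param_sizes.values())
    tuned = sum(n for name, n in param_sizes.items() if is_tunable(name))
    return tuned / total


# Hypothetical inventory: parameter name -> number of scalar values.
toy_block = {
    "blocks.0.attn.qkv.weight": 768 * 2304,
    "blocks.0.attn.qkv.bias": 2304,
    "blocks.0.attn.proj.weight": 768 * 768,
    "blocks.0.attn.proj.bias": 768,
    "blocks.0.norm1.weight": 768,
    "blocks.0.norm1.bias": 768,
    "blocks.0.mlp.fc1.weight": 768 * 3072,
    "blocks.0.mlp.fc1.bias": 3072,
    "blocks.0.mlp.fc2.weight": 3072 * 768,
    "blocks.0.mlp.fc2.bias": 768,
    "blocks.0.norm2.weight": 768,
    "blocks.0.norm2.bias": 768,
}

if __name__ == "__main__":
    print(f"tuned fraction: {tunable_fraction(toy_block):.4%}")
```

In this toy block the large weight matrices dominate the parameter count, so the selected terms come out well under 1% of the total, in line with the figure quoted above. In a real framework the same rule would normally be applied by toggling each parameter's trainable flag rather than by counting sizes.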

Abstract

Referring image segmentation aims to produce a pixel-level mask for the image region described by a natural-language expression. Although pretrained vision-language models have improved semantic grounding, many existing methods still rely on uniform refinement strategies that do not fully match the diverse reasoning requirements of referring expressions. Because of this mismatch, predictions often contain fragmented regions, inaccurate boundaries, or even the wrong object, especially when pretrained backbones are frozen for computational efficiency. To address these limitations, we propose SERA, a Spatio-Semantic Expert Routing Architecture for referring image segmentation. SERA introduces lightweight, expression-aware expert refinement at two complementary stages within a vision-language framework. First, we design SERA-Adapter, which inserts an expression-conditioned adapter into selected backbone blocks to improve spatial coherence and boundary precision through expert-guided refinement and cross-modal attention. We then introduce SERA-Fusion, which strengthens intermediate visual representations by reshaping token features into spatial grids and applying geometry-preserving expert transformations before multimodal interaction. In addition, a lightweight routing mechanism adaptively weights expert contributions while remaining compatible with pretrained representations. To make this routing stable under frozen encoders, SERA uses a parameter-efficient tuning strategy that updates only normalization and bias terms, affecting less than 1% of the backbone parameters. Experiments on standard referring image segmentation benchmarks show that SERA consistently outperforms strong baselines, with especially clear gains on expressions that require accurate spatial localization and precise boundary delineation.
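
The abstract does not spell out how the routing mechanism weights expert contributions. One common way to build such a router, shown here purely as an assumed illustration and not as SERA's actual design, is a softmax gate computed from the expression embedding, followed by a gate-weighted sum of expert outputs. The function names, the linear router, the toy experts, and the residual connection are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route(expr_embedding, router_weights):
    """Compute one gate per expert from the expression embedding.

    router_weights holds one row per expert (an assumed linear router).
    """
    logits = [
        sum(w * x for w, x in zip(row, expr_embedding))
        for row in router_weights
    ]
    return softmax(logits)


def mix_experts(feature, experts, gates):
    """Gate-weighted sum of expert outputs, added back to the input.

    The residual connection is an assumption made for this sketch.
    """
    outputs = [expert(feature) for expert in experts]
    mixed = [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(feature))
    ]
    return [f + m for f, m in zip(feature, mixed)]


# Two toy "experts" standing in for learned refinement modules.
experts = [
    lambda v: [2.0 * x for x in v],
    lambda v: [-x for x in v],
]
router_weights = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    gates = route([0.0, 0.0], router_weights)  # zero embedding gives uniform gates
    print(gates, mix_experts([1.0, 2.0], experts, gates))
```

Because the gates are a softmax, they are non-negative and sum to one, so the mixture stays on the same scale as the individual expert outputs. Changing the expression embedding shifts weight between experts without touching the frozen features themselves.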