Moondream Segmentation: From Words to Masks

arXiv cs.AI / 4/6/2026


Key Points

  • Moondream Segmentation is introduced as an extension of the Moondream 3 vision-language model that performs referring image segmentation from an input image and a textual expression.
  • The method autoregressively decodes a vector path and then iteratively refines its rasterization, combining vector decoding with raster refinement to produce a detailed, higher-quality final mask.
  • A reinforcement learning stage resolves ambiguities in the supervised signal by directly optimizing mask quality; its rollouts supply coarse-to-ground-truth training targets for the refinement module.
  • To improve evaluation reliability, the paper releases RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks to reduce noise from polygon annotations.
  • Reported results include 80.2% cIoU on RefCOCO (val) and 62.6% mIoU on LVIS (val), indicating strong segmentation performance across benchmarks.

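The vector-to-raster step described in the bullets above can be sketched as follows. This is a minimal illustration under my own assumptions, not the paper's implementation: `rasterize_path` is a hypothetical name, and the learned refiner that would iteratively sharpen the resulting coarse mask is omitted.

```python
import numpy as np

def rasterize_path(vertices, h, w):
    """Rasterize a closed polygon (a decoded vector path) into an (h, w)
    binary mask via even-odd ray casting at pixel centers."""
    ys, xs = np.mgrid[0:h, 0:w]
    px = xs.ravel() + 0.5  # pixel-center x coordinates
    py = ys.ravel() + 0.5  # pixel-center y coordinates
    inside = np.zeros(px.shape, dtype=bool)
    v = np.asarray(vertices, dtype=float)
    n = len(v)
    for i in range(n):
        x1, y1 = v[i]
        x2, y2 = v[(i + 1) % n]
        # An edge toggles "inside" for pixels whose horizontal ray crosses it.
        crosses = (y1 > py) != (y2 > py)
        with np.errstate(divide="ignore", invalid="ignore"):
            xint = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
        inside ^= crosses & (px < xint)
    return inside.reshape(h, w)
```

In the pipeline the paper describes, a mask like this would serve only as the coarse starting point that the refinement module iterates on.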
Abstract

We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
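The two reported metrics aggregate differently: cIoU pools intersections and unions over the whole split (so large objects weigh more), while mIoU averages per-example IoUs (each object counts equally). A minimal sketch of both, with function names of my own choosing:

```python
import numpy as np

def ciou(preds, gts):
    """Cumulative IoU: total intersection over total union across all
    prediction/ground-truth mask pairs (the RefCOCO convention)."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

def miou(preds, gts):
    """Mean IoU: average of per-example IoUs."""
    ious = [np.logical_and(p, g).sum() / np.logical_or(p, g).sum()
            for p, g in zip(preds, gts)]
    return float(np.mean(ious))
```

On the same predictions the two scores can diverge substantially, which is why the paper's 80.2% cIoU (RefCOCO) and 62.6% mIoU (LVIS) are not directly comparable numbers.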