Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation

arXiv cs.CV / 4/3/2026

Key Points

  • The paper addresses crowd instance segmentation where datasets commonly provide point labels, but high-quality region/mask labels are scarce and inaccurate, limiting downstream accuracy for counting and localization.
  • It introduces Dense Point-to-Mask Optimization (DPMO), combining SAM with a Nearest Neighbor Exclusive Circle (NNEC) constraint to convert dense crowd point annotations into improved mask annotations (with optional manual correction).
  • For prediction in dense scenes, it proposes Reinforced Point Selection (RPS), which uses Group Relative Policy Optimization (GRPO) to select the best point from sampled candidates before generating instance outputs.
  • Experiments report state-of-the-art performance on multiple crowd datasets (ShanghaiTech, UCF-QNRF, JHU-CROWD++, NWPU-Crowd), and the authors show that mask-supervised losses can significantly improve counting accuracy across models.
  • Overall, the work highlights that dense crowd segmentation can be improved by better point-to-mask pseudo-label generation and by reinforcement-style point selection rather than directly applying standard foundation-model prompting.
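
The summary does not spell out how the NNEC constraint is defined, so the following is only a minimal sketch under one plausible reading: each annotated point gets a circle whose radius is a fraction of the distance to its nearest neighboring point, and a candidate mask (for example, one produced by SAM) is clipped to that circle. The function names `nnec_radii` and `clip_mask_to_circle` and the `scale` parameter are hypothetical and not taken from the paper.

```python
import numpy as np

def nnec_radii(points, scale=0.5):
    """Assumed NNEC radius: scale * distance to the nearest other point.

    With scale <= 0.5 the open discs around any two points cannot overlap,
    because each radius is at most half the distance between those points.
    """
    pts = np.asarray(points, dtype=float)
    if len(pts) < 2:
        raise ValueError("need at least two points")
    diff = pts[:, None, :] - pts[None, :, :]      # pairwise offsets, shape (N, N, 2)
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise distances, shape (N, N)
    np.fill_diagonal(dist, np.inf)                # ignore each point's distance to itself
    return scale * dist.min(axis=1)

def clip_mask_to_circle(mask, center, radius):
    """Keep only mask pixels strictly inside the circle; center is (x, y)."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    inside = (xs - center[0]) ** 2 + (ys - center[1]) ** 2 < radius ** 2
    return mask & inside

# Hypothetical usage: three head points on a line.
radii = nnec_radii([(0, 0), (4, 0), (10, 0)])
clipped = clip_mask_to_circle(np.ones((5, 5), dtype=bool), center=(2, 2), radius=1.5)
```

Clipping to a per-point circle is one simple way to keep neighboring instances from bleeding into each other in dense regions; the paper's actual constraint may be applied differently, for instance inside the optimization rather than as a post-hoc clip.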

Abstract

Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations of traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best point from candidates sampled from the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on the ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new mask-supervised loss functions that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
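
To make the GRPO ingredient of RPS more concrete, here is a minimal sketch of the group-relative advantage that GRPO is generally built on: rewards for a group of sampled candidates are normalized by the group's own mean and standard deviation, so no separate value network is needed. The Gaussian candidate sampling, the `sigma` parameter, and the example reward values are assumptions made for illustration; the abstract does not describe the paper's reward or sampling scheme.

```python
import numpy as np

def sample_candidates(point, n, sigma, rng):
    """Hypothetical sampler: n candidate points drawn around an initial (x, y) prediction."""
    return np.asarray(point, dtype=float) + rng.normal(0.0, sigma, size=(n, 2))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical usage: score three candidates, then compare them within the group.
rng = np.random.default_rng(0)
candidates = sample_candidates((120.0, 64.0), n=3, sigma=2.0, rng=rng)
rewards = [1.0, 2.0, 3.0]                  # placeholder scores, e.g. a mask-quality measure
advantages = group_relative_advantages(rewards)
best = candidates[int(np.argmax(advantages))]
```

Candidates that score above the group average receive positive advantages and are reinforced during training; at inference, a trained selector would simply pick the candidate it rates highest.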