XSeg: A Large-scale X-ray Contraband Segmentation Benchmark For Real-World Security Screening

arXiv cs.CV / 4/7/2026


Key Points

  • The paper introduces XSeg, a new large-scale X-ray contraband segmentation benchmark with 98,644 images and 295,932 instance masks across 30 contraband categories, addressing the lack of real-world pixel-level supervision in prior work.
  • XSeg is built from public and synthesized X-ray sources, with a custom data cleaning pipeline to filter out low-quality samples and improve dataset reliability.
  • To reduce annotation cost and improve segmentation quality, the authors propose Adaptive Point SAM (APSAM), a SAM-based mask annotation model using adaptive point prompting rather than expensive pixel-level labeling.
  • APSAM targets two known SAM limitations: poor cross-domain generalization and difficulty with stacked or overlapping objects. An Energy-Aware Encoder improves the initialization of the mask decoder, making the model more sensitive to overlapping items, and an Adaptive Point Generator produces precise mask labels from a single coarse point prompt.
  • The authors report that APSAM achieves superior performance in experiments on XSeg, and they present the dataset and method as practical resources for improving real-world security screening models.
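
As a quick sanity check on the headline figures, the counts quoted above imply some simple averages. The short Python sketch below only does arithmetic on the three numbers given in the summary; the variable names are illustrative, and the averages say nothing about how masks are actually distributed across images or categories.

```python
# Counts as quoted in the summary above.
NUM_IMAGES = 98_644
NUM_MASKS = 295_932
NUM_CATEGORIES = 30

# Mean number of instance masks per image.
masks_per_image = NUM_MASKS / NUM_IMAGES

# Mean number of instance masks per category
# (a plain average; it assumes nothing about class balance).
masks_per_category = NUM_MASKS / NUM_CATEGORIES

print(f"{masks_per_image:.2f} masks per image on average")
print(f"{masks_per_category:.1f} masks per category on average")
```

The counts work out to exactly 3 masks per image and roughly 9,864 masks per category on average.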

Abstract

X-ray contraband detection is critical for public safety. However, current methods primarily rely on bounding box annotations, which limit model generalization and performance due to the lack of pixel-level supervision and real-world data. To address these limitations, we introduce XSeg. To the best of our knowledge, XSeg is the largest X-ray contraband segmentation dataset to date, comprising 98,644 images and 295,932 instance masks and covering 30 common contraband categories. The images are sourced from public datasets and our synthesized data, filtered through a custom data cleaning pipeline to remove low-quality samples. To enable accurate and efficient annotation and reduce manual labeling effort, we propose Adaptive Point SAM (APSAM), a specialized mask annotation model built upon the Segment Anything Model (SAM). We address SAM's poor cross-domain generalization and limited ability to detect stacked objects by introducing an Energy-Aware Encoder that enhances the initialization of the mask decoder, significantly improving sensitivity to overlapping items. Additionally, we design an Adaptive Point Generator that allows users to obtain precise mask labels with only a single coarse point prompt. Extensive experiments on XSeg demonstrate the superior performance of APSAM.
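
To make the idea of a "single coarse point prompt" more concrete, here is a minimal Python sketch of one way a rough click could be adjusted before it is handed to a promptable segmenter. This is not the paper's Adaptive Point Generator, whose design is not described in this summary. The function `refine_point`, the per-pixel `energy` map, and the window `radius` are all hypothetical, chosen only to illustrate snapping a click to a nearby high-response pixel.

```python
def refine_point(energy, point, radius=5):
    """Snap a coarse (row, col) click to the highest-energy pixel nearby.

    `energy` is a 2D list of floats (a hypothetical per-pixel saliency map).
    The search covers a square window around the click, clipped to the
    image bounds. If no pixel beats the click itself, the click is kept.
    """
    rows, cols = len(energy), len(energy[0])
    r, c = point
    best, best_val = (r, c), energy[r][c]
    for i in range(max(r - radius, 0), min(r + radius + 1, rows)):
        for j in range(max(c - radius, 0), min(c + radius + 1, cols)):
            if energy[i][j] > best_val:
                best, best_val = (i, j), energy[i][j]
    return best


# Toy 20x20 map with one bright pixel standing in for an object centre.
energy = [[0.0] * 20 for _ in range(20)]
energy[10][12] = 1.0

coarse_click = (8, 9)  # an imprecise user click near the object
refined = refine_point(energy, coarse_click)
print(refined)  # (10, 12)
```

In a real annotation tool the refined point would then be passed to the segmentation model as its point prompt. That step is left out here because it depends on the specific model's API.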