PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

arXiv cs.CV · April 20, 2026


Key Points

  • The paper introduces UAV Reasoning Segmentation, extending “reasoning segmentation” from ground scenes to remote-sensing/UAV imagery with challenges like oblique viewpoints and extreme scale variations.
  • It formalizes the task’s semantic requirements across three reasoning dimensions: Spatial, Attribute, and Scene-level reasoning, and uses these to structure the problem definition.
  • The authors create DRSeg, a large benchmark with 10k high-resolution aerial images and Chain-of-Thought QA supervision covering all three reasoning types.
  • As a companion to the benchmark, they propose PixDLM, a pixel-level multimodal language model designed as a simple, unified baseline for UAV reasoning segmentation.
  • Experiments on DRSeg report strong baseline performance while highlighting the difficulties unique to UAV reasoning segmentation, providing a foundation for future research.
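The paper pairs each image with Chain-of-Thought QA supervision across the three reasoning dimensions, but the summary above does not specify DRSeg's actual record schema. As a purely illustrative sketch, a single supervision sample for this kind of benchmark might look like the following; every field name here is a hypothetical assumption, not the published format.

```python
# Hypothetical sketch of one Chain-of-Thought QA record for a reasoning-
# segmentation benchmark like DRSeg. The schema is NOT from the paper;
# all field names below are illustrative assumptions.
sample = {
    "image": "uav_0001.jpg",          # high-resolution aerial image (assumed naming)
    "reasoning_type": "spatial",      # one of: "spatial", "attribute", "scene"
    "question": "Segment the vehicle closest to the building entrance.",
    "chain_of_thought": [             # step-by-step rationale supervising the answer
        "Locate the building entrance in the oblique view.",
        "Compare the distances of nearby vehicles to that entrance.",
        "Select the nearest vehicle as the segmentation target.",
    ],
    "mask": "uav_0001_mask.png",      # pixel-level ground-truth mask (assumed naming)
}

# Sanity check: the reasoning type must be one of the three dimensions
# the paper defines (Spatial, Attribute, Scene-level).
REASONING_TYPES = {"spatial", "attribute", "scene"}
assert sample["reasoning_type"] in REASONING_TYPES
```

Structuring records this way would let a training loop route samples by reasoning type and supervise both the textual rationale and the pixel mask, which is the kind of joint supervision the benchmark's CoT QA pairs imply.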

Abstract

Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.