FSDETR: Frequency-Spatial Feature Enhancement for Small Object Detection

arXiv cs.CV · April 17, 2026

📰 News · Models & Research

Key Points

  • Small object detection is difficult because downsampling degrades features, dense scenes cause mutual occlusion, and complex backgrounds interfere with recognition.
  • The paper introduces FSDETR, a frequency–spatial feature enhancement framework built on the RT-DETR baseline, aiming to better preserve complementary structural information.
  • FSDETR uses a Spatial Hierarchical Attention Block (SHAB) to capture both local details and global dependencies for stronger semantic representation.
  • To address occlusion and dense-scene challenges, it adds a Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) module that dynamically samples informative regions instead of attending uniformly over the feature map.
  • It also proposes a Frequency-Spatial Feature Pyramid Network (FSFPN) with a Cross-domain Frequency-Spatial Block (CFSB) that combines frequency filtering with spatial edge extraction, achieving strong small-object results with only 14.7M parameters.
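At its core, deformable-attention-style dynamic sampling means bilinearly sampling a feature map at a few offset locations around a reference point and combining the sampled values with softmax weights, rather than attending to every position. A minimal NumPy sketch of that single-query, single-head case (all names, shapes, and the offset/weight inputs are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a 2D feature map at fractional coords (y, x)."""
    H, W = feat.shape
    y0, x0 = max(int(np.floor(y)), 0), max(int(np.floor(x)), 0)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0]
            + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0]
            + wy * wx * feat[y1, x1])

def deformable_sample(feat, ref, offsets, weights):
    """Weighted sum of features sampled at ref + offsets.

    In a full deformable attention layer, offsets and weights would be
    predicted from the query; here they are passed in for illustration.
    """
    vals = np.array([bilinear_sample(feat, ref[0] + dy, ref[1] + dx)
                     for dy, dx in offsets])
    w = np.exp(weights - weights.max())
    w /= w.sum()                      # softmax over sampling points
    return float(vals @ w)
```

Because only a handful of points are sampled per query, this keeps attention cost linear in the number of sampling points, which is what makes it attractive for dense small-object scenes.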

Abstract

Small object detection remains a significant challenge due to feature degradation from downsampling, mutual occlusion in dense clusters, and complex background interference. To address these issues, this paper proposes FSDETR, a frequency-spatial feature enhancement framework built upon the RT-DETR baseline. By establishing a collaborative modeling mechanism, the method effectively leverages complementary structural information. Specifically, a Spatial Hierarchical Attention Block (SHAB) captures both local details and global dependencies to strengthen semantic representation. Furthermore, to mitigate occlusion in dense scenes, the Deformable Attention-based Intra-scale Feature Interaction (DA-AIFI) focuses on informative regions via dynamic sampling. Finally, the Frequency-Spatial Feature Pyramid Network (FSFPN) integrates frequency filtering with spatial edge extraction via the Cross-domain Frequency-Spatial Block (CFSB) to preserve fine-grained details. Experimental results show that with only 14.7M parameters, FSDETR achieves 13.9% AP_S on VisDrone 2019 and 48.95% AP50^tiny on TinyPerson, demonstrating strong performance on small-object benchmarks. The code and models are available at https://github.com/YT3DVision/FSDETR.
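The summary does not give CFSB's exact formulation, but "frequency filtering combined with spatial edge extraction" can be illustrated with standard operations: an FFT-based high-pass filter (frequency branch) blended with a Sobel gradient magnitude (spatial branch). A minimal NumPy sketch under those assumptions; the function names, the cutoff, and the fusion weight `alpha` are hypothetical, not the paper's design:

```python
import numpy as np

def high_pass_filter(feat, cutoff=0.25):
    """Suppress low frequencies in the 2D spectrum (frequency branch)."""
    H, W = feat.shape
    F = np.fft.fftshift(np.fft.fft2(feat))
    yy, xx = np.mgrid[:H, :W]
    dist = np.hypot(yy - H / 2, xx - W / 2)
    mask = dist > cutoff * min(H, W) / 2   # keep only the high-frequency band
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))

def sobel_edges(feat):
    """Gradient magnitude via 3x3 Sobel kernels (spatial branch)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    H, W = feat.shape
    gx, gy = np.zeros_like(feat), np.zeros_like(feat)
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            patch = feat[i - 1:i + 2, j - 1:j + 2]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)

def cross_domain_fuse(feat, alpha=0.5):
    """Blend frequency-domain detail with the spatial edge response."""
    return alpha * high_pass_filter(feat) + (1 - alpha) * sobel_edges(feat)
```

Both branches respond to fine structure: a flat region yields zero in each, while small-object boundaries light up in both, which is the intuition behind fusing the two domains to preserve detail that downsampling would otherwise wash out.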