HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images

arXiv cs.CV / 4/22/2026

📰 NewsModels & Research

共有:

Key Points

The paper highlights that aerial object detection models often struggle to generalize across datasets due to differences in spatial resolution, scene composition, sensor characteristics, and label/category coverage.
It proposes HMR-Net, a hierarchical modular learning framework that routes data to global experts via latent geographic embeddings and further decomposes scenes to route subregions to local, region-specific modules.
The framework includes a conditional expert mechanism that leverages external semantic inputs (such as category names or textual descriptions) to detect novel object categories at inference time without retraining or fine-tuning.
Experiments on four aerial-image datasets show gains in multi-dataset generalization, regional specialization, and open-category (novel category) detection.

Abstract

Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.