Camouflage-aware Image-Text Retrieval via Expert Collaboration

arXiv cs.CV / 4/3/2026


Key Points

  • The paper proposes a new benchmark task, camouflage-aware image-text retrieval (CA-ITR), aimed at improving cross-modal alignment for camouflaged scenes where existing methods struggle.
  • It introduces a dedicated dataset, CamoIT, with about 10.5K samples and multi-granularity text annotations to evaluate retrieval under camouflage and complex-image conditions.
  • The authors present CECNet, a camouflage-expert collaborative network with dual visual-encoder branches (holistic features plus a branch specialized for camouflaged objects).
  • A confidence-conditioned graph attention mechanism (C2GA) is used to fuse complementary information across branches, improving robustness.
  • Experiments on CamoIT show CECNet delivers roughly a 29% overall accuracy boost over seven representative retrieval baselines, and the dataset/code are shared via GitHub.

Abstract

Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, robust image-text cross-modal alignment remains under-explored in this field, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task and formulate a new task dubbed "camouflage-aware image-text retrieval" (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising ~10.5K samples with multi-granularity textual annotations. Benchmark results on CamoIT reveal the underlying challenges that CA-ITR poses for existing cutting-edge retrieval techniques, which stem mainly from objects' camouflage properties as well as complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C²GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves a ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.
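The paper does not publish the C²GA equations in this summary, but the idea it describes (fusing a holistic branch with a camouflage-expert branch via graph attention conditioned on a confidence score) can be sketched minimally. The sketch below is an illustrative assumption, not the authors' implementation: branch features are treated as graph nodes, and attention toward the camouflage-expert nodes is biased by that branch's confidence, so an unreliable expert contributes less to the fused embedding. All function and variable names (`c2ga_fuse`, `camo_conf`, etc.) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def c2ga_fuse(holistic, camo, camo_conf):
    """Hypothetical sketch of confidence-conditioned graph attention.

    holistic:  (N1, d) features from the holistic image branch
    camo:      (N2, d) features from the camouflage-expert branch
    camo_conf: scalar in (0, 1], confidence of the expert branch
    Returns a pooled joint embedding of shape (d,).
    """
    nodes = np.concatenate([holistic, camo], axis=0)   # graph nodes (N1+N2, d)
    d = nodes.shape[1]
    logits = nodes @ nodes.T / np.sqrt(d)              # pairwise affinities
    # Condition attention on confidence: additively down-weight edges
    # that attend INTO the camouflage-expert nodes when confidence is low.
    bias = np.zeros(len(nodes))
    bias[len(holistic):] = np.log(camo_conf + 1e-8)
    attn = softmax(logits + bias[None, :], axis=-1)    # row-normalized weights
    fused = attn @ nodes                               # one message-passing step
    return fused.mean(axis=0)                          # pooled joint embedding
```

With `camo_conf` near 1 the two branches mix on equal footing; as it shrinks, the fused embedding degrades gracefully toward the holistic branch alone, which matches the robustness role the summary attributes to C²GA.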