TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

arXiv cs.CV / 4/24/2026

Key Points

  • The paper targets Composed Image Retrieval (CIR), where a user retrieves an image using a reference image plus modification text, and identifies two practical shortcomings: insufficient entity coverage and clause–entity misalignment.
  • It introduces two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR, designed to broaden the types of salient changes expressed in queries.
  • It proposes TEMA (Text-oriented Entity Mapping Architecture), described as the first CIR framework that is built for multi-modification while still supporting simple modifications.
  • Experiments across four benchmarks show TEMA improves performance for both original and multi-modification scenarios while keeping a good balance between retrieval accuracy and computational efficiency.
  • The authors release the code and the constructed multi-modification datasets via the provided GitHub repository.
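To make the CIR setup in the points above concrete, here is a minimal sketch of how a composed query (reference image plus modification text) can be turned into a single vector and used to rank a gallery. This is a generic late-fusion baseline for illustration only, not TEMA's architecture. The function names, the `text_weight` parameter, and the toy three-dimensional embeddings are all assumptions; a real system would take its embeddings from a vision-language encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def compose_query(image_emb, text_emb, text_weight=0.5):
    """Hypothetical late fusion: weighted sum of the normalized embeddings.

    text_weight is an illustrative knob, not a parameter from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    fused = [(1 - text_weight) * i + text_weight * t for i, t in zip(img, txt)]
    return l2_normalize(fused)


def rank_gallery(query, gallery):
    """Return gallery indices sorted by cosine similarity, best match first."""
    scores = [
        sum(q * g for q, g in zip(query, l2_normalize(item)))
        for item in gallery
    ]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumed): axis 0 ~ reference content, axis 1 ~ requested change.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = [
    [1.0, 0.0, 0.0],  # looks like the reference, change not applied
    [1.0, 1.0, 0.0],  # reference content with the requested change
    [0.0, 0.0, 1.0],  # unrelated image
]

query = compose_query(reference_image, modification_text)
ranking = rank_gallery(query, gallery)
print(ranking)
```

In this toy case the candidate that keeps the reference content and also reflects the requested change ranks first. A single fused vector like this has no explicit notion of which clause refers to which entity, which is the kind of gap the multi-modification setting described above is meant to expose.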

Abstract

Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. To address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.