InterPartAbility: Text-Guided Part Matching for Interpretable Person Re-Identification

arXiv cs.CV / 5/1/2026


Key Points

  • The paper addresses the interpretability gap in text-to-image person re-identification (TI-ReID), where vision-language models can match images but provide explanations that are not reliably tied to semantic concepts.
  • It introduces InterPartAbility, which performs explicit part-wise matching and phrase-region grounding to better connect visual evidence to meaningful textual parts.
  • The proposed patch-phrase interaction module (PPIM) uses lightweight, open-vocabulary concept-level supervision to guide a standard TI-ReID model to attend to the corresponding image regions for each part phrase (see the sketch after this list).
  • InterPartAbility also constrains CLIP ViT self-attention to produce spatially concentrated patch activations that align with part-level phrases, enabling more grounded explanation maps.
  • The work adds a perturbation-based quantitative interpretability protocol, including counterfactual region masking that tests how retrieval quality degrades when top explanatory regions are removed, and reports SOTA interpretability on CUHK-PEDES and ICFG-PEDES without sacrificing retrieval accuracy.
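
As a rough illustration of what part-wise matching and phrase-region grounding involve, the PyTorch sketch below computes, for each part phrase, a softmax attention map over image patches and a per-phrase matching score. This is a minimal sketch under assumed tensor shapes; the actual PPIM architecture, losses, and attention constraints are not specified in this summary, and the function name `phrase_region_grounding`, the `temperature` value, and the mean aggregation of part scores are illustrative choices rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def phrase_region_grounding(patch_feats: torch.Tensor,
                            phrase_feats: torch.Tensor,
                            temperature: float = 0.07):
    """Illustrative part-wise matching between part phrases and image patches.

    patch_feats:  [num_patches, dim]  patch embeddings from a ViT image encoder
    phrase_feats: [num_phrases, dim]  embeddings of part-level phrases
    Returns per-phrase attention maps over patches and part-wise match scores.
    """
    # L2-normalize so dot products become cosine similarities (CLIP-style).
    patch_feats = F.normalize(patch_feats, dim=-1)
    phrase_feats = F.normalize(phrase_feats, dim=-1)

    # [num_phrases, num_patches]: similarity of every phrase to every patch.
    sim = phrase_feats @ patch_feats.t()

    # Softmax over patches gives, for each phrase, a spatial attention map
    # that can be reshaped to the patch grid and visualized as an explanation.
    attn = F.softmax(sim / temperature, dim=-1)

    # Phrase-conditioned region features and per-phrase matching scores.
    region_feats = attn @ patch_feats                    # [num_phrases, dim]
    part_scores = (region_feats * phrase_feats).sum(-1)  # [num_phrases]

    # Overall image-text score as the mean of the part-wise scores.
    return attn, part_scores, part_scores.mean()
```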

Abstract

Text-to-image person re-identification (TI-ReID) relies on natural-language descriptions to retrieve the top matching individuals from a large gallery of images. While recent large vision-language models (VLMs) achieve strong retrieval performance, their decisions remain largely uninterpretable. Existing interpretability approaches in TI-ReID rely solely on slot attention to highlight attended regions, but fail to reliably bind visual regions to semantically meaningful concepts, limiting explanations to qualitative visualizations over a restricted vocabulary. This paper introduces InterPartAbility, an interpretable TI-ReID method that performs explicit part-wise matching and enables phrase-region grounding. A new patch-phrase interaction module (PPIM) is proposed that trains a standard TI-ReID model with lightweight, open-vocabulary concept-level guidance: concept-based part phrases provide evidence that encourages the model to attend to the corresponding image regions. InterPartAbility further constrains CLIP ViT self-attention to produce spatially concentrated patch activations aligned with each part-level phrase, yielding grounded explanation maps. A quantitative interpretability protocol for TI-ReID is introduced by adapting perturbation-based evaluation metrics, including counterfactual region masking that measures retrieval degradation when top-ranked explanatory regions are removed. Empirical results on challenging benchmarks such as CUHK-PEDES and ICFG-PEDES show that InterPartAbility achieves state-of-the-art (SOTA) interpretability performance under these metrics while sustaining competitive retrieval accuracy. (The authors note that their code is included in the supplementary materials and will be made public.)
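
To make the perturbation-based protocol concrete, below is a minimal sketch of counterfactual region masking for a single image-query pair: the most important patches according to an explanation map are zeroed out, and the drop in matching score is recorded. The helper name `encode_image_from_patches`, the `mask_ratio` parameter, and the use of cosine similarity rather than full retrieval ranks are assumptions made for illustration; the paper's exact protocol may differ.

```python
import torch

@torch.no_grad()
def counterfactual_masking_drop(image_patches: torch.Tensor,
                                explanation_map: torch.Tensor,
                                text_feat: torch.Tensor,
                                encode_image_from_patches,
                                mask_ratio: float = 0.2) -> float:
    """Illustrative perturbation metric: remove the top-ranked explanatory
    patches and measure how much the image-text matching score drops.

    image_patches:   [num_patches, patch_dim] patch tokens of one image
    explanation_map: [num_patches] importance score per patch from the model
    text_feat:       [dim] embedding of the query description
    encode_image_from_patches: callable mapping patch tokens -> [dim] embedding
    """
    num_patches = explanation_map.numel()
    k = max(1, int(mask_ratio * num_patches))

    # Matching score before any perturbation.
    orig_feat = encode_image_from_patches(image_patches)
    orig_score = torch.cosine_similarity(orig_feat, text_feat, dim=-1)

    # Zero out the k most important patches according to the explanation.
    top_idx = explanation_map.topk(k).indices
    masked = image_patches.clone()
    masked[top_idx] = 0.0

    # Matching score after the top explanatory regions are removed.
    masked_feat = encode_image_from_patches(masked)
    masked_score = torch.cosine_similarity(masked_feat, text_feat, dim=-1)

    # A faithful explanation should yield a large score drop.
    return (orig_score - masked_score).item()
```

In a full evaluation, this per-pair drop would be aggregated over the gallery (for example, as degradation in Rank-1 or mAP), so that explanations which truly drive the retrieval decision produce larger degradation than random masking.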