From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification

arXiv cs.AI / 4/27/2026


Key Points

  • Existing CLIP-based person re-identification pipelines often rely on global [CLS] feature aggregation, which is optimized for image-text alignment rather than spatial selectivity and therefore becomes fragile under occlusion and cross-camera changes.
  • The paper introduces SAGA-ReID, a method that reconstructs identity representations by aligning intermediate patch tokens with anchor vectors defined in CLIP’s text-embedding space, boosting robustness without needing per-image textual descriptions.
  • Experiments explicitly isolate the aggregation mechanism under synthetic masking (missing identity signal) and realistic distractor overlap (semantically confusing signals), showing SAGA’s gains grow as occlusion increases in both settings.
  • Across multiple benchmarks, SAGA-ReID delivers consistent improvements over CLIP-ReID, with the biggest benefit on occluded data where global pooling fails—up to +10.6 Rank-1.
  • The authors also show SAGA’s structured reconstruction outperforms sequential patch aggregation even with a stronger backbone, suggesting the limitation is specific to aggregation rather than just backbone strength or architectural complexity.

Abstract

CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP's text embedding space, emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions: synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal. SAGA's advantage over global pooling grows substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA's aggregation also outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at https://github.com/ipl-uw/Structured-Anchor-Guided-Aggregation-for-ReID.
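To make the core idea concrete, here is a minimal NumPy sketch of anchor-guided aggregation as the abstract describes it: learnable anchor vectors score intermediate patch tokens, and a per-anchor softmax over patches pools identity evidence while downweighting patches that match no anchor. All names, shapes, and the temperature value are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def anchor_guided_aggregate(patch_tokens, anchors, tau=0.07):
    """Hypothetical sketch of anchor-guided patch aggregation.

    patch_tokens: (N, D) intermediate ViT patch embeddings for one image
    anchors:      (K, D) learnable anchor vectors (in SAGA-ReID these are
                  parameterized in CLIP's text-embedding space)
    tau:          softmax temperature (assumed value, for illustration)

    Returns a (K * D,) identity representation built from K
    anchor-conditioned pooled features.
    """
    # L2-normalize so dot products become cosine similarities.
    p = patch_tokens / np.linalg.norm(patch_tokens, axis=1, keepdims=True)
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)

    # (K, N) similarity of every anchor to every patch.
    sim = a @ p.T

    # Softmax over patches per anchor: each anchor attends to the
    # patches that best match it; occluded or off-identity patches
    # receive low weight for every anchor.
    w = np.exp(sim / tau)
    w /= w.sum(axis=1, keepdims=True)

    # (K, D) anchor-conditioned pooled features, flattened into one vector.
    pooled = w @ patch_tokens
    return pooled.reshape(-1)
```

In a trained model the anchors and temperature would be learned end-to-end and the pooled features fed to the ReID loss; the sketch only shows the aggregation step that replaces global [CLS] pooling.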