AttriBE: Quantifying Attribute Expressivity in Body Embeddings for Recognition and Identification

arXiv cs.CV / 5/1/2026


Key Points

  • The paper analyzes person re-identification embeddings by introducing “expressivity,” measured as mutual information between learned features and target attributes using a secondary neural network.
  • Experiments on three transformer-based ReID models show that BMI is the most expressively encoded attribute, especially in deeper layers, while pose expressivity peaks in intermediate layers and shifts across training epochs.
  • In cross-spectral person identification across infrared bands (short-, medium-, and long-wave), pitch becomes as expressive as BMI and attribute expressivity increases monotonically with depth, indicating greater reliance on structural cues when bridging modalities.
  • The authors conclude that transformer-based ReID embeddings contain an attribute hierarchy, with morphometric information persistently represented and pose contributing more strongly under cross-spectral conditions.
  • The findings provide a quantitative way to study and potentially mitigate fairness/generalization risks caused by attribute leakage (e.g., gender, pose, BMI).

Abstract

Person re-identification (ReID) systems that match individuals across images or video frames are essential in many real-world applications. However, existing methods are often influenced by attributes such as gender, pose, and body mass index (BMI), which vary in unconstrained settings and raise concerns related to fairness and generalization. To address this, we extend the notion of expressivity, defined as the mutual information between learned features and specific attributes, using a secondary neural network to quantify how strongly attributes are encoded. Applying this framework to three transformer-based ReID models on a large-scale visible-spectrum dataset, we find that BMI consistently shows the highest expressivity in deeper layers. Attributes in the final representation are ranked as BMI > Pitch > Gender > Yaw, and expressivity evolves across layers and training epochs, with pose peaking in intermediate layers and BMI strengthening with depth. We further extend the analysis to cross-spectral person identification across infrared modalities, including short-wave, medium-wave, and long-wave infrared. In this setting, pitch becomes comparable to BMI and attribute expressivity increases monotonically with depth, suggesting increased reliance on structural cues when bridging modality gaps. Overall, the results show that transformer-based ReID embeddings encode a hierarchy of implicit attributes, with morphometric information persistently embedded and pose contributing more strongly under cross-spectral conditions.
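As a rough illustration of the expressivity measure, the sketch below lower-bounds the mutual information I(Z; A) between embeddings Z and a binary attribute A via I(Z; A) ≥ H(A) − H(A|Z), approximating H(A|Z) with a probe's cross-entropy. The paper trains a secondary neural network for this; here a logistic-regression probe on synthetic data stands in, and all names and data in this sketch are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def expressivity(Z, a, epochs=500, lr=0.5):
    """Lower-bound I(Z; A) in nats with a logistic-regression probe:
    I(Z; A) >= H(A) - H(A|Z), where H(A|Z) is approximated by the
    probe's cross-entropy on (Z, a)."""
    n, d = Z.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):  # plain gradient descent on the probe
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ w + b, -30, 30)))
        g = p - a
        w -= lr * (Z.T @ g) / n
        b -= lr * g.mean()
    p = np.clip(1.0 / (1.0 + np.exp(-np.clip(Z @ w + b, -30, 30))),
                1e-9, 1 - 1e-9)
    cond_ent = -np.mean(a * np.log(p) + (1 - a) * np.log(1 - p))  # ~ H(A|Z)
    q = a.mean()
    marg_ent = -(q * np.log(q) + (1 - q) * np.log(1 - q))         # H(A)
    return marg_ent - cond_ent

# Synthetic check: an embedding that encodes the attribute scores higher.
rng = np.random.default_rng(0)
a = rng.integers(0, 2, 400).astype(float)
Z_enc = rng.normal(size=(400, 8))
Z_enc[:, 0] += 3.0 * (2.0 * a - 1.0)  # attribute written into one dimension
Z_rnd = rng.normal(size=(400, 8))     # attribute not encoded at all
print(expressivity(Z_enc, a), expressivity(Z_rnd, a))
```

A layer-wise or epoch-wise version of the paper's analysis would simply evaluate this quantity on the embeddings extracted at each depth or checkpoint and compare the resulting curves per attribute.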