Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

arXiv cs.CV / 4/17/2026



Key Points

The paper introduces STFER (Semantic-driven Token Filtering and Expert Routing) for any-time person re-identification under large modality shifts (RGB/IR) and major clothing changes.
STFER uses Large Vision-Language Models (LVLMs) to generate identity-consistent semantic text that encodes identity-discriminative information about biometric constants.
It applies this semantic text in two mechanisms: Semantic-driven Visual Token Filtering (SVTF) to emphasize informative visual regions while suppressing background noise, and Semantic-driven Expert Routing (SER) to improve multi-scenario gating.
Experiments on the AT-USTC dataset show state-of-the-art performance, and a model trained on AT-USTC generalizes strongly to five widely used ReID benchmarks.
The authors state that the code will be released soon, enabling further research and replication.
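
To make the SVTF idea from the key points concrete, here is a minimal sketch of text-guided token filtering. It is not the authors' implementation, which has not been released. It assumes each visual token and the text token are plain embedding vectors in a shared space, scores tokens by cosine similarity to the text token, and keeps a fixed fraction of them. The names `cosine`, `filter_visual_tokens`, and `keep_ratio` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero vectors


def filter_visual_tokens(visual_tokens, text_token, keep_ratio=0.5):
    """Keep the visual tokens most similar to the semantic text token.

    Assumption: similarity to the text token is a proxy for how
    identity-relevant a visual region is. The paper may score tokens differently.
    Returns the kept indices (in original order) and the kept tokens.
    """
    scores = [cosine(tok, text_token) for tok in visual_tokens]
    k = max(1, math.ceil(keep_ratio * len(visual_tokens)))
    # Rank token indices by score, highest first, and take the top k.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)  # restore spatial order of the surviving tokens
    return kept, [visual_tokens[i] for i in kept]


# Toy example: four 2-D "patch" tokens and one text token.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text = [1.0, 0.0]
kept_idx, kept_tokens = filter_visual_tokens(tokens, text, keep_ratio=0.5)
print(kept_idx)
```

In this toy input, the tokens at indices 0 and 2 point in nearly the same direction as the text token, so they survive, while the orthogonal and opposite tokens (standing in for background) are dropped. A hard top-k cut is only one option; a soft reweighting by similarity would also match the description of enhancing informative regions.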

Abstract

Any-Time Person Re-identification (AT-ReID) requires robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and extensive clothing-change scenarios, ranging from short-term to long-term intervals. However, existing methods rely heavily on purely visual features, which are prone to change with environmental and temporal factors, resulting in significant performance deterioration under illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants for semantic-driven modeling. The text token is further used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. Meanwhile, the text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated on five widely used ReID benchmarks, demonstrating superior generalization capabilities with highly competitive results. Our code will be available soon.
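
The SER component described in the abstract can be illustrated with a small mixture-of-experts gate whose input includes the text feature. This is a sketch under assumptions, not the paper's architecture: it assumes the gate is a single linear layer over the concatenated visual and text features, that all experts are evaluated, and that their outputs are mixed by softmax weights. The names `softmax`, `route`, and `gate_weights` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(visual_feat, text_feat, gate_weights, experts):
    """Mix expert outputs using a gate conditioned on visual and text features.

    Assumption: the gate sees the concatenation [visual_feat, text_feat], so
    the semantic text can influence which expert dominates.
    gate_weights has one row per expert, each of length
    len(visual_feat) + len(text_feat).
    """
    gate_input = visual_feat + text_feat  # list concatenation
    logits = [sum(w * x for w, x in zip(row, gate_input)) for row in gate_weights]
    probs = softmax(logits)
    outputs = [expert(visual_feat) for expert in experts]
    dim = len(outputs[0])
    # Weighted sum of expert outputs, one output dimension at a time.
    mixed = [sum(p * out[d] for p, out in zip(probs, outputs)) for d in range(dim)]
    return probs, mixed


# Toy example: two experts, gate driven only by the text feature.
experts = [lambda v: [2.0 * x for x in v], lambda v: [0.0 for _ in v]]
gate_weights = [[0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
probs, mixed = route([1.0, 2.0], [1.0, 0.0], gate_weights, experts)
print(probs, mixed)
```

Here the gate rows ignore the visual dimensions, so the text feature alone decides the routing and the first expert receives almost all of the weight. A sparse top-k gate, common in mixture-of-experts layers, would be a natural variant.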