Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search

arXiv cs.CV / 4/28/2026


Key Points

  • The paper targets text-based person anomaly search in surveillance archives, noting that pose-aware methods still suffer from a fundamental Pose-Semantic Gap where different actions can look geometrically similar.
  • It argues that although Multimodal LLMs could resolve part of this ambiguity, they are too computationally expensive for large-scale retrieval.
  • The proposed Structure-Semantic Decoupled Cascade (SSDC) framework splits retrieval into two stages: structure-aware coarse filtering using skeletal similarity, followed by multi-agent semantic verification.
  • The “Detective Squad” multi-agent system includes a Detective for binary candidate filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis, after which candidates are re-ranked by combining synthesized captions with structural priors.
  • Experiments on the PAB benchmark report state-of-the-art performance, balancing retrieval efficiency with stronger semantic reasoning.
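The two-stage split described in the points above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes poses are already encoded as fixed-length vectors, uses cosine similarity as a stand-in for the skeletal similarity measure, and treats the expensive semantic check as an opaque callback. The names `coarse_filter`, `cascade`, and `verify` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def coarse_filter(query_vec, gallery, k):
    """Stage 1 (assumed form): keep the k gallery items closest to the query.

    gallery maps an item id to its pose vector; returns (id, score) pairs,
    best first.
    """
    scored = [(cosine(query_vec, vec), gid) for gid, vec in gallery.items()]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(gid, score) for score, gid in scored[:k]]

def cascade(query_vec, gallery, k, verify):
    """Run the cheap filter first, then call the costly verifier only on the shortlist.

    verify is a hypothetical callback standing in for the semantic
    verification stage; it returns True to keep a candidate.
    """
    shortlist = coarse_filter(query_vec, gallery, k)
    return [(gid, score) for gid, score in shortlist if verify(gid)]
```

The point of the design is cost control: the verifier runs at most `k` times per query regardless of archive size, so `k` sets the trade-off between recall of the coarse stage and the compute spent on semantic reasoning.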

Abstract

Text-based person anomaly search retrieves specific behavioral events from surveillance archives using natural-language queries. Although recent pose-aware methods align geometric structures well, they face a fundamental Pose-Semantic Gap: semantically different actions can share similar skeletal geometries. While Multimodal Large Language Models (MLLMs) can reduce this ambiguity, using them for large-scale retrieval is computationally prohibitive. We propose the Structure-Semantic Decoupled Cascade (SSDC) framework, which decouples retrieval into two stages: (1) Structure-Aware Coarse Retrieval, where a lightweight model quickly filters candidates by skeletal similarity; and (2) Detective Squad Interaction, a multi-agent semantic verification module. The squad consists of a Detective for fast binary filtering, an Analyst for evidence extraction, and a Writer for semantic synthesis. Finally, we re-rank candidates by fusing the synthesized captions with structural priors. Experiments on the PAB benchmark show that SSDC achieves state-of-the-art performance by balancing efficiency and semantic reasoning.
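The abstract does not state how synthesized captions and structural priors are fused, so the following is only one plausible reading: a weighted sum of a caption-to-query match score and the structural score from the coarse stage. The weight `alpha`, the function names, and the token-overlap (Jaccard) text score are all assumptions made for illustration; a real system would more likely use a learned text-matching model.

```python
def jaccard(a, b):
    """Token-overlap similarity between two strings (a crude stand-in text matcher)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def rerank(query, candidates, alpha=0.5):
    """Re-rank candidates by a weighted sum of semantic and structural scores.

    candidates: list of (id, synthesized_caption, structural_score).
    alpha (assumed hyperparameter): weight on the caption-query match;
    1 - alpha goes to the structural prior. Returns ids, best first.
    """
    fused = [
        (alpha * jaccard(query, caption) + (1 - alpha) * structural, cid)
        for cid, caption, structural in candidates
    ]
    fused.sort(key=lambda t: t[0], reverse=True)
    return [cid for _, cid in fused]
```

For example, with the query "person falling down stairs", a candidate captioned "person walking down stairs" may have the higher structural score because the poses look alike, yet a candidate whose caption matches the query exactly can overtake it once the semantic term is weighted in. Setting `alpha=0` reduces the ranking to the structural prior alone.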
