When Identities Collapse: A Stress-Test Benchmark for Multi-Subject Personalization

arXiv cs.AI / March 30, 2026


Key Points

  • The paper argues that existing evaluation for subject-driven text-to-image diffusion models overestimates performance because global CLIP metrics miss local “identity collapse” and multi-subject entanglement failure modes.
  • It identifies an “Illusion of Scalability”: models handle 2–4 subjects in simple layouts but catastrophically degrade when scaled to 6–10 subjects or asked to render complex physical relationships such as occlusion and direct interaction.
  • To stress-test this issue, the authors build a benchmark of 75 prompts spanning different subject counts and interaction difficulty levels: Neutral, Occlusion, and Interaction.
  • They introduce a new metric, Subject Collapse Rate (SCR), grounded in DINOv2 structural priors, which detects and strictly penalizes identity homogenization caused by local attention leakage.
  • Results across several state-of-the-art models show identity fidelity sharply declines with increasing scene complexity, with SCR approaching 100% at 10 subjects, and the authors attribute this to semantic shortcuts from global attention routing.
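The 75-prompt benchmark grid can be sketched as a product of subject counts and difficulty levels. The paper only reports the total of 75 prompts and the three difficulty levels, so the exact split below (5 subject counts × 3 levels × 5 scene templates) and all template wordings are our assumptions, not the authors' prompt set:

```python
from itertools import product

# Hypothetical reconstruction of the benchmark grid: 5 counts x 3 levels
# x 5 scene templates = 75 prompts. Counts and templates are assumed.
SUBJECT_COUNTS = [2, 4, 6, 8, 10]
DIFFICULTIES = ["Neutral", "Occlusion", "Interaction"]
TEMPLATES = [
    "{n} people standing in a park ({mode})",
    "{n} people seated at a dinner table ({mode})",
    "{n} people crossing a city street ({mode})",
    "{n} people gathered in an office ({mode})",
    "{n} people playing at a beach ({mode})",
]

prompts = [
    template.format(n=n, mode=mode)
    for n, mode, template in product(SUBJECT_COUNTS, DIFFICULTIES, TEMPLATES)
]
print(len(prompts))  # 75
```

Any split of this shape stresses both axes the paper cares about: subject count (the scalability axis) and interaction difficulty (the entanglement axis).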

Abstract

Subject-driven text-to-image diffusion models have achieved remarkable success in preserving single identities, yet their ability to compose multiple interacting subjects remains largely unexplored and highly challenging. Existing evaluation protocols typically rely on global CLIP metrics, which are insensitive to local identity collapse and fail to capture the severity of multi-subject entanglement. In this paper, we identify a pervasive "Illusion of Scalability" in current models: while they excel at synthesizing 2-4 subjects in simple layouts, they suffer from catastrophic identity collapse when scaled to 6-10 subjects or tasked with complex physical interactions. To systematically expose this failure mode, we construct a rigorous stress-test benchmark comprising 75 prompts distributed across varying subject counts and interaction difficulties (Neutral, Occlusion, Interaction). Furthermore, we demonstrate that standard CLIP-based metrics are fundamentally flawed for this task, as they often assign high scores to semantically correct but identity-collapsed images (e.g., generating generic clones). To address this, we introduce the Subject Collapse Rate (SCR), a novel evaluation metric grounded in DINOv2's structural priors, which strictly penalizes local attention leakage and homogenization. Our extensive evaluation of state-of-the-art models (MOSAIC, XVerse, PSR) reveals a precipitous drop in identity fidelity as scene complexity grows, with SCR approaching 100% at 10 subjects. We trace this collapse to the semantic shortcuts inherent in global attention routing, underscoring the urgent need for explicit physical disentanglement in future generative architectures.
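The core idea behind SCR, as described in the abstract, is to flag images whose subjects have homogenized into near-identical “clones” in a structure-aware feature space. A minimal sketch of that idea, assuming per-subject crops have already been embedded with a feature extractor such as DINOv2 (the function name, the cosine-similarity formulation, and the threshold are our simplifications, not the paper's definition):

```python
import numpy as np

def subject_collapse_rate(subject_embeds, threshold=0.85):
    """Fraction of subject pairs whose feature similarity exceeds a
    collapse threshold. Simplified illustration of an SCR-style metric;
    `subject_embeds` is an (n_subjects, d) array of per-subject features,
    e.g. pooled DINOv2 embeddings of each subject's crop.
    """
    X = np.asarray(subject_embeds, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # cosine-normalize rows
    sim = X @ X.T                                     # pairwise cosine sim
    iu = np.triu_indices(len(X), k=1)                 # each distinct pair once
    collapsed = sim[iu] > threshold
    return float(collapsed.mean()) if collapsed.size else 0.0

# Distinct identities -> low SCR; near-duplicate "clones" -> high SCR.
rng = np.random.default_rng(0)
distinct = rng.normal(size=(6, 768))
clones = np.tile(rng.normal(size=(1, 768)), (6, 1)) + 0.01 * rng.normal(size=(6, 768))
print(subject_collapse_rate(distinct))  # 0.0
print(subject_collapse_rate(clones))    # 1.0
```

This toy version also shows why global CLIP scores miss the failure: a scene of six clones can still match the prompt text well, while its pairwise subject similarity is saturated.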
