Disentangling to Re-couple: Resolving the Similarity-Controllability Paradox in Subject-Driven Text-to-Image Generation

arXiv cs.CV / 4/2/2026


Key Points

  • The paper targets Subject-Driven Text-to-Image generation, focusing on the “similarity–controllability paradox” where improving textual control can harm subject identity (and vice versa).
  • It argues the paradox arises because text prompts often mix instructions for both subject identity and context edits, creating conflicting signals during generation.
  • The proposed DisCo framework disentangles the visual and textual roles: subject identity is extracted solely from the reference image (keyed by the subject's entity word), while the prompt is simplified to the modification command alone, with the subject referred to by a generic pronoun (see the sketch after this list).
  • Because this strict separation can yield unnatural subject–context compositions, the method introduces a dedicated reward signal and uses reinforcement learning to re-couple the subject's identity with the generated context.
  • Experiments report state-of-the-art results, achieving both high-fidelity subject preservation and precise textual control in generated images.
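
A toy, input-level illustration of the decoupling idea. The paper's module presumably operates on embeddings inside the model; here `DecoupledCondition`, `decouple`, and the example prompt are all hypothetical, showing only how a mixed prompt splits into a visual identity source and a pronoun-only edit command:

```python
from dataclasses import dataclass

@dataclass
class DecoupledCondition:
    """Separated conditioning signals: identity from the image, edits from text."""
    identity_image: str   # path to the reference image (sole source of identity)
    entity_word: str      # the subject's entity word, e.g. "dog"
    edit_prompt: str      # modification command only; subject replaced by a pronoun

def decouple(reference_image: str, entity_word: str,
             raw_prompt: str, subject_phrase: str) -> DecoupledCondition:
    """Strip the subject's descriptive phrase from the prompt and replace it
    with a generic pronoun, so the text carries only the modification command."""
    edit_prompt = raw_prompt.replace(subject_phrase, "it")
    return DecoupledCondition(reference_image, entity_word, edit_prompt)

cond = decouple(
    reference_image="ref_dog.png",
    entity_word="dog",
    raw_prompt="a photo of my fluffy golden retriever surfing a wave at sunset",
    subject_phrase="my fluffy golden retriever",
)
print(cond.edit_prompt)  # -> "a photo of it surfing a wave at sunset"
```

With the descriptive phrase gone, the text can no longer contradict the image about what the subject looks like, which is the point of the decoupling.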

Abstract

Subject-Driven Text-to-Image (T2I) Generation aims to preserve a subject's identity while editing its context based on a text prompt. A core challenge in this task is the "similarity-controllability paradox", where enhancing textual control often degrades the subject's fidelity, and vice versa. We argue this paradox stems from the ambiguous role of text prompts, which are often tasked with describing both the subject and the desired modifications, leading to conflicting signals for the model. To resolve this, we propose DisCo, a novel framework that first Disentangles and then re-Couples visual and textual information. First, our textual-visual decoupling module isolates the sources of information: subject identity is extracted exclusively from the reference image together with the subject's entity word, while the text prompt is simplified to contain only the modification command, with the subject referred to by generic pronouns, eliminating descriptive ambiguity. However, this strict separation can lead to unnatural compositions between the subject and its context. We address this by designing a dedicated reward signal and using reinforcement learning to seamlessly re-couple the visually-defined subject and the textually-generated context. Our approach effectively resolves the paradox, enabling simultaneous high-fidelity subject preservation and precise textual control. Extensive experiments demonstrate that our method achieves state-of-the-art performance, producing highly realistic and coherent images.
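
The abstract does not specify the reward design or the RL algorithm. A minimal REINFORCE-style sketch of what the re-coupling step could look like, assuming a toy generator and a placeholder composition reward (`ToyGenerator`, `sample_with_logprobs`, and `composition_reward` are illustrative stand-ins, not the paper's actual components):

```python
import torch

class ToyGenerator(torch.nn.Module):
    """Stand-in for the T2I generator: maps a condition vector to an 'image'
    and reports the log-probability of the sampled output (hypothetical API)."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = torch.nn.Linear(dim, dim)

    def sample_with_logprobs(self, cond):
        mean = self.net(cond)
        dist = torch.distributions.Normal(mean, 1.0)
        images = dist.sample()
        return images, dist.log_prob(images).sum(dim=-1)

def composition_reward(images, cond):
    """Placeholder for the paper's dedicated reward, which scores how
    naturally the visually-defined subject sits in the generated context;
    here it is just a similarity between output and condition."""
    return -((images - cond) ** 2).mean(dim=-1)

def recouple_rl_step(generator, cond_batch, optimizer):
    """One policy-gradient update pushing the generator toward samples the
    reward judges as coherent subject-context compositions."""
    images, log_probs = generator.sample_with_logprobs(cond_batch)
    with torch.no_grad():
        rewards = composition_reward(images, cond_batch)
        advantages = rewards - rewards.mean()   # simple mean baseline
    loss = -(advantages * log_probs).mean()     # REINFORCE surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()

gen = ToyGenerator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
print(recouple_rl_step(gen, torch.randn(8, 16), opt))
```

The key design idea the sketch preserves is that the reward targets the subject-context seam rather than either signal alone, which is how the method can re-couple what the decoupling module deliberately pulled apart.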