Let Triggers Control: Frequency-Aware Dropout for Effective Token Control

arXiv cs.CV / 3/31/2026


Key Points

  • The paper identifies a controllability problem in LoRA-based text-to-image personalization where a single trigger token fails to reliably evoke the intended concept due to entangled representations.
  • It attributes the issue to frequent co-occurrence between the trigger token and surrounding prompt context during fine-tuning, which undermines the token’s semantic distinctiveness.
  • The authors propose Frequency-Aware Dropout (FAD), a parameter-free regularization method that uses co-occurrence analysis and curriculum-inspired scheduling to reduce this entanglement.
  • Experiments across token-based diffusion models (Stable Diffusion 1.5, SDXL) and natural-language-driven backbones (FLUX, Qwen-Image) show improved prompt controllability, fidelity, stylistic precision, and user-perceived quality.
  • The approach delivers consistent gains without architectural changes or additional parameters, aiming for easy adoption with low extra computational cost.

Abstract

Text-to-image models such as Stable Diffusion have achieved unprecedented levels of high-fidelity visual synthesis. As these models advance, personalization of generative models, commonly facilitated through Low-Rank Adaptation (LoRA) with a dedicated trigger token, has become a significant area of research. Previous works have naively assumed that fine-tuning with a single trigger token suffices to represent new concepts. However, this often results in poor controllability, where the trigger token alone fails to reliably evoke the intended concept. We attribute this issue to the frequent co-occurrence of the trigger token with the surrounding context during fine-tuning, which entangles their representations and compromises the token's semantic distinctiveness. To disentangle these representations, we propose Frequency-Aware Dropout (FAD), a novel regularization technique that improves prompt controllability without adding new parameters. FAD consists of two key components: co-occurrence analysis and curriculum-inspired scheduling. Qualitative and quantitative analyses across token-based diffusion models (SD 1.5 and SDXL) and natural-language-driven backbones (FLUX and Qwen-Image) demonstrate consistent gains in prompt fidelity, stylistic precision, and user-perceived quality. Our method provides a simple yet effective dropout strategy that enhances controllability and personalization in text-to-image generation. Notably, it achieves these improvements without introducing additional parameters or architectural modifications, making it readily applicable to existing models with minimal computational overhead.
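The abstract names FAD's two components (co-occurrence analysis and curriculum-inspired scheduling) without giving implementation details. The sketch below is one plausible reading, not the paper's actual algorithm: context tokens are dropped from the training caption with a probability that scales with how often they co-occur with the trigger token, and the dropout strength ramps up over training. The function name, the linear schedule, and the `max_drop` parameter are all assumptions for illustration.

```python
import random

def frequency_aware_dropout(prompt_tokens, trigger, cooccurrence,
                            step, total_steps, max_drop=0.5, rng=random):
    """Sketch of frequency-aware context-token dropout (assumed details).

    cooccurrence: dict mapping each context token to how often it appears
    alongside `trigger` in the fine-tuning captions. Tokens that co-occur
    most often are the likeliest sources of entanglement, so they are
    dropped with the highest probability.
    """
    # Curriculum-inspired schedule (assumed linear): dropout strength
    # grows from 0 at the start of training to max_drop at the end.
    schedule = step / max(1, total_steps)
    max_count = max(cooccurrence.values()) if cooccurrence else 1

    kept = []
    for tok in prompt_tokens:
        if tok == trigger:
            kept.append(tok)  # the trigger token itself is never dropped
            continue
        freq = cooccurrence.get(tok, 0) / max_count  # normalized co-occurrence
        if rng.random() >= max_drop * schedule * freq:
            kept.append(tok)
    return kept

# Hypothetical usage: "sks" is the trigger; "man" co-occurs with it often
# and is therefore the most likely token to be dropped late in training.
cooc = {"man": 10, "park": 2}
tokens = ["sks", "man", "in", "a", "park"]
print(frequency_aware_dropout(tokens, "sks", cooc, step=900, total_steps=1000))
```

The per-caption filtering would run inside the fine-tuning loop before tokenizing the caption, which is consistent with the abstract's claim of no new parameters or architectural changes.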
