PersonalQ: Select, Quantize, and Serve Personalized Diffusion Models for Efficient Inference

arXiv cs.AI / 3/25/2026


Key Points

  • The paper introduces PersonalQ, a unified framework for efficiently serving personalized diffusion-model checkpoint repositories by addressing both request ambiguity and fidelity loss from naive quantization.
  • PersonalQ’s checkpoint “check-in” stage uses intent-aware hybrid retrieval plus LLM-based reranking, and it asks a short clarification question only when multiple checkpoint intents remain plausible.
  • It rewrites user prompts by inserting a selected checkpoint’s canonical “trigger token,” creating a shared signal that links selection to downstream processing.
  • The accompanying Trigger-Aware Quantization (TAQ) applies mixed precision in cross-attention, keeping trigger-conditioned key/value rows (and their attention weights) at high precision while aggressively quantizing the remaining paths for better memory efficiency.
  • Experiments indicate improved intent alignment versus retrieval/reranking baselines and a better compression–quality trade-off than prior diffusion post-training quantization methods, supporting scalable deployment of personalized checkpoints.
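The check-in flow described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the `Checkpoint` dataclass, the keyword-overlap `hybrid_score` (standing in for the paper's hybrid sparse/dense retrieval and LLM reranking), and the `AMBIGUITY_MARGIN` clarification gate are all assumptions for exposition.

```python
# Hypothetical sketch of a PersonalQ-style "check-in": score checkpoints,
# ask a clarification question only when the top candidates are close,
# otherwise rewrite the prompt with the winner's canonical trigger token.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    name: str
    trigger: str      # canonical trigger token, e.g. "<sks-corgi>"
    description: str  # free-text concept description used for retrieval

AMBIGUITY_MARGIN = 0.05  # if the top-2 scores are this close, clarify

def hybrid_score(query: str, ckpt: Checkpoint) -> float:
    """Toy stand-in for hybrid retrieval: keyword overlap only.
    A real system would combine sparse and dense scores, then rerank
    with an LLM over the checkpoint's context."""
    q_tokens = set(query.lower().split())
    d_tokens = set(ckpt.description.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def check_in(query: str, repo: list[Checkpoint]):
    ranked = sorted(repo, key=lambda c: hybrid_score(query, c), reverse=True)
    top, runner_up = ranked[0], ranked[1]
    if hybrid_score(query, top) - hybrid_score(query, runner_up) < AMBIGUITY_MARGIN:
        # Multiple intents remain plausible: ask a short clarification question.
        return None, f"Did you mean '{top.name}' or '{runner_up.name}'?"
    # Unambiguous: rewrite the prompt by inserting the selected trigger token,
    # the shared signal that downstream quantization keys on.
    return f"{top.trigger} {query}", None

repo = [
    Checkpoint("corgi-v2", "<sks-corgi>", "my pet corgi dog photos"),
    Checkpoint("mug-art", "<sks-mug>", "ceramic mug product shots"),
]
prompt, question = check_in("a photo of my corgi dog at the beach", repo)
# prompt → "<sks-corgi> a photo of my corgi dog at the beach"
```

The clarification gate captures the paper's stated behavior of asking only when several checkpoint intents remain plausible; everything else is scaffolding.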

Abstract

Personalized text-to-image generation lets users fine-tune diffusion models into repositories of concept-specific checkpoints, but serving these repositories efficiently is difficult for two reasons: natural-language requests are often ambiguous and can be misrouted to visually similar checkpoints, and standard post-training quantization can distort the fragile representations that encode personalized concepts. We present PersonalQ, a unified framework that connects checkpoint selection and quantization through a shared signal -- the checkpoint's trigger token. Check-in performs intent-aligned selection by combining intent-aware hybrid retrieval with LLM-based reranking over checkpoint context and asks a brief clarification question only when multiple intents remain plausible; it then rewrites the prompt by inserting the selected checkpoint's canonical trigger. Complementing this, Trigger-Aware Quantization (TAQ) applies trigger-aware mixed precision in cross-attention, preserving trigger-conditioned key/value rows (and their attention weights) while aggressively quantizing the remaining pathways for memory-efficient inference. Experiments show that PersonalQ improves intent alignment over retrieval and reranking baselines, while TAQ consistently offers a stronger compression-quality trade-off than prior diffusion PTQ methods, enabling scalable serving of personalized checkpoints without sacrificing fidelity.
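The mixed-precision idea behind TAQ can be illustrated with a toy per-row quantizer. Everything here is an assumption for exposition: the specific bit widths (int4 for most rows, fp16 for trigger rows), the symmetric per-row scheme, and the function names are not taken from the paper; the sketch only shows how preserving trigger-conditioned key rows keeps their error much smaller than aggressively quantized rows.

```python
# Toy sketch of trigger-aware mixed precision on a cross-attention key matrix:
# rows conditioned on the trigger token stay at high precision (fp16 here),
# the rest are quantized aggressively (int4 here). Bit widths and names are
# illustrative assumptions, not the paper's actual TAQ configuration.
import numpy as np

def quantize_rows(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-row uniform quantization, returned dequantized
    so errors can be compared against the original values."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero rows
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def trigger_aware_quantize(keys: np.ndarray, trigger_rows: list[int]) -> np.ndarray:
    """Quantize all rows to int4, then restore trigger-conditioned rows
    at fp16 precision (simulated by a round-trip cast)."""
    out = quantize_rows(keys, bits=4)
    out[trigger_rows] = keys[trigger_rows].astype(np.float16)
    return out

rng = np.random.default_rng(0)
K = rng.standard_normal((8, 16)).astype(np.float32)  # tokens x dim key matrix
K_q = trigger_aware_quantize(K, trigger_rows=[2])    # pretend row 2 is the trigger
err_trigger = np.abs(K_q[2] - K[2]).max()  # tiny fp16 rounding error
err_other = np.abs(K_q[0] - K[0]).max()    # much larger int4 error
```

The same pattern would apply to value rows and attention weights; the point is simply that the trigger token gives the quantizer a cheap, selection-derived signal for which rows are fidelity-critical.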