Evolutionary Token-Level Prompt Optimization for Diffusion Models

arXiv cs.AI / 4/14/2026


Key Points

  • The paper addresses how text-to-image diffusion models are highly sensitive to prompt wording and often require manual trial-and-error, motivating automated prompt optimization beyond simple text rewriting.
  • It proposes a model-agnostic method that uses a Genetic Algorithm (GA) to directly evolve the token vectors fed to CLIP-based diffusion models, treating the prompt-conditioning space as an optimization search space.
  • The GA’s fitness function combines aesthetic scoring via LAION Aesthetic Predictor V2 with semantic alignment using CLIPScore between the generated image and the prompt.
  • Experiments on 36 prompts from the Parti Prompts (P2) dataset show the approach outperforms baselines such as Promptist and random search, reaching up to a 23.93% improvement in fitness.
  • The authors claim the framework is modular and extensible for other image generation models that use tokenized text encoders.
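The loop described above — evolve a population of token vectors under a fitness that combines an aesthetic score with a prompt-image alignment score — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real LAION Aesthetic Predictor V2 and CLIPScore are replaced by toy surrogate functions, and all names, weights, and GA hyperparameters here are assumptions.

```python
import random

# Toy surrogates for the paper's two scorers (hypothetical stand-ins,
# not the real models): aesthetic_score ~ LAION Aesthetic Predictor V2,
# clip_score ~ CLIPScore between the generated image and the prompt.
def aesthetic_score(tokens):
    # Reward token values near 1.0 (arbitrary toy objective).
    return -sum((t - 1.0) ** 2 for t in tokens)

def clip_score(tokens):
    # Reward token values near 0.5 (arbitrary toy objective).
    return -sum((t - 0.5) ** 2 for t in tokens)

def fitness(tokens, w=0.5):
    # Weighted combination of aesthetics and alignment, mirroring
    # the paper's composite objective (weight w is an assumption).
    return w * aesthetic_score(tokens) + (1 - w) * clip_score(tokens)

def evolve(dim=8, pop_size=20, generations=50, mut_sigma=0.1, seed=0):
    """Evolve continuous token vectors with elitism, one-point
    crossover, and Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, dim)       # one-point crossover
            child = a[:cut] + b[cut:]
            # Gaussian mutation on every token coordinate.
            child = [t + rng.gauss(0, mut_sigma) for t in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

best = evolve()
```

In the actual system, the evolved vector would replace the text encoder's token embeddings when conditioning the diffusion model, and each fitness evaluation would require generating an image and scoring it.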

Abstract

Text-to-image diffusion models exhibit strong generative performance but remain highly sensitive to prompt formulation, often requiring extensive manual trial and error to obtain satisfactory results. This motivates the development of automated, model-agnostic prompt optimization methods that can systematically explore the conditioning space beyond conventional text rewriting. This work investigates the use of a Genetic Algorithm (GA) for prompt optimization by directly evolving the token vectors employed by CLIP-based diffusion models. The GA optimizes a fitness function that combines aesthetic quality, measured by the LAION Aesthetic Predictor V2, with prompt-image alignment, assessed via CLIPScore. Experiments on 36 prompts from the Parti Prompts (P2) dataset show that the proposed approach outperforms baseline methods, including Promptist and random search, achieving up to a 23.93% improvement in fitness. Overall, the method is adaptable to image generation models with tokenized text encoders and provides a modular framework for future extensions; its limitations and prospects are also discussed.