PrefPO: Pairwise Preference Prompt Optimization

arXiv cs.CL / 3/23/2026


Key Points

  • PrefPO proposes a minimal, RLHF-inspired prompt optimization method that reduces the need for labeled data and hyperparameter tuning, requiring only a starting prompt and natural language criteria.
  • It employs an LLM discriminator to express pairwise preferences over model outputs and feeds feedback to a separate LLM optimizer to iteratively refine prompts.
  • In evaluations on 9 BIG-Bench Hard tasks and IFEval-Hard, PrefPO matches or exceeds SOTA methods on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard, in both labeled and unlabeled settings.
  • It also improves prompt hygiene by reducing length and repetitiveness, lowers susceptibility to prompt hacking compared with TextGrad, and receives higher ratings from both LLM judges and human evaluators.

Abstract

Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
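The discriminator/optimizer loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `generate`, `prefer`, and `refine` are placeholders for the three LLM roles (task model, discriminator, and optimizer), and the loop structure is an assumption based on the abstract's description.

```python
# Hypothetical sketch of a PrefPO-style loop: an LLM discriminator expresses
# pairwise preferences over outputs, and an LLM optimizer refines the prompt
# from that feedback. No ground-truth labels are required.

def prefpo_optimize(prompt, criteria, generate, prefer, refine, rounds=5):
    """Iteratively refine `prompt` using only pairwise preferences.

    generate(prompt) -> model output for the task
    prefer(out_a, out_b, criteria) -> (winner, feedback): the discriminator's
        pairwise preference (0 = current, 1 = candidate) plus a rationale
    refine(prompt, criteria, feedback) -> candidate prompt from the optimizer
    """
    feedback = ""
    for _ in range(rounds):
        candidate = refine(prompt, criteria, feedback)
        out_cur, out_cand = generate(prompt), generate(candidate)
        winner, feedback = prefer(out_cur, out_cand, criteria)
        if winner == 1:  # discriminator preferred the candidate's output
            prompt = candidate
    return prompt


# Toy stubs standing in for the three LLM calls, for demonstration only.
gen = lambda p: p.upper()

def prefer_longer(a, b, criteria):
    return (1, "prefer the longer output") if len(b) > len(a) else (0, "keep current")

def append_refine(p, criteria, feedback):
    return p + (" Be specific." if feedback else " Follow the criteria.")

best = prefpo_optimize("Solve the task.", "clarity", gen, prefer_longer, append_refine, rounds=2)
# → "Solve the task. Follow the criteria. Be specific."
```

Because the discriminator only compares outputs against natural language criteria, the same loop runs identically in labeled and unlabeled settings, which is the property the paper highlights.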