CLPIPS: A Personalized Metric for AI-Generated Image Similarity

arXiv cs.CV / 4/3/2026


Key Points

  • The paper introduces CLPIPS, a personalized extension of the LPIPS image similarity metric designed to better match human judgments of similarity in text-to-image workflows.
  • It argues that existing similarity metrics (e.g., LPIPS, CLIP) can misalign with human perceptions in context-specific, user-driven tasks, motivating metric adaptation.
  • CLPIPS is fine-tuned on human-ranked image pairs with a margin ranking loss, updating only the LPIPS layer-combination weights rather than the full model.
  • Experiments on an iterative human study show improved alignment between metric outputs and human rankings, measured via Spearman rank correlation and intraclass correlation.
  • The authors position similarity metrics as adaptive, human-in-the-loop components that can be improved with lightweight, human-augmented tuning rather than solely chasing better absolute metric scores.
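The fine-tuning recipe in the key points (freeze the feature backbone, train only the layer-combination weights with a margin ranking loss on human-ranked pairs) can be sketched in plain NumPy. This is a minimal illustration, not the authors' implementation: the per-layer distances are synthetic, and the layer count, margin, and learning rate are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-layer LPIPS distances for N human-ranked pairs.
# In each pair, image A was judged MORE similar to the target than
# image B, so the combined distance d(A) should end up below d(B).
L = 5                      # assumed number of feature layers
N = 200
dA = rng.uniform(0.0, 1.0, (N, L))
dB = dA + rng.uniform(0.0, 0.4, (N, L))   # B is farther in every layer

w = np.ones(L) / L         # layer-combination weights: the ONLY trainable part
margin, lr = 0.1, 0.05     # assumed hyperparameters

for _ in range(300):
    sA, sB = dA @ w, dB @ w               # combined distances per pair
    active = (sA - sB + margin) > 0       # pairs that violate the margin
    # subgradient of the mean hinge loss w.r.t. w, over violating pairs only
    grad = (dA[active] - dB[active]).sum(axis=0) / N
    w = np.clip(w - lr * grad, 0.0, None) # keep weights non-negative, as LPIPS does

loss = np.maximum(0.0, dA @ w - dB @ w + margin).mean()
```

Because only the small weight vector `w` is updated, this kind of tuning stays lightweight enough to run on the modest amount of ranking data a single human study produces.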

Abstract

Iterative prompt refinement is central to reproducing target images with text-to-image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context-specific or user-driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric's notion of similarity directly to human judgments. We aim to explore whether lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human-in-the-loop workflows with text-to-image tools. We evaluate CLPIPS on a human-subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using a margin ranking loss on human-ranked image pairs, we fine-tune only the LPIPS layer-combination weights and assess alignment via Spearman rank correlation and the Intraclass Correlation Coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS. Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human-specific fine-tuning can meaningfully enhance perceptual alignment in human-in-the-loop text-to-image workflows.
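The abstract's evaluation compares metric outputs against human rankings via Spearman rank correlation. A minimal sketch of that check, assuming one metric distance per generated image and a corresponding human rank for the same images (ties ignored; the sample values are invented):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes no ties; rank-of-rank via double argsort gives each
    value its 0-based position in sorted order.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

# Hypothetical session: 5 generated images, human ranks (1 = most similar)
# and metric distances (smaller = more similar). Perfect agreement -> rho = 1.
human_ranks   = [1, 2, 3, 4, 5]
metric_dists  = [0.12, 0.18, 0.25, 0.31, 0.40]
rho = spearman_rho(human_ranks, metric_dists)   # -> 1.0
```

In practice one would use `scipy.stats.spearmanr` (which handles ties) and a dedicated ICC implementation for the agreement measure; the point here is only that the evaluation reduces to comparing two rankings per regeneration session.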