PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
arXiv cs.CL / 3/12/2026
Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- PEEM proposes a unified, interpretable framework for evaluating both LLM prompts and responses, built on a nine-axis rubric: three prompt criteria and six response criteria.
- An LLM-based evaluator produces a 1-5 Likert score per criterion, plus a criterion-specific natural-language rationale grounded in the rubric, making the diagnostics actionable (a minimal evaluator sketch follows the list).
- Across seven benchmarks and five task models, PEEM's scores track conventional accuracy closely and preserve model rankings (Spearman rho ~0.97, Pearson r ~0.94, p<0.001; see the correlation check below).
- A multi-evaluator study shows judgments are largely evaluator-agnostic (pairwise rho ~0.68-0.85).
- The framework detects linguistic failure modes under input perturbations and remains robust to paraphrases (76.7-80.6%).
- Rubric feedback drives a zero-shot prompt-refinement loop that improves downstream accuracy by up to 11.7 points (see the loop sketch after the list).
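The paper describes the evaluator pipeline only at a high level, so here is a minimal Python sketch of a rubric-grounded LLM evaluator in that spirit. The `call_llm` stub, the criterion names, and the JSON output schema are all illustrative assumptions, not PEEM's actual prompt or nine axes.

```python
# Minimal sketch of a rubric-grounded LLM evaluator in the spirit of PEEM.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the criterion names below are assumed, not the paper's exact axes.
import json

PROMPT_CRITERIA = ["clarity", "specificity", "context_sufficiency"]      # 3 prompt axes (names assumed)
RESPONSE_CRITERIA = ["accuracy", "relevance", "completeness",
                     "coherence", "conciseness", "faithfulness"]          # 6 response axes (names assumed)

RUBRIC_TEMPLATE = """You are an evaluator. Score the {target} on each criterion
from 1 (poor) to 5 (excellent) and justify each score in one sentence.
Criteria: {criteria}
{target_label}: {text}
Return JSON: {{"scores": {{criterion: int}}, "rationales": {{criterion: str}}}}"""

def call_llm(prompt: str) -> str:
    """Hypothetical chat-completion call; swap in your provider's client."""
    raise NotImplementedError

def evaluate(prompt_text: str, response_text: str) -> dict:
    """Score a prompt and its response on all rubric axes, with rationales."""
    results = {}
    for target, criteria, text in [
        ("prompt", PROMPT_CRITERIA, prompt_text),
        ("response", RESPONSE_CRITERIA, response_text),
    ]:
        raw = call_llm(RUBRIC_TEMPLATE.format(
            target=target, criteria=", ".join(criteria),
            target_label=target.capitalize(), text=text))
        # Assumes the evaluator returns the requested JSON schema.
        results[target] = json.loads(raw)
    return results
```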
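The ranking-agreement claim can be checked mechanically with `scipy.stats`; the per-model numbers below are placeholders for illustration, not the paper's data.

```python
# Sketch of the ranking-agreement check: correlate PEEM-derived scores with
# conventional accuracy across task models.
from scipy.stats import pearsonr, spearmanr

# One entry per task model (hypothetical values for illustration).
peem_scores   = [4.1, 3.6, 4.4, 3.2, 3.9]       # mean rubric score per model
conv_accuracy = [0.78, 0.65, 0.83, 0.58, 0.71]  # benchmark accuracy per model

rho, rho_p = spearmanr(peem_scores, conv_accuracy)  # rank agreement (~0.97 reported)
r, r_p = pearsonr(peem_scores, conv_accuracy)       # linear agreement (~0.94 reported)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3g}), Pearson r={r:.2f} (p={r_p:.3g})")
```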
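Finally, a sketch of how the zero-shot prompt-refinement loop might look, reusing the hypothetical `evaluate` and `call_llm` helpers above. The stopping threshold, round count, and rewrite prompt are arbitrary choices for illustration, not the paper's procedure.

```python
# Sketch of a zero-shot prompt-refinement loop driven by rubric feedback,
# in the spirit of PEEM's reported up-to-11.7-point accuracy gains.

REWRITE_TEMPLATE = """Rewrite the prompt below to fix the weaknesses noted.
Prompt: {prompt}
Per-criterion feedback: {feedback}
Return only the improved prompt."""

def refine_prompt(prompt_text: str, response_text: str, rounds: int = 3,
                  threshold: float = 4.0) -> str:
    """Iteratively rewrite a prompt until its mean rubric score clears a threshold."""
    for _ in range(rounds):
        report = evaluate(prompt_text, response_text)
        scores = report["prompt"]["scores"]
        if sum(scores.values()) / len(scores) >= threshold:
            break  # prompt axes already score well; stop refining
        prompt_text = call_llm(REWRITE_TEMPLATE.format(
            prompt=prompt_text, feedback=report["prompt"]["rationales"]))
        # Re-run the task model (here the same stand-in client) on the new prompt.
        response_text = call_llm(prompt_text)
    return prompt_text
```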