PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
arXiv cs.CL / 3/12/2026
Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- PEEM proposes a unified, interpretable framework for evaluating both LLM prompts and responses, built on a nine-axis rubric: three prompt criteria and six response criteria.
- An LLM-based evaluator produces a 1-5 Likert score per criterion, plus a criterion-specific natural-language rationale grounded in the rubric, making the diagnostics actionable (a minimal evaluator sketch follows the list).
- Across seven benchmarks and five task models, PEEM's scores track conventional accuracy closely and preserve model rankings (Spearman rho ~0.97, Pearson r ~0.94, p<0.001; see the correlation check below).
- A multi-evaluator study shows judgments are largely evaluator-agnostic (pairwise rho ~0.68-0.85).
- The framework detects linguistic failure modes under input perturbations and remains robust to paraphrases (76.7-80.6%).
- Rubric feedback drives a zero-shot prompt-refinement loop that improves downstream accuracy by up to 11.7 points (see the loop sketch after the list).
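The paper describes the evaluator pipeline only at a high level, so here is a minimal Python sketch of a rubric-grounded LLM evaluator in that spirit. The `call_llm` stub, the criterion names, and the JSON output schema are all illustrative assumptions, not PEEM's actual prompt or nine axes.

```python
# Minimal sketch of a rubric-grounded LLM evaluator in the spirit of PEEM.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the criterion names below are assumed, not the paper's exact axes.
import json

PROMPT_CRITERIA = ["clarity", "specificity", "context_sufficiency"]      # 3 prompt axes (names assumed)
RESPONSE_CRITERIA = ["accuracy", "relevance", "completeness",
                     "coherence", "conciseness", "faithfulness"]          # 6 response axes (names assumed)

RUBRIC_TEMPLATE = """You are an evaluator. Score the {target} on each criterion
from 1 (poor) to 5 (excellent) and justify each score in one sentence.
Criteria: {criteria}
{target_label}: {text}
Return JSON: {{"scores": {{criterion: int}}, "rationales": {{criterion: str}}}}"""

def call_llm(prompt: str) -> str:
    """Hypothetical chat-completion call; swap in your provider's client."""
    raise NotImplementedError

def evaluate(prompt_text: str, response_text: str) -> dict:
    """Score a prompt and its response on all rubric axes, with rationales."""
    results = {}
    for target, criteria, text in [
        ("prompt", PROMPT_CRITERIA, prompt_text),
        ("response", RESPONSE_CRITERIA, response_text),
    ]:
        raw = call_llm(RUBRIC_TEMPLATE.format(
            target=target, criteria=", ".join(criteria),
            target_label=target.capitalize(), text=text))
        # Assumes the evaluator returns the requested JSON schema.
        results[target] = json.loads(raw)
    return results
```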
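The ranking-agreement claim can be checked mechanically with `scipy.stats`; the per-model numbers below are placeholders for illustration, not the paper's data.

```python
# Sketch of the ranking-agreement check: correlate PEEM-derived scores with
# conventional accuracy across task models.
from scipy.stats import pearsonr, spearmanr

# One entry per task model (hypothetical values for illustration).
peem_scores   = [4.1, 3.6, 4.4, 3.2, 3.9]       # mean rubric score per model
conv_accuracy = [0.78, 0.65, 0.83, 0.58, 0.71]  # benchmark accuracy per model

rho, rho_p = spearmanr(peem_scores, conv_accuracy)  # rank agreement (~0.97 reported)
r, r_p = pearsonr(peem_scores, conv_accuracy)       # linear agreement (~0.94 reported)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3g}), Pearson r={r:.2f} (p={r_p:.3g})")
```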
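Finally, a sketch of how the zero-shot prompt-refinement loop might look, reusing the hypothetical `evaluate` and `call_llm` helpers above. The stopping threshold, round count, and rewrite prompt are arbitrary choices for illustration, not the paper's procedure.

```python
# Sketch of a zero-shot prompt-refinement loop driven by rubric feedback,
# in the spirit of PEEM's reported up-to-11.7-point accuracy gains.

REWRITE_TEMPLATE = """Rewrite the prompt below to fix the weaknesses noted.
Prompt: {prompt}
Per-criterion feedback: {feedback}
Return only the improved prompt."""

def refine_prompt(prompt_text: str, response_text: str, rounds: int = 3,
                  threshold: float = 4.0) -> str:
    """Iteratively rewrite a prompt until its mean rubric score clears a threshold."""
    for _ in range(rounds):
        report = evaluate(prompt_text, response_text)
        scores = report["prompt"]["scores"]
        if sum(scores.values()) / len(scores) >= threshold:
            break  # prompt axes already score well; stop refining
        prompt_text = call_llm(REWRITE_TEMPLATE.format(
            prompt=prompt_text, feedback=report["prompt"]["rationales"]))
        # Re-run the task model (here the same stand-in client) on the new prompt.
        response_text = call_llm(prompt_text)
    return prompt_text
```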