PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
arXiv cs.CL / 3/12/2026
Key Points
- PEEM proposes a unified, interpretable evaluation framework for both prompts and responses in LLMs, built on a 9-axis rubric comprising three prompt criteria and six response criteria (a rough encoding is sketched after this list).
- An LLM-based evaluator produces a 1-5 Likert score and a criterion-specific natural-language rationale grounded in the rubric for each axis, which makes the diagnostics actionable.
- Across seven benchmarks and five task models, PEEM's accuracy scores align closely with conventional accuracy and preserve model rankings (Spearman ~0.97, Pearson ~0.94, p < 0.001; see the correlation check below).
- A multi-evaluator study shows the judgments are largely evaluator-agnostic (pairwise rho ~0.68-0.85). The framework also detects linguistic failure modes under perturbations, remains robust to paraphrases (76.7-80.6%), and supports a zero-shot prompting loop (sketched below) that can improve downstream accuracy by up to 11.7 points.
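
The rubric and the evaluator's output format can be pictured roughly as follows. This is a minimal sketch, not the paper's implementation: the criterion names, the `CriterionJudgment` fields, and the JSON shape assumed by `parse_judgments` are placeholders, since the summary only gives the counts (three prompt criteria, six response criteria) and the 1-5 score plus rationale format.

```python
from dataclasses import dataclass

# Placeholder criterion names: the paper defines 3 prompt-side and 6
# response-side axes, but this summary does not name them.
PROMPT_CRITERIA = ["clarity", "specificity", "context_sufficiency"]
RESPONSE_CRITERIA = [
    "correctness", "completeness", "relevance",
    "coherence", "conciseness", "faithfulness",
]

@dataclass
class CriterionJudgment:
    criterion: str
    score: int       # 1-5 Likert score
    rationale: str   # natural-language rationale grounded in the rubric

def build_evaluator_prompt(prompt: str, response: str) -> str:
    """Assemble one evaluation request covering all nine axes."""
    axes = "\n".join(f"- {c}" for c in PROMPT_CRITERIA + RESPONSE_CRITERIA)
    return (
        "Rate the prompt and the response on each criterion below from 1 to 5, "
        "and give a short rationale for every score, citing the rubric.\n"
        f"Criteria:\n{axes}\n\nPrompt:\n{prompt}\n\nResponse:\n{response}"
    )

def parse_judgments(evaluator_output: dict) -> list[CriterionJudgment]:
    """Convert the evaluator's (assumed JSON) output into typed judgments."""
    return [
        CriterionJudgment(criterion, int(item["score"]), item["rationale"])
        for criterion, item in evaluator_output.items()
    ]
```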
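
The ranking-preservation claim reduces to a rank-correlation check between per-model PEEM scores and conventional accuracy. A toy version with scipy is below; the numbers are illustrative, not the paper's data.

```python
from scipy.stats import pearsonr, spearmanr

# Illustrative values only; the paper reports Spearman ~0.97 and Pearson ~0.94
# across seven benchmarks and five task models.
conventional_accuracy = [0.62, 0.71, 0.55, 0.80, 0.68]  # one value per task model
peem_scores = [3.1, 3.6, 2.9, 4.2, 3.4]                 # mean rubric score per model

rho, rho_p = spearmanr(conventional_accuracy, peem_scores)
r, r_p = pearsonr(conventional_accuracy, peem_scores)
print(f"Spearman rho={rho:.2f} (p={rho_p:.3g}), Pearson r={r:.2f} (p={r_p:.3g})")
```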
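
The zero-shot prompting loop presumably feeds prompt-side rationales back into a rewrite step. A hedged sketch of one refinement round, assuming `evaluate`, `rewrite`, and `solve` are user-supplied wrappers around LLM calls and reusing the placeholder criterion names from the rubric sketch:

```python
# Placeholder prompt-side criterion names, matching the rubric sketch above.
PROMPT_CRITERIA = {"clarity", "specificity", "context_sufficiency"}

def refine_prompt(prompt, task_inputs, evaluate, rewrite, solve):
    """One round of rubric-guided, zero-shot prompt refinement.

    Assumed callables (not from the paper):
      solve(prompt, x)              -> task response for input x
      evaluate(prompt, response)    -> iterable of (criterion, score, rationale)
      rewrite(prompt, rationales)   -> revised prompt
    """
    weak_rationales = []
    for x in task_inputs:
        response = solve(prompt, x)
        for criterion, score, rationale in evaluate(prompt, response):
            # Keep rationales for low-scoring prompt-side criteria only;
            # these point at concrete wording problems in the prompt itself.
            if criterion in PROMPT_CRITERIA and score <= 3:
                weak_rationales.append(rationale)
    if not weak_rationales:
        return prompt  # nothing actionable to fix
    return rewrite(prompt, weak_rationales)
```

Repeating this round until prompt-side scores stop improving is one plausible way to realize the reported downstream-accuracy gains, though the paper's exact loop may differ.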