Calibrating Model-Based Evaluation Metrics for Summarization

arXiv cs.CL / 4/21/2026


Key Points

  • The paper addresses limitations of model-based summarization evaluation metrics, noting that they often rely on large language models and produce miscalibrated scores, which undermines their reliability.
  • It introduces a general evaluation framework that can generate individual and average proxy scores for summaries without using reference summaries, human annotations, or costly model-based metrics.
  • It proposes a calibration technique called group isotonic regression binning (GIRB) to adjust raw metric predictions so they better match ground-truth evaluation signals.
  • The authors report experiments on seven datasets showing that their approach consistently outperforms existing baselines, and note that, while designed for continuous-value settings such as summarization, it also applies to discrete-value tasks such as question answering.

Abstract

Recent advances in summary evaluation rely on model-based metrics to assess quality dimensions such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and predicted scores are frequently miscalibrated, limiting their reliability. Moreover, evaluating the average quality across different summaries for a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts the raw predictions to better align with ground-truth evaluation metrics. While we focus on continuous-value scenarios, such as summarization, the method is applicable to discrete-value tasks, such as question answering. Experiments on seven datasets demonstrate that our approach consistently outperforms existing baselines.
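To make the calibration idea concrete, the sketch below shows generic isotonic-regression calibration via the pool-adjacent-violators algorithm: fit a monotone map from raw metric scores to ground-truth scores on held-out pairs, then apply it to new predictions. This is only an illustration of the underlying technique, not the paper's GIRB method; the function names and the grouping-into-bins step of GIRB are not reproduced here.

```python
# Illustrative isotonic-regression calibration (pool-adjacent-violators).
# NOTE: this is a generic sketch, not the paper's GIRB implementation.

def pava(y, w=None):
    """Return the non-decreasing least-squares fit to y (PAVA)."""
    w = w or [1.0] * len(y)
    blocks = []  # each block: [level, weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge adjacent blocks while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            l2, w2, c2 = blocks.pop()
            l1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(l1 * w1 + l2 * w2) / wt, wt, c1 + c2])
    fit = []
    for level, _, count in blocks:
        fit.extend([level] * count)
    return fit


def calibrate(raw_scores, true_scores, query):
    """Fit a monotone map raw -> true on held-out pairs, apply it to `query`."""
    pairs = sorted(zip(raw_scores, true_scores))
    xs = [p[0] for p in pairs]
    fitted = pava([p[1] for p in pairs])
    # Step-function lookup: fitted value of the nearest point at or below x.
    out = []
    for x in query:
        idx = max((i for i, xi in enumerate(xs) if xi <= x), default=0)
        out.append(fitted[idx])
    return out
```

Because the learned map is monotone, calibration reorders nothing: summaries ranked higher by the raw metric stay ranked at least as high after calibration, while the score values are pulled toward the ground-truth scale.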