Stress Testing Factual Consistency Metrics for Long-Document Summarization

arXiv cs.CL / 4/30/2026

Key Points

  • The paper tests the reliability of six widely used reference-free factual consistency metrics when applied to long-document summarization, a setting where short-form metrics break down due to input-length limits and long-range dependencies.
  • It evaluates metric robustness using seven meaning-preserving summary perturbations (paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source-text insertion) and studies how sensitivity changes with retrieval context and claim information density; a minimal sketch of this robustness check follows the list.
  • Experiments on three long-form benchmark datasets (science fiction, legal, and scientific domains) show that metrics designed for short summaries give inconsistent scores for semantically equivalent outputs and become less dependable for information-dense claims whose content resembles many parts of the source document.
  • Increasing retrieval context can improve stability in some domains, but the study finds no metric that consistently preserves factual alignment under long-context conditions.
  • The authors propose concrete improvement directions for factuality evaluation, including multi-span reasoning, context-aware calibration, and training/evaluation on meaning-preserving variations, and they release code and reproduction materials.
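
To make the protocol concrete, here is a minimal sketch of the robustness check referenced above: score each summary before and after a meaning-preserving perturbation and record the score shift. The `score_metric` and `perturb` callables are hypothetical stand-ins, not the paper's actual metrics or perturbation pipelines.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

def robustness_gaps(
    score_metric: Callable[[str, str], float],       # (source, summary) -> factuality score
    perturbations: Dict[str, Callable[[str], str]],  # name -> meaning-preserving rewrite
    pairs: List[Tuple[str, str]],                    # (source_document, summary) pairs
) -> Dict[str, float]:
    """Mean absolute score shift per perturbation.

    Every perturbation preserves meaning, so a robust metric should score
    the original and perturbed summaries (nearly) identically; gaps far
    from zero indicate the inconsistency the paper reports.
    """
    gaps: Dict[str, float] = {}
    for name, perturb in perturbations.items():
        deltas = [
            abs(score_metric(src, summ) - score_metric(src, perturb(summ)))
            for src, summ in pairs
        ]
        gaps[name] = mean(deltas)
    return gaps

# Toy demonstration with a stand-in lexical-overlap "metric":
if __name__ == "__main__":
    def toy_metric(src: str, summ: str) -> float:
        src_words, summ_words = set(src.split()), summ.split()
        return sum(w in src_words for w in summ_words) / max(len(summ_words), 1)

    perturbs = {"synonym_swap": lambda s: s.replace("large", "big")}
    data = [("a large dog ran across the large field", "a large dog ran")]
    print(robustness_gaps(toy_metric, perturbs, data))  # nonzero gap -> not robust
```

A robust metric would yield gaps near zero for every perturbation; the paper's finding is that short-form metrics generally do not.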

Abstract

Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our findings highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.
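
To illustrate the retrieval-context analysis, the sketch below chunks a long document and retrieves the chunks most similar to the summary as input for a short-form metric; sweeping `k` is one way to probe how context size affects score stability. The fixed-width chunking, the `all-MiniLM-L6-v2` encoder (via sentence-transformers), and the default `k` are illustrative assumptions rather than the paper's reported configuration.

```python
# Assumption: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

def retrieve_context(document: str, summary: str, k: int = 4,
                     chunk_words: int = 150) -> str:
    """Return the k source chunks most semantically similar to the summary.

    Short-form metrics cannot ingest a full long document, so they are
    scored against retrieved context instead; varying k mimics the study
    of metric sensitivity to retrieval-context size.
    """
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    summ_emb = model.encode(summary, convert_to_tensor=True)
    sims = util.cos_sim(summ_emb, chunk_emb)[0]      # cosine sim per chunk
    top = sims.topk(min(k, len(chunks))).indices.tolist()
    # Preserve document order so the metric sees a coherent context.
    return " ".join(chunks[i] for i in sorted(top))
```

A natural experiment is to score the same summary at several values of `k` and check whether the metric's judgment stays stable as the retrieved context grows.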