Stress Testing Factual Consistency Metrics for Long-Document Summarization
arXiv cs.CL / 4/30/2026
Key Points
- The paper stress-tests six widely used reference-free factual consistency metrics on long-document summarization, a setting where input-length limits and long-range dependencies cause metrics built for short inputs to break down.
- It evaluates metric robustness with seven meaning-preserving summary perturbations (e.g., paraphrasing, simplification, synonym swaps, logically equivalent negations, compression, and source-text insertion) and studies how metric sensitivity varies with the amount of retrieval context and the information density of claims; a minimal sketch of this perturbation-sensitivity check follows the list.
- Experiments on three long-form benchmark datasets (science fiction, legal, and scientific) show that metrics designed for short summaries assign inconsistent scores to semantically equivalent outputs and become less reliable as claims grow more information-dense.
- Increasing retrieval context can improve stability in some domains, but the study finds no metric that consistently preserves factual alignment under long-context conditions.
- The authors propose concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training and evaluation on meaning-preserving variations, and they release code and reproduction materials.
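To make the perturbation-based protocol in the first two key points concrete, here is a minimal sketch of a perturbation-sensitivity check. It is not the paper's implementation: `toy_factuality_score` is a crude token-overlap stand-in for a real reference-free metric, and the perturbation functions are hypothetical placeholders for the paper's controlled, meaning-preserving edits.

```python
import re
from statistics import mean

def tokens(text: str) -> list[str]:
    # Crude whitespace/punctuation tokenizer, just for the toy scorer.
    return re.findall(r"\w+", text.lower())

def toy_factuality_score(source: str, summary: str) -> float:
    """Toy stand-in for a reference-free factuality metric: the fraction of
    summary tokens that also occur in the source. A real study would call an
    NLI-, QA-, or entailment-based scorer here instead."""
    src = set(tokens(source))
    summ = tokens(summary)
    return sum(t in src for t in summ) / len(summ) if summ else 0.0

def perturbation_sensitivity(source: str, summary: str, perturbations) -> float:
    """Mean absolute score shift across meaning-preserving perturbations.
    A robust metric should keep this near zero, since every perturbed
    summary expresses the same facts as the original."""
    base = toy_factuality_score(source, summary)
    return mean(abs(base - toy_factuality_score(source, p(summary)))
                for p in perturbations)

# Hypothetical meaning-preserving perturbations; the paper uses controlled
# paraphrasing, simplification, synonym swaps, negation rewrites, etc.
perturbations = [
    lambda s: s.replace("affected", "influenced"),        # synonym swap
    lambda s: s.replace("is affected by", "depends on"),  # light paraphrase
]

source = "The study measured how metric stability is affected by retrieval context."
summary = "Metric stability is affected by the amount of retrieval context."

print(f"base score:  {toy_factuality_score(source, summary):.3f}")
print(f"sensitivity: {perturbation_sensitivity(source, summary, perturbations):.3f}")
```

Even this toy setup exhibits the failure mode the paper reports for real metrics: the synonym swap and light paraphrase leave the summary's meaning intact, yet the overlap-based score shifts, so the sensitivity value is nonzero, whereas a robust metric would keep it near zero.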
Related Articles
- Building a Local AI Agent (Part 2): Six UX and UI Design Challenges (Dev.to)
- We Built a DNS-Based Discovery Protocol for AI Agents — Here's How It Works (Dev.to)
- Your first business opportunity in 3 commands: /register_directory in @biznode_bot, wait for matches, then /my_pulse to view... (Dev.to)
- Building AI Evaluation Pipelines: Automating LLM Testing from Dataset to CI/CD (Dev.to)
- Function Calling Harness 2: CoT Compliance from 9.91% to 100% (Dev.to)