LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

arXiv cs.CL / April 29, 2026


Key Points

  • The paper finds that traditional lexical overlap metrics like ROUGE and BLEU correlate weakly (or even negatively) with human judgments of summary quality across multiple domains and document lengths.
  • Task-specific neural metrics and LLM-based evaluators align much better with human assessments, especially for evaluating linguistic quality.
  • Building on these results, the paper introduces LLM-ReSum, a self-reflective summarization framework that runs an LLM evaluate-and-rewrite loop without any model fine-tuning (see the sketch after this list).
  • Experiments across three domains show LLM-ReSum can improve low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring the refined summaries in 89% of cases.
  • The work also releases PatentSumEval, a new human-annotated benchmark for legal document summarization with 180 expert-evaluated summaries, along with plans to publish code and datasets on GitHub.
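
A minimal Python sketch of such an evaluate-and-rewrite loop appears below, assuming a generic text-in/text-out LLM call. The function names (`refine_summary`, `parse_score`), prompts, 1-5 scoring scale, and stopping threshold are illustrative assumptions, not details taken from the paper.

```python
import re
from typing import Callable

def parse_score(critique: str) -> float:
    """Naive score extraction: take the first number in the critique."""
    match = re.search(r"\d+(?:\.\d+)?", critique)
    return float(match.group()) if match else 0.0

def refine_summary(
    document: str,
    summary: str,
    llm: Callable[[str], str],   # any text-in/text-out LLM call
    max_rounds: int = 3,
    threshold: float = 4.0,      # hypothetical 1-5 quality cutoff
) -> str:
    """Iteratively critique and rewrite a summary until it scores well."""
    for _ in range(max_rounds):
        critique = llm(
            "Rate this summary from 1 to 5 for factual accuracy and "
            "coverage, then list concrete problems.\n"
            f"Document:\n{document}\n\nSummary:\n{summary}"
        )
        if parse_score(critique) >= threshold:
            break  # evaluator is satisfied: stop refining
        summary = llm(
            "Rewrite the summary to fix the listed problems, staying "
            "faithful to the document.\n"
            f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
            f"Critique:\n{critique}"
        )
    return summary
```

Because both evaluation and rewriting are ordinary LLM calls, the loop requires no gradient updates, which matches the paper's claim of refinement without fine-tuning.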

Abstract

Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model fine-tuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.
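
As a hypothetical illustration of the meta-evaluation protocol, the sketch below correlates per-summary metric scores with human ratings via Kendall's tau. The scores are invented, and the metric names are placeholders; only the procedure, rank correlation computed per metric against human judgments, reflects the setup described above.

```python
# Correlate automatic metric scores with human quality ratings.
from scipy.stats import kendalltau

human = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]  # human ratings for six summaries

metric_scores = {
    # Near-flat scores carry little ranking signal (weak correlation).
    "lexical_overlap": [0.42, 0.39, 0.41, 0.40, 0.38, 0.43],
    # Scores that track the human ordering yield high correlation.
    "llm_judge":       [4.0, 2.5, 3.0, 4.5, 2.0, 4.0],
}

for name, scores in metric_scores.items():
    tau, p = kendalltau(scores, human)
    print(f"{name}: Kendall tau = {tau:.2f} (p = {p:.3f})")
```

Running this over thousands of annotated summaries, rather than six toy points, is what lets the paper rank the 14 metrics and evaluators by their agreement with human judgment.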