Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization

arXiv cs.CL · April 21, 2026


Key Points

  • The paper argues that current multi-role dialogue summarization methods over-optimize surface similarity metrics (e.g., ROUGE/BERTScore) instead of improving faithfulness and alignment with human preferences.
  • It introduces a reasoning-aware framework that distills step-by-step “cognitive-style” reasoning traces from a large teacher model, then uses staged supervised fine-tuning to initialize a summarizer.
  • The approach then applies Group Relative Policy Optimization (GRPO) with a dual-principle reward that combines metric-based signals with human-aligned criteria covering information coverage, implicit inference, factual faithfulness, and conciseness.
  • Experiments on multilingual benchmarks show comparable ROUGE/BERTScore to strong baselines, with stronger improvements in factual faithfulness and preference alignment (notably on SAMSum) and stability in semantic consistency (on CSDS).
  • The authors provide checkpoints and datasets via a Hugging Face collection, enabling replication and further research.
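The dual-principle reward described above can be sketched as a weighted blend of a surface-overlap metric and averaged human-aligned criteria scores, with GRPO's group-relative advantage computed by standardizing rewards within a sampled group. This is a minimal illustration, not the paper's implementation: the function names, the `alpha` weight, the ROUGE-L proxy, and the criteria keys are all assumptions for exposition.

```python
import statistics

def rouge_l_f1(candidate: str, reference: str) -> float:
    """Surface-overlap proxy: ROUGE-L F1 via longest common subsequence over tokens."""
    c, r = candidate.split(), reference.split()
    # Dynamic-programming LCS table
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ct == rt else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def dual_reward(candidate: str, reference: str,
                preference_scores: dict, alpha: float = 0.5) -> float:
    """Blend a metric-based signal with averaged human-aligned criteria
    (e.g., coverage, implicit inference, faithfulness, conciseness), each in [0, 1].
    `alpha` and the criteria set are illustrative assumptions, not the paper's values."""
    metric = rouge_l_f1(candidate, reference)
    preference = sum(preference_scores.values()) / len(preference_scores)
    return alpha * metric + (1 - alpha) * preference

def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantage: standardize each reward against its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

In GRPO, the group-relative advantages replace a learned value baseline: several summaries are sampled per dialogue, each is scored by `dual_reward`, and the standardized scores weight the policy-gradient update.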

Abstract

Multi-role dialogue summarization requires modeling complex interactions among multiple speakers while preserving role-specific information and factual consistency. However, most existing methods optimize for automatic metrics such as ROUGE and BERTScore, which favor surface-level imitation of references rather than genuine gains in faithfulness or alignment with human preferences. We propose a novel framework that couples explicit cognitive-style reasoning with reward-based optimization for multi-role dialogue summarization. Our method first distills structured reasoning traces (e.g., step-by-step inferences and intermediate reflections) from a large teacher model and uses them as auxiliary supervision to initialize a reasoning-aware summarizer via staged supervised fine-tuning. It then applies GRPO with a dual-principle reward that blends metric-based signals with human-aligned criteria targeting key information coverage, implicit inference, factual faithfulness, and conciseness. Experiments on multilingual multi-role dialogue benchmarks show that our method matches strong baselines on ROUGE and BERTScore. Specifically, results on CSDS confirm the framework's stability in semantic consistency, while in-depth analysis on SAMSum demonstrates clear gains in factual faithfulness and model-based preference alignment. These findings underscore the value of reasoning-aware and preference-aware training for reliable dialogue summarization. Checkpoints and datasets are available at https://huggingface.co/collections/NebulaPixel/summorchestra-multirole-summary.
