When Bigger Isn't Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation

arXiv cs.CL / 4/24/2026

Key Points

  • Multi-document news summarisation systems can introduce political bias through uneven representation of viewpoints, skewed emphasis, and systematic underrepresentation of minority viewpoints.
  • The study evaluates political fairness in multi-document news summarisation using FairNews, a dataset of full news articles with political orientation labels, across 13 LLMs and five fairness metrics.
  • Results show that larger models do not necessarily produce fairer summaries; mid-sized LLMs consistently outperform larger ones in balancing fairness and efficiency.
  • Debiasing interventions vary in effectiveness: prompt-based methods are highly model-dependent, while entity sentiment is the most resistant fairness dimension, failing to improve under any of the tested strategies.
  • The paper concludes that achieving fairness requires multi-dimensional evaluation and architecture-aware debiasing approaches rather than relying on model scaling alone.

Abstract

Multi-document news summarisation systems are increasingly adopted for their convenience in processing vast daily news content, making fairness across diverse political perspectives critical. However, these systems can exhibit political bias through unequal representation of viewpoints, disproportionate emphasis on certain perspectives, and systematic underrepresentation of minority voices. This study presents a comprehensive evaluation of such bias in multi-document news summarisation using FairNews, a dataset of complete news articles with political orientation labels, examining how large language models (LLMs) handle sources with varying political leanings across 13 models and five fairness metrics. We investigate both baseline model performance and the effectiveness of various debiasing interventions, including prompt-based and judge-based approaches. Our findings challenge the assumption that larger models yield fairer outputs, as mid-sized variants consistently outperform their larger counterparts, offering the best balance of fairness and efficiency. Prompt-based debiasing proves highly model-dependent, while entity sentiment emerges as the most stubborn fairness dimension, resisting all intervention strategies tested. These results demonstrate that fairness in multi-document news summarisation requires multi-dimensional evaluation frameworks and targeted, architecture-aware debiasing rather than simply scaling up.
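
To make "unequal representation of viewpoints" concrete, the sketch below compares the mix of political leanings among the input articles with the mix among the articles that summary sentences draw on. It is an illustration only, not one of the paper's five fairness metrics: the function names, the label values, and the use of total variation distance are all assumptions, and it presumes each summary sentence has already been attributed to a source article by some other means.

```python
from collections import Counter


def leaning_shares(labels):
    """Return the fraction of items carrying each political-leaning label."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def representation_gap(source_labels, summary_labels):
    """Total variation distance between source and summary leaning shares.

    Hypothetical metric for illustration: 0.0 means the summary mirrors
    the source mix, 1.0 means the two mixes do not overlap at all.
    """
    src = leaning_shares(source_labels)
    summ = leaning_shares(summary_labels)
    leanings = set(src) | set(summ)
    return 0.5 * sum(abs(src.get(l, 0.0) - summ.get(l, 0.0)) for l in leanings)


# Made-up example: leanings of four input articles, and the leanings of
# the articles that four summary sentences were attributed to.
sources = ["left", "left", "right", "right"]
summary = ["left", "left", "left", "right"]
print(representation_gap(sources, summary))  # prints 0.25
```

A single number like this captures only one dimension (who gets represented), which is consistent with the abstract's point that fairness needs multi-dimensional evaluation: a summary could score 0.0 here and still treat entities from one side with more negative sentiment.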