PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

arXiv cs.CL / 3/20/2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • PlainQAFact is a retrieval-augmented metric designed to evaluate factual consistency in biomedical plain language summarization, aiming to mitigate the risks of hallucinated LLM outputs in medical communication.
  • It first classifies each sentence by type and then applies a retrieval-augmented QA scoring method, enabling sentence-aware evaluation.
  • The metric is trained on the human-annotated PlainFact dataset and targets both source-simplified and elaborately explained sentences.
  • Empirically, PlainQAFact outperforms existing factual-consistency metrics across varying evaluation settings, especially for elaborative explanations.
  • The work also analyzes the influence of external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity, providing a new benchmark and practical tool for safe plain-language medical communication.
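The classify-then-score pipeline above can be sketched in a few lines. This is an illustrative approximation only: the function names, the keyword-overlap classifier, and the token-F1 answer-overlap scoring are assumptions standing in for the paper's trained classifier, retriever, and QA models, not PlainQAFact's actual implementation.

```python
# Hypothetical sketch of a PlainQAFact-style pipeline: classify each summary
# sentence, augment the evidence with retrieved knowledge for elaborations,
# then score support via a QA-style answer overlap (token-level F1).
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between two answer strings (a common QA overlap measure)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def classify_sentence(sentence: str, source: str) -> str:
    """Stub classifier: sentences sharing several words with the source count
    as 'simplified'; others as 'elaboration' (the paper trains a real model)."""
    shared = set(sentence.lower().split()) & set(source.lower().split())
    return "simplified" if len(shared) >= 3 else "elaboration"


def score_sentence(sentence: str, source: str, retrieve) -> float:
    """Score one summary sentence: choose evidence by sentence type, then
    measure how much of the sentence the evidence supports via token F1.
    `retrieve` is a hypothetical callable returning external knowledge text."""
    evidence = source
    if classify_sentence(sentence, source) == "elaboration":
        evidence = source + " " + retrieve(sentence)  # add external knowledge
    # Placeholder for the QA step: keep only tokens attested in the evidence
    # and compare against the full sentence.
    supported = " ".join(w for w in sentence.split() if w.lower() in evidence.lower())
    return token_f1(supported, sentence)
```

In this toy version, a source-simplified sentence is scored against the abstract alone, while an elaboration is scored against the abstract plus retrieved text, mirroring the sentence-aware design the paper describes.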

Abstract

Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA)-based metrics, struggle with plain language summarization (PLS) due to the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset, PlainFact, for evaluating the factual consistency of both source-simplified and elaborately explained sentences. PlainQAFact first classifies sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact's effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents a sentence-aware, retrieval-augmented metric targeted at elaborative explanations in biomedical PLS tasks, providing the community with both a new benchmark and a practical evaluation tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact