Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation

arXiv cs.CL / 4/6/2026


Key Points

  • The paper addresses the difficulty of evaluating long-form LLM factuality when outputs are open-ended and contain many fine-grained claims.
  • It argues that existing claim-based evaluators overemphasize precision and largely miss recall—the extent to which the model covers the relevant facts that should appear.
  • The authors propose a framework that jointly measures precision and recall by generating reference facts from external knowledge sources and checking whether those facts are present in the generated text.
  • An importance-aware weighting scheme is introduced to prioritize facts based on relevance and salience during evaluation.
  • The analysis finds that current LLMs score much higher on precision than on recall, indicating that factual incompleteness is a key limitation in long-form generation; recall gaps are largest for less salient facts, while highly important facts are covered more reliably.
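The scoring scheme the key points describe can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each atomic claim has already been verified against a knowledge source, each reference fact has been checked for coverage in the generated text, and each carries an importance weight from the relevance/salience scheme. The function names and the `(flag, weight)` tuple representation are hypothetical.

```python
# Hedged sketch of importance-weighted precision and recall for
# claim-based factuality evaluation. Verification and fact generation
# are assumed to have happened upstream.

def weighted_precision(claims):
    """claims: list of (is_supported, weight) for atomic claims
    decomposed from the model's response."""
    total = sum(w for _, w in claims)
    if total == 0:
        return 0.0
    return sum(w for ok, w in claims if ok) / total

def weighted_recall(reference_facts):
    """reference_facts: list of (is_covered, weight) for facts built
    from external knowledge that the response should contain."""
    total = sum(w for _, w in reference_facts)
    if total == 0:
        return 0.0
    return sum(w for ok, w in reference_facts if ok) / total

def f1(p, r):
    """Harmonic mean, combining the two dimensions into one score."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

With uniform weights this reduces to plain claim precision and fact recall; raising the weight of salient reference facts makes missing them cost proportionally more recall, which is the effect the importance-aware scheme is after.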

Abstract

Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, i.e., whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in the generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.