ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

arXiv cs.CL / 4/14/2026


Key Points

  • The paper introduces ReFEree, a reference-free, fine-grained method for evaluating factual consistency in real-world code summarization. It targets two things earlier methods handle poorly: multi-sentence descriptions of functionality and dependency context.
  • ReFEree defines code-summary-specific factual inconsistency criteria and evaluates them at a segment level using dependency information, then aggregates segment results into a fine-grained score.
  • The authors construct a benchmark for code summarization with human-annotated factual consistency labels to support evaluation and comparison.
  • Experimental results show ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15–18% over the prior state of the art, and the code/data are released publicly.
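
The segment-level evaluation and aggregation described in the key points can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sentence-based segmentation, the aggregation rule (fraction of consistent segments), and every name in it (`split_into_segments`, `score_summary`, `toy_judge`, `SegmentResult`) are assumptions made for illustration. In practice the judge would apply the paper's inconsistency criteria; here it is a toy check that every backticked identifier in a segment appears in the code or its dependencies.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SegmentResult:
    segment: str
    consistent: bool


def split_into_segments(summary: str) -> List[str]:
    # Assumption: one segment per sentence, split after '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [p for p in parts if p]


def toy_judge(segment: str, code: str, dependencies: List[str]) -> bool:
    # Hypothetical stand-in for a real judge: a segment counts as consistent
    # if every backticked identifier it mentions occurs in the code or in
    # the dependency context.
    context = code + "\n" + "\n".join(dependencies)
    names = re.findall(r"`([^`]+)`", segment)
    return all(name in context for name in names)


def score_summary(
    code: str,
    dependencies: List[str],
    summary: str,
    judge: Callable[[str, str, List[str]], bool] = toy_judge,
) -> Tuple[float, List[SegmentResult]]:
    # Judge each segment against the code plus its dependency context.
    results = [
        SegmentResult(seg, judge(seg, code, dependencies))
        for seg in split_into_segments(summary)
    ]
    if not results:
        # Assumption: an empty summary scores 0.0.
        return 0.0, results
    # Assumed aggregation: the share of segments judged consistent.
    score = sum(r.consistent for r in results) / len(results)
    return score, results
```

Calling `score_summary(code, dependencies, summary)` returns the aggregate score together with the per-segment verdicts, so an inconsistent sentence can be located rather than only counted.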

Abstract

As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
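
The headline result is a correlation between metric scores and human judgments. The abstract does not name the correlation coefficient, so the sketch below assumes Spearman rank correlation, a common choice for this kind of meta-evaluation. The function names (`average_ranks`, `pearson`, `spearman`) are illustrative and not taken from the released code.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    # Assign 1-based ranks; tied values share the average of their ranks.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Plain Pearson correlation; undefined when either input is constant.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    # Spearman correlation is the Pearson correlation of the rank vectors.
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A metric whose scores order the summaries exactly as the human annotators do yields a coefficient of 1, and a fully reversed ordering yields -1.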