ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

arXiv cs.CL / 4/14/2026

💬 OpinionIdeas & Deep AnalysisTools & Practical UsageModels & Research

共有:

Key Points

The paper introduces ReFEree, a reference-free, fine-grained evaluation method for factual consistency in real-world code summarization by handling multi-sentence functionality and dependency context.
ReFEree defines code-summary-specific factual inconsistency criteria and evaluates them at a segment level using dependency information, then aggregates segment results into a fine-grained score.
The authors construct a benchmark for code summarization with human-annotated factual consistency labels to support evaluation and comparison.
Experimental results show ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15–18% over the prior state of the art, and the code/data are released publicly.

Abstract

As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionalities and fail to accurately assess dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate them at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving 15-18% over the previous state-of-the-art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.

Black Hat USA

AI Business

Black Hat Asia

AI Business

Small NSFW model for chatbot

Reddit r/LocalLLaMA

ChatGPT for Nurses: Prompts That Help You Document, Communicate, and Study

Dev.to

I Added a Stopwatch to My AI in 1 LOC Using the Livingrimoire While Corporations Need a Year

Dev.to

ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

Key Points

Abstract

Related Articles

Black Hat USA

Black Hat Asia

Small NSFW model for chatbot

ChatGPT for Nurses: Prompts That Help You Document, Communicate, and Study

I Added a Stopwatch to My AI in 1 LOC Using the Livingrimoire While Corporations Need a Year

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer