Prompt-Driven Code Summarization: A Systematic Literature Review

arXiv cs.LG / 4/20/2026


Key Points

  • The paper argues that high-quality software documentation is crucial for comprehension and maintenance, but manual creation is slow and often inconsistent.
  • It reviews how LLMs can generate code summaries from source code, while emphasizing that results strongly depend on prompt design.
  • The systematic literature review organizes prompting approaches (e.g., few-shot prompting, chain-of-thought, retrieval-augmented generation, and zero-shot learning) and assesses their reported effectiveness.
  • The authors highlight that current evidence is fragmented and that it remains unclear which prompting strategies work best across different models and conditions.
  • The review also points out evaluation gaps, noting that many studies use overlap-based metrics that may fail to reflect semantic quality, and it outlines open directions for future research and adoption.
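The prompting paradigms listed above differ mainly in what is packed into the model's input. The sketch below, which is illustrative and not taken from the paper (all names and template wordings are hypothetical), shows how each paradigm would construct a prompt for summarizing a single function:

```python
# Illustrative prompt builders for the four paradigms surveyed in the review.
# Template wordings are hypothetical; real studies vary widely in phrasing.

def zero_shot(code: str) -> str:
    # Zero-shot: a bare instruction, no examples.
    return f"Summarize the following function in one sentence:\n{code}"

def few_shot(code: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: prepend (code, summary) demonstrations before the target.
    shots = "\n\n".join(f"Code:\n{c}\nSummary: {s}" for c, s in examples)
    return f"{shots}\n\nCode:\n{code}\nSummary:"

def chain_of_thought(code: str) -> str:
    # Chain-of-thought: ask the model to reason step by step before answering.
    return ("Explain step by step what the following function does, "
            f"then give a one-sentence summary:\n{code}")

def retrieval_augmented(code: str, retrieved: list[str]) -> str:
    # RAG: include retrieved context, e.g. similar already-documented functions.
    ctx = "\n".join(retrieved)
    return f"Relevant context:\n{ctx}\n\nSummarize this function:\n{code}"

if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b"
    print(zero_shot(code))
    print(few_shot(code, [("def neg(x):\n    return -x", "Negates a number.")]))
```

The review's point is that such structural choices, not just the underlying model, drive summary quality.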

Abstract

Software documentation is essential for program comprehension, developer onboarding, code review, and long-term maintenance. Yet producing quality documentation manually is time-consuming and frequently yields incomplete or inconsistent results. Large language models (LLMs) offer a promising solution by automatically generating natural language descriptions from source code, helping developers understand code more efficiently, facilitating maintenance, and supporting downstream activities such as defect localization and commit message generation. However, the effectiveness of LLMs in documentation tasks critically depends on how they are prompted. Properly structured instructions can substantially improve model performance, making prompt engineering (the design of input prompts to guide model behavior) a foundational technique in LLM-based software engineering. Approaches such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning show promise for code summarization, yet current research remains fragmented. There is limited understanding of which prompting strategies work best, for which models, and under what conditions. Moreover, evaluation practices vary widely, with most studies relying on overlap-based metrics that may not capture semantic quality. This systematic literature review consolidates existing evidence, categorizes prompting paradigms, examines their effectiveness, and identifies gaps to guide future research and practical adoption.
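The abstract's critique of overlap-based metrics can be made concrete with a small sketch. The example below computes a simple unigram-overlap F1 score (a rough stand-in for n-gram metrics such as BLEU or ROUGE; the metric and the example sentences are my own illustration, not from the paper):

```python
# Why overlap-based metrics can miss semantic quality: a unigram-overlap F1
# rewards lexical similarity even when the meaning is wrong, and penalizes
# correct paraphrases that share few words with the reference.
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference  = "returns the sum of the two arguments"
paraphrase = "adds both inputs together"                  # correct meaning
lexical    = "returns the product of the two arguments"   # wrong meaning

# The semantically wrong candidate outscores the correct paraphrase.
print(unigram_f1(paraphrase, reference))  # → 0.0
print(unigram_f1(lexical, reference))     # → ~0.857
```

This is exactly the evaluation gap the review flags: a metric score near 0.86 for a summary that misstates the code's behavior, and 0.0 for an accurate one.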