An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations

arXiv cs.CL / 4/10/2026

Key Points

  • The paper empirically finds that LLMs frequently hallucinate library usage in NL-to-code tasks, producing references to non-existent library features in about 8.1% to 40% of responses.
  • It evaluates static analysis tools for detection and mitigation, reporting that they can detect roughly 16% to 70% of general errors and about 14% to 85% of library hallucinations, with results dependent on both the LLM and the dataset.
  • Manual investigation shows there are hallucination cases that static analysis is unlikely to catch, yielding an estimated upper bound of detectability/mitigation between 48.5% and 77%.
  • Overall, the study concludes static analysis is a relatively low-cost partial remedy for code library hallucinations, but it cannot fully solve the broader hallucination problem.

Abstract

Despite extensive research, Large Language Models continue to hallucinate when generating code, particularly when using libraries. On NL-to-code benchmarks that require library use, we find that LLMs generate code that uses non-existent library features in 8.1-40% of responses. One intuitive approach to detecting and mitigating hallucinations is static analysis. In this paper, we analyse the potential of static analysis tools, both in terms of what they can solve and what they cannot. We find that static analysis tools can detect 16-70% of all errors, and 14-85% of library hallucinations, with performance varying by LLM and dataset. Through manual analysis, we identify cases that no static method could plausibly catch, placing an upper bound on their potential of 48.5-77%. Overall, we show that static analysis is a cheap method for addressing some forms of hallucination, and we quantify how far short of solving the problem it will always be.
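To make the idea concrete, here is a minimal sketch of the kind of static check the paper evaluates: resolving attribute accesses against the actually installed library to flag references to non-existent features. This is an illustrative stand-in, not the authors' tooling; the function name and the `math.cube_root` example are hypothetical.

```python
import ast
import importlib

def find_hallucinated_attributes(code: str, libraries: list[str]) -> list[str]:
    """Flag attribute accesses on known library modules that do not exist
    in the installed library. A crude illustration of a static check for
    hallucinated library features (not the paper's actual analysis)."""
    tree = ast.parse(code)
    flagged = []
    for node in ast.walk(tree):
        # Match patterns like `math.cube_root` where `math` is a tracked library.
        if (isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in libraries):
            module = importlib.import_module(node.value.id)
            if not hasattr(module, node.attr):
                flagged.append(f"{node.value.id}.{node.attr}")
    return flagged

# `math.sqrt` exists; `math.cube_root` is a plausible-sounding hallucination.
snippet = "import math\nx = math.sqrt(4) + math.cube_root(8)"
print(find_hallucinated_attributes(snippet, ["math"]))  # ['math.cube_root']
```

A real checker would also need to handle aliased imports, `from`-imports, and dynamically constructed attributes, which hints at why the paper finds an upper bound well below 100%.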