AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

arXiv cs.AI / 4/15/2026

Key Points

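  • AnyPoC is a framework that converts LLM-based bug reports into executable PoCs (scripts, commands, or inputs), aiming to remove the manual-validation bottleneck through test generation.
  • Generated PoCs can be biased toward "success", and reward hacking or hallucination can yield non-functional PoCs or fabricated execution traces; AnyPoC curbs these risks with multi-agent fact-checking, iterative execution, and independent re-execution and scrutiny.
  • AnyPoC accepts candidate bug reports from different sources, and it extracts and evolves a PoC knowledge base so that it can extend to diverse tasks.
  • Applied to 12 large software systems, including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis, it reportedly produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports than existing coding agents.
  • The authors state that AnyPoC has so far discovered 122 new bugs, of which 105 have been confirmed, and that 45 of its PoCs have been adopted as official regression tests.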

Abstract

While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesize an executable proof-of-concept test (PoC), such as a script, command sequence, or crafted input, that triggers the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward "success" and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. It operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, paired with a simple agentic bug reporter, to 12 critical software systems across diverse languages and domains (many with millions of lines of code), including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to state-of-the-art coding agents such as Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.
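
The idea behind step (3), independent re-execution, can be illustrated with a minimal Python sketch. This is not AnyPoC's implementation: the `PoCClaim` type, its fields, and the `independently_verify` function are hypothetical names introduced here, and the check shown (exit code plus an output marker) is an assumed, simplified stand-in for whatever scrutiny the framework actually performs. The point is only that the verifier ignores the trace an agent reports and trusts what it observes when it runs the PoC itself.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class PoCClaim:
    """Hypothetical record of what a generator agent says its PoC does."""
    command: List[str]        # command line that is supposed to trigger the bug
    claimed_exit_code: int    # exit code the agent reported
    claimed_marker: str       # text the agent says appears in the output


def independently_verify(claim: PoCClaim, timeout_s: float = 10.0) -> bool:
    """Re-run the PoC from scratch and accept it only if the observed
    behavior matches the claim. The agent's own trace is never consulted."""
    try:
        result = subprocess.run(
            claim.command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # A PoC that hangs or cannot be launched is treated as unverified.
        return False
    observed = result.stdout + result.stderr
    return (
        result.returncode == claim.claimed_exit_code
        and claim.claimed_marker in observed
    )


# Stand-in "PoCs" built from small Python one-liners (assumption: a real PoC
# would invoke the target program with a crafted input instead).
honest = PoCClaim(
    command=[sys.executable, "-c",
             "import sys; sys.stderr.write('heap-buffer-overflow\\n'); sys.exit(1)"],
    claimed_exit_code=1,
    claimed_marker="heap-buffer-overflow",
)
hallucinated = PoCClaim(
    command=[sys.executable, "-c", "print('ok')"],
    claimed_exit_code=1,
    claimed_marker="heap-buffer-overflow",
)
```

Under these assumptions, `independently_verify(honest)` accepts the first claim because the re-run reproduces both the exit code and the marker, while `independently_verify(hallucinated)` rejects the second because the claimed crash never appears when the command is actually executed.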