BLAST: Benchmarking LLMs with ASP-based Structured Testing

arXiv cs.AI / 4/27/2026


Key Points

  • The paper introduces BLAST, the first dedicated benchmarking methodology and dataset for evaluating how accurately LLMs generate Answer Set Programming (ASP) code.
  • BLAST uses a structured evaluation framework that includes two new semantic metrics specifically designed to assess ASP code generation quality.
  • The authors report an empirical study testing eight state-of-the-art LLMs on ten well-known graph-related problems drawn from the ASP literature.
  • The work highlights a research gap: while LLMs perform strongly on many tasks, their effectiveness for declarative paradigms such as ASP has so far received comparatively little attention.
  • Results are presented as an initial evaluation using graph-centric ASP benchmarks, aiming to enable more rigorous and comparable future assessments of LLM-to-ASP generation.

Abstract

Large Language Models (LLMs) have demonstrated remarkable performance across a broad spectrum of tasks, including natural language understanding, dialogue systems, and code generation. Despite this progress, comparatively little attention has been paid to their effectiveness in handling declarative paradigms such as Answer Set Programming (ASP). In this paper, we introduce BLAST, the first dedicated benchmarking methodology and associated dataset for evaluating the accuracy of LLMs in generating ASP code. BLAST provides a structured evaluation framework featuring two novel semantic metrics tailored to ASP code generation. The paper presents the results of an empirical evaluation involving ten well-established graph-related problems from the ASP literature and a diverse set of eight state-of-the-art LLMs.
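
To make the task concrete, below is a minimal sketch of what an LLM-to-ASP benchmark item can look like: an illustrative clingo-style encoding of graph 3-coloring (a typical graph problem, not necessarily one of BLAST's ten) and a naive answer-set-level comparison between a reference encoding and a candidate one. The `REFERENCE` program, the `answer_set_overlap` score, and the clingo-based harness are assumptions for illustration only; they are not the paper's benchmark problems or its two semantic metrics.

```python
"""Illustrative sketch: ASP code generation and answer-set-level comparison.
The encoding and the overlap score below are assumptions for illustration;
they are not BLAST's benchmark problems or its semantic metrics."""
import clingo

# A hand-written reference encoding of graph 3-coloring (clingo syntax),
# the kind of program an LLM would be prompted to produce.
REFERENCE = """
color(red). color(green). color(blue).
{ assign(N,C) : color(C) } = 1 :- node(N).
:- edge(X,Y), assign(X,C), assign(Y,C).
#show assign/2.
"""

# A small test instance: a triangle, which is 3-colorable.
INSTANCE = "node(1..3). edge(1,2). edge(2,3). edge(1,3)."


def answer_sets(program: str) -> set[frozenset[str]]:
    """Enumerate all answer sets of an ASP program as sets of shown atoms."""
    ctl = clingo.Control(["0"])  # "0" -> compute every model
    ctl.add("base", [], program)
    ctl.ground([("base", [])])
    models: set[frozenset[str]] = set()
    ctl.solve(on_model=lambda m: models.add(
        frozenset(str(a) for a in m.symbols(shown=True))))
    return models


def answer_set_overlap(reference: str, candidate: str, instance: str) -> float:
    """Hypothetical semantic score: Jaccard overlap of the answer sets the two
    encodings produce on one instance (NOT the paper's metric)."""
    ref = answer_sets(reference + instance)
    cand = answer_sets(candidate + instance)
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)


if __name__ == "__main__":
    # Comparing the reference encoding against itself yields a perfect score.
    print(answer_set_overlap(REFERENCE, REFERENCE, INSTANCE))  # -> 1.0
```

In practice such a comparison would be run over several instances per problem; the point of the sketch is simply that ASP correctness is naturally judged at the level of answer sets rather than by surface-level similarity of the generated code.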