Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

arXiv cs.CL / April 13, 2026


Key Points

  • The paper introduces Jackal, the first execution-based benchmark for translating natural language into Jira Query Language (JQL), including 100,000 validated NL–JQL pairs grounded in a live Jira instance with 200,000+ issues.
  • It argues that single-pass LLMs struggle with instance-specific categorical values, ambiguous field references, and Boolean predicate generation, because they cannot verify queries against live data.
  • To address this, the authors propose Agentic Jackal, a tool-augmented agent that uses the Jira MCP server for live execution and JiraAnchor for semantic retrieval of categorical values via embedding-based similarity.
  • Across nine frontier LLMs, single-pass models achieve an average of 43.4% execution accuracy on short queries, and the agentic approach improves 7 of 9 models, with a 9.0% relative gain on the hardest linguistic variant.
  • An ablation isolating JiraAnchor shows a large impact (categorical-value accuracy rises from 48.7% to 71.7%), and the analysis identifies semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors.
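The JiraAnchor idea in the key points, resolving a free-form natural-language mention to a categorical value that actually exists in the Jira instance via embedding similarity, can be sketched in miniature. The snippet below is illustrative only: the paper's tool presumably uses a neural embedding model, while this toy version substitutes character-trigram vectors and cosine similarity; the component names are invented.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy character-trigram "embedding"; a real JiraAnchor-style tool
    # would use a neural sentence-embedding model instead.
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def resolve_value(mention: str, instance_values: list[str]) -> str:
    # Map an NL mention onto the most similar categorical value
    # drawn from the live instance's actual value list.
    m = embed(mention)
    return max(instance_values, key=lambda v: cosine(m, embed(v)))

# Hypothetical component values for some Jira instance:
components = ["Backend API", "Mobile App (iOS)",
              "Mobile App (Android)", "Infrastructure"]
print(resolve_value("the ios app", components))  # → Mobile App (iOS)
```

A single-pass LLM has no access to `components` and must guess a string like `"iOS app"`, which a live Jira instance would reject; grounding the mention in the retrieved value list is what drives the reported jump in categorical-value accuracy.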

Abstract

Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
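The abstract's central metric, execution accuracy on a live instance, can be made concrete with a small sketch. The paper does not spell out its scoring rule, so the version below assumes the natural reading: a predicted JQL query counts as correct when it returns the same issue set as the gold query. The in-memory "instance" and the one-clause query evaluator are stand-ins for real execution via something like the Jira MCP server.

```python
def execution_accuracy(pairs, run_jql):
    # A predicted query is correct iff it retrieves the same issue
    # set as the gold query when both are executed (assumed scoring rule).
    correct = sum(
        1 for gold, pred in pairs
        if set(run_jql(gold)) == set(run_jql(pred))
    )
    return correct / len(pairs)

# Toy in-memory "Jira instance": issue key -> fields (invented data).
ISSUES = {
    "PROJ-1": {"type": "Bug", "component": "Backend API"},
    "PROJ-2": {"type": "Bug", "component": "Mobile App (iOS)"},
    "PROJ-3": {"type": "Task", "component": "Backend API"},
}

def toy_run_jql(query: str) -> list[str]:
    # Minimal evaluator for single-clause queries: field = "value".
    field, _, value = query.partition(" = ")
    return [k for k, f in ISSUES.items() if f.get(field) == value.strip('"')]

pairs = [
    ('type = "Bug"', 'type = "Bug"'),                    # identical result sets
    ('component = "Backend API"', 'component = "API"'),  # hallucinated value
]
print(execution_accuracy(pairs, toy_run_jql))  # → 0.5
```

This is also why execution grounding matters for evaluation: the second prediction is syntactically valid JQL but references a component that does not exist in the instance, a failure that string-match metrics can miss but execution exposes immediately.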