Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

arXiv cs.CL / 4/22/2026

Key Points

  • The paper presents Chat2Workflow, a benchmark for generating executable visual workflows from natural language to reduce the heavy manual engineering currently required in industrial deployments.
  • It finds that state-of-the-art language models often capture high-level intent but struggle to produce correct, stable, and executable workflows, especially when requirements are complex or change over time.
  • Chat2Workflow is constructed from a large set of real-world business workflow examples and is designed to produce outputs that can be transformed and deployed on platforms such as Dify and Coze.
  • The proposed agentic framework targets recurrent execution errors and raises the resolve rate by up to 5.34%, but a gap to real-world use remains.
  • The authors release code on GitHub, positioning Chat2Workflow as a foundation to advance industrial-grade workflow automation.

Abstract

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
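To make the idea of mitigating recurrent execution errors more concrete, the following is a minimal sketch of a generate, validate, and repair loop. It is not the authors' framework: the workflow representation (a mapping from node ids to dependency ids), the functions `validate`, `generate_with_repair`, and `stub_generate`, and the `max_rounds` parameter are all hypothetical names introduced for illustration, and the stub stands in for a language model call.

```python
from typing import Callable

# Hypothetical workflow representation: node ids mapped to the ids they depend on.
Workflow = dict[str, list[str]]


def validate(workflow: Workflow) -> list[str]:
    """Return structural errors; an empty list means the graph looks runnable."""
    errors = []
    for node, deps in workflow.items():
        for dep in deps:
            if dep not in workflow:
                errors.append(f"node '{node}' depends on unknown node '{dep}'")

    # Detect cycles with a depth-first search over the dependency edges.
    state: dict[str, int] = {}  # 1 = in progress, 2 = finished

    def visit(node: str) -> bool:
        if state.get(node) == 1:
            return True  # reached a node that is still in progress: cycle
        if state.get(node) == 2:
            return False
        state[node] = 1
        has_cycle = any(visit(d) for d in workflow[node] if d in workflow)
        state[node] = 2
        return has_cycle

    if any(visit(n) for n in workflow):
        errors.append("workflow contains a dependency cycle")
    return errors


def generate_with_repair(
    request: str,
    generate: Callable[[str, list[str]], Workflow],
    max_rounds: int = 3,
) -> tuple[Workflow, list[str]]:
    """Call `generate` repeatedly, feeding validation errors back as feedback."""
    feedback: list[str] = []
    workflow: Workflow = {}
    for _ in range(max_rounds):
        workflow = generate(request, feedback)
        feedback = validate(workflow)
        if not feedback:
            break  # no errors left, stop early
    return workflow, feedback


def stub_generate(request: str, feedback: list[str]) -> Workflow:
    """Stand-in for a model call (assumption): the first draft has a dangling edge."""
    if not feedback:
        return {"start": [], "summarize": ["fetch"], "end": ["summarize"]}
    return {
        "start": [],
        "fetch": ["start"],
        "summarize": ["fetch"],
        "end": ["summarize"],
    }


if __name__ == "__main__":
    workflow, errors = generate_with_repair("summarize a web page", stub_generate)
    print(sorted(workflow), errors)
```

In this sketch the first draft references a node that does not exist, the validator reports it, and the second round returns a graph that passes. A real system would need platform-specific checks (for example, against the node types a target such as Dify or Coze accepts) and actual execution, neither of which is modeled here.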