An End-to-End Framework for Building Large Language Models for Software Operations

arXiv cs.LG / 5/6/2026


Key Points

  • The paper introduces OpsLLM, a domain-specific LLM designed for software operations, supporting both knowledge-based question answering (QA) and root cause analysis (RCA).
  • It proposes a full end-to-end workflow for building LLMs for this domain, including Human-in-the-Loop data curation and creation of a fine-tuning dataset from operational raw data.
  • The model is trained in stages: supervised fine-tuning to form a base model, followed by reinforcement learning enhanced with a domain process reward model (DPRM) to improve RCA accuracy and reliability.
  • Experiments across RCA and QA tasks of varying difficulty show OpsLLM delivers higher performance than existing open-source and closed-source LLMs, with reported gains up to 5.7% for QA and up to 70.3% for RCA.
  • The authors plan to open-source three OpsLLM variants (7B/14B/32B) along with a 15K fine-tuning dataset to enable further research and adoption.
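The staged workflow in the points above can be sketched as a toy pipeline. Everything here is an illustrative assumption: the function names, data shapes, and the keyword-based process reward are stand-ins for the paper's actual curation tooling, training runs, and learned DPRM.

```python
# Toy sketch of an OpsLLM-style pipeline: HITL curation -> SFT -> RL with a
# domain process reward model (DPRM). All names and logic are illustrative.

def curate_with_human_loop(raw_records, approve):
    """Human-in-the-Loop curation: keep only records a reviewer approves."""
    return [r for r in raw_records if approve(r)]

def supervised_fine_tune(base_model, dataset):
    """Stage 1: SFT on the curated QA/RCA dataset (stands in for real training)."""
    return {**base_model, "sft_examples": len(dataset)}

def dprm_score(reasoning_steps):
    """Toy process reward: fraction of RCA steps citing operational evidence."""
    hits = sum(1 for s in reasoning_steps if "log" in s or "metric" in s)
    return hits / max(len(reasoning_steps), 1)

def rl_with_dprm(model, rollouts):
    """Stage 2: reinforcement learning, rewarding well-grounded RCA traces."""
    avg_reward = sum(dprm_score(r) for r in rollouts) / len(rollouts)
    return {**model, "avg_process_reward": avg_reward}

raw = [{"text": "disk full on node-3", "ok": True},
       {"text": "???", "ok": False}]
dataset = curate_with_human_loop(raw, approve=lambda r: r["ok"])
base = supervised_fine_tune({"name": "ops-base"}, dataset)
tuned = rl_with_dprm(base, rollouts=[["checked logs", "found disk metric spike"]])
print(tuned["sft_examples"], tuned["avg_process_reward"])
```

The key structural idea mirrored here is that the reward is *process-level*: the DPRM scores each intermediate reasoning step of an RCA trace rather than only the final answer, which is what the paper credits for improved accuracy and reliability.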

Abstract

In the field of software operations, Large Language Models (LLMs) have attracted increasing attention. However, existing research has not yet achieved efficient and effective end-to-end intelligent operations due to low-quality data, fragmented knowledge, and insufficient learning. To explore the potential of LLMs in software operations, we propose OpsLLM, a domain-specific LLM that supports both knowledge-based question answering (QA) and root cause analysis (RCA). Moreover, we disclose the detailed workflow for building LLMs specifically in the software operations domain. First, a Human-in-the-Loop mechanism is introduced to curate high-quality data from a large collection of operational raw data and construct a fine-tuning dataset. Then, based on the data, supervised fine-tuning is conducted to obtain a base model. Furthermore, we introduce a domain process reward model (DPRM) during the reinforcement learning stage to optimize the accuracy and reliability of the fine-tuned model on RCA tasks. Experimental results on tasks of diverse difficulties demonstrate that OpsLLM effectively learns and aligns with the infused operational domain knowledge, outperforming existing open-source and closed-source LLMs in accuracy with improvements of 0.2%–5.7% on QA tasks and 2.7%–70.3% on RCA tasks, while exhibiting strong transferability. Moreover, we will open-source three versions of OpsLLM with 7B, 14B, and 32B parameters, along with a 15K fine-tuning dataset.