Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

arXiv cs.AI / 5/7/2026


Key Points

  • Reinforcement fine-tuning (RFT) for large language models is widely used for post-training, but the training process is fragile and lacks automated failure management.
  • The paper introduces RFT-FaultBench, the first fine-grained benchmark of RFT failures, covering 5 fault families, 16 fault types, 779 training runs, 22,549 train-step records, and 1,457,288 trajectory-level records.
  • It finds that RFT failures are both detectable from training dynamics and identifiable via empirical “fault fingerprints.”
  • Building on these insights, the authors propose RFT-FM, a closed-loop framework that combines anomaly detection, failure diagnosis, and automatic remediation.
  • Experiments indicate that the benchmark reveals non-trivial, non-saturated failure patterns (including subtle faults), and that RFT-FM can detect, diagnose, and mitigate such failures effectively.
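The paper does not spell out RFT-FM's internals here, but the core claim that failures are "observable from training dynamics" can be illustrated with a minimal sketch: a rolling z-score monitor over a per-step training metric (e.g., mean reward), which is one common way an anomaly detector in such a closed loop might flag a sudden collapse. The function name and thresholds below are illustrative assumptions, not taken from the paper.

```python
from collections import deque

def rolling_zscore_anomaly(values, window=20, threshold=3.0):
    """Flag steps whose metric value deviates more than `threshold`
    standard deviations from the rolling mean of the previous
    `window` steps. Toy detector; parameters are illustrative,
    not the paper's actual RFT-FM detection rule."""
    history = deque(maxlen=window)
    anomalies = []
    for step, value in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            var = sum((x - mean) ** 2 for x in history) / window
            std = var ** 0.5
            if std > 0 and abs(value - mean) / std > threshold:
                anomalies.append(step)
        history.append(value)
    return anomalies

# A stable reward curve with a sudden collapse at step 30:
rewards = [1.0 + (0.01 if i % 2 else -0.01) for i in range(30)] \
          + [-5.0] + [1.0] * 10
print(rolling_zscore_anomaly(rewards))  # → [30]
```

In a full closed loop, a flagged step would then be handed to a diagnosis stage and, finally, to a remediation action such as a checkpoint rollback or learning-rate adjustment.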

Abstract

Reinforcement fine-tuning (RFT) has become a core paradigm for post-training large language models, yet its training process remains highly fragile. Existing efforts mainly improve reliability at the system level or address specific issues in individual subproblems by modifying RFT algorithms. Despite their effectiveness, they largely overlook the problem of failure management at the training-process level. When training goes wrong, practitioners still rely heavily on expert-driven manual inspection and correction, and automatic failure management for RFT remains largely unexplored. In this paper, we take a first step toward systematic failure management for reinforcement fine-tuning. To understand the empirical structure of RFT failures, we first construct RFT-FaultBench, the first benchmark for fine-grained failures in reinforcement fine-tuning, covering 5 fault families, 16 fault types, 779 training runs, 22,549 train-step records, and 1,457,288 trajectory-level records. Based on this benchmark, we conduct a comprehensive empirical study showing that RFT failures are both observable from training dynamics and distinguishable through their empirical fault fingerprints. Building on these findings, we propose RFT-FM, an automatic failure management framework for reinforcement fine-tuning that unifies anomaly detection, failure diagnosis, and auto remediation in a closed loop. Experimental results show that RFT-FaultBench is neither trivial nor saturated: it exhibits clear anomaly structure while still posing substantial challenges, especially under subtle fault settings. Moreover, RFT-FM shows strong capability in detecting, diagnosing, and mitigating RFT failures.
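The abstract's notion of failures being "distinguishable through their empirical fault fingerprints" suggests a matching step: compare an observed training-dynamics signature against per-fault reference patterns. A minimal sketch of that idea is below; the fault names, feature choices, and fingerprint values are hypothetical placeholders, not data from RFT-FaultBench.

```python
import math

# Hypothetical fingerprints: one signature per fault type over a few
# training-dynamics features (e.g., reward slope, KL trend, entropy
# drift). All names and values are illustrative assumptions.
FINGERPRINTS = {
    "reward_hacking":       (0.9, 0.2, -0.7),
    "kl_divergence_blowup": (-0.3, 0.95, 0.1),
    "entropy_collapse":     (-0.5, 0.1, -0.9),
}

def diagnose(signature):
    """Return the fault type whose fingerprint is nearest
    (Euclidean distance) to the observed signature."""
    return min(
        FINGERPRINTS,
        key=lambda fault: math.dist(signature, FINGERPRINTS[fault]),
    )

print(diagnose((0.85, 0.25, -0.6)))  # → reward_hacking
```

Real fingerprints would presumably be learned from the benchmark's labeled runs rather than hand-written, and the "subtle fault settings" the abstract mentions are exactly the cases where such nearest-match separation becomes hard.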