Pioneer Agent: Continual Improvement of Small Language Models in Production

arXiv cs.AI / 4/14/2026


Key Points

  • The paper argues that the hard part of deploying small language models in production is not training itself but the surrounding decisions: data curation, failure diagnosis, regression (performance degradation) avoidance, and iteration control.
  • Pioneer Agent automates this lifecycle as a closed loop; in cold-start mode, given only a task description, it jointly optimizes data acquisition, evaluation-set construction, and learning strategy while training iteratively.
  • In production mode, it diagnoses error patterns from labeled failures, generates targeted training data, and retrains under explicit regression constraints.
  • Across eight cold-start benchmarks, Pioneer Agent improves over base models by 1.6–83.8 points; on AdaptFT-Bench (a benchmark of synthetic inference logs with progressively increasing noise), it improves or preserves performance in all 7/7 scenarios, while naive retraining degrades by up to 43 points.
  • On two production-style deployments built from public benchmark tasks, it lifts intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810, and shows that effective strategies such as chain-of-thought supervision and quality-focused data curation are discovered purely from downstream feedback.
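The cold-start mode described above can be pictured as a joint search over data recipes, hyperparameters, and learning strategies, scored on an agent-built evaluation set. The sketch below is purely illustrative: the paper publishes no code, and `score` is a toy stand-in for "train a model with this configuration, then evaluate it"; all names and values are hypothetical.

```python
from itertools import product

def score(recipe, lr, strategy):
    # Toy stand-in for train-then-evaluate: rewards curated data,
    # chain-of-thought supervision, and a learning rate near 3e-5.
    base = {"raw": 0.5, "curated": 0.7}[recipe]
    bonus = {"plain": 0.0, "chain_of_thought": 0.1}[strategy]
    penalty = abs(lr - 3e-5) * 1000
    return base + bonus - penalty

def cold_start_search(recipes, lrs, strategies):
    # Jointly pick the (data, hyperparameter, strategy) configuration
    # with the best score on the held-out evaluation set.
    return max(product(recipes, lrs, strategies), key=lambda cfg: score(*cfg))

best_cfg = cold_start_search(
    recipes=["raw", "curated"],
    lrs=[1e-5, 3e-5, 1e-4],
    strategies=["plain", "chain_of_thought"],
)
```

In this toy landscape the search lands on curated data with chain-of-thought supervision, mirroring the kinds of strategies the paper reports the agent discovering from downstream feedback alone.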

Abstract

Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.
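The production-mode loop in the abstract (diagnose error patterns, synthesize targeted data, retrain, verify under a regression constraint) can be sketched as follows. This is a minimal toy model, not the paper's implementation: here a "model" is just a dict of per-task scores, and `diagnose`, `synthesize`, `retrain`, and `regression_safe` are hypothetical placeholders for the four stages the abstract names.

```python
def diagnose(failures):
    # Group labeled failures by error pattern.
    groups = {}
    for f in failures:
        groups.setdefault(f["pattern"], []).append(f)
    return groups

def synthesize(groups):
    # Pretend each diagnosed pattern yields targeted training examples.
    return [{"task": p, "n": len(fs) * 10} for p, fs in groups.items()]

def retrain(model, train_data):
    # Toy "retraining": nudge up the score of each targeted task.
    new_model = dict(model)
    for batch in train_data:
        new_model[batch["task"]] = min(1.0, new_model.get(batch["task"], 0.0) + 0.2)
    return new_model

def regression_safe(candidate, baseline, tol=0.01):
    # Explicit regression constraint: no task may drop more than tol.
    return all(candidate.get(t, 0.0) >= s - tol for t, s in baseline.items())

def adaptation_step(model, failures):
    candidate = retrain(model, synthesize(diagnose(failures)))
    # Keep the deployed model whenever the candidate regresses.
    return candidate if regression_safe(candidate, model) else model

# Starting scores echo the paper's reported pre-adaptation numbers
# (intent 84.9%, Entity F1 0.345); the failure log is made up.
deployed = {"intent": 0.849, "entity": 0.345}
failures = [{"pattern": "entity"}, {"pattern": "entity"}, {"pattern": "intent"}]
updated = adaptation_step(deployed, failures)
```

The regression gate is the key difference from naive retraining: a candidate that improves one task while degrading another beyond tolerance is rejected outright, which is how the agent preserves performance in all seven AdaptFT-Bench scenarios while naive retraining loses up to 43 points.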