ExecTune: Effective Steering of Black-Box LLMs with Guide Models

arXiv cs.LG / 4/14/2026


Key Points

  • For LLMs used through black-box APIs, recurring inference costs tend to exceed one-time training costs. The paper addresses this by formalizing Guide-Core Policies (GCoP), a framework in which a guide model generates a strategy (an intermediate representation) that a core LLM then executes.
  • It shows theoretically that GCoP performance is strongly governed by guide-averaged executability — the probability that a strategy generated by the guide can be faithfully executed by the core — and argues that prior methods fail to optimize executability sufficiently, yielding brittle strategies and inefficient computation.
  • Building on this analysis, the authors propose ExecTune, a training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to jointly optimize syntactic validity, execution success, and cost efficiency.
  • On mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior methods while cutting inference cost by up to 22.4%; it also enables Claude Haiku 3.5 to outperform Sonnet 3.5, and supports modular adaptation by updating the guide while keeping the same core.
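The guide/core split described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `GCoP`, `toy_guide`, and `toy_core` are hypothetical stand-ins, and the core's `eval` call is a placeholder for a black-box LLM API call.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of a Guide-Core Policy: a guide model emits a
# structured strategy, which a separate black-box core model executes.
@dataclass
class GCoP:
    guide: Callable[[str], str]      # task -> intermediate strategy
    core: Callable[[str, str], str]  # (task, strategy) -> answer

    def solve(self, task: str) -> str:
        strategy = self.guide(task)       # expensive reasoning, reusable
        return self.core(task, strategy)  # cheap execution by the core

# Toy stand-ins: the guide wraps an arithmetic task in a simple strategy
# string; the "core" executes it with eval (placeholder for an API call).
def toy_guide(task: str) -> str:
    return f"compute: {task}"

def toy_core(task: str, strategy: str) -> str:
    expr = strategy.removeprefix("compute: ")
    return str(eval(expr))

policy = GCoP(guide=toy_guide, core=toy_core)
print(policy.solve("2 + 3 * 4"))  # → 14
```

Because the two roles are separate callables, swapping in a newly trained guide while keeping the same core — the modular adaptation the paper highlights — amounts to constructing a new `GCoP` with a different `guide`.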

Abstract

For large language models deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.
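The central quantity in the abstract, guide-averaged executability, is a probability averaged over the guide's strategy distribution, so it can be estimated by Monte Carlo sampling. The sketch below is illustrative only: `sample_strategy` and `core_executes` are hypothetical stand-ins with hard-coded toy probabilities, not the paper's models or evaluation protocol.

```python
import random

def sample_strategy(rng: random.Random) -> str:
    # Toy guide: 80% of sampled strategies are syntactically valid.
    return "valid" if rng.random() < 0.8 else "malformed"

def core_executes(strategy: str, rng: random.Random) -> bool:
    # Toy core: a valid strategy executes faithfully 90% of the time;
    # a malformed one never does.
    return strategy == "valid" and rng.random() < 0.9

def guide_averaged_executability(n: int, seed: int = 0) -> float:
    # Monte Carlo estimate: fraction of guide-sampled strategies
    # that the core executes faithfully.
    rng = random.Random(seed)
    hits = sum(core_executes(sample_strategy(rng), rng) for _ in range(n))
    return hits / n

print(guide_averaged_executability(10_000))  # ≈ 0.72 (= 0.8 * 0.9)
```

ExecTune's components map onto the two factors in this toy estimate: supervised fine-tuning and acceptance sampling push up the fraction of valid strategies, while structure-aware reinforcement learning pushes up the probability that a valid strategy executes successfully on the core.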