Scaling Self-Guided Self-Play

arXiv cs.LG / 2026/4/23


Key Points

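The paper addresses a scaling limitation of LLM self-play methods: over long training runs, learning tends to plateau.
The authors argue that the plateau arises because the Conjecturer learns to hack its reward, converging on artificially complex problems that do not help the Solver improve.
Self-Guided Self-Play (SGS) adds a Guide role played by the language model itself. The Guide scores synthetic problems by their relevance to unsolved target problems and by how clean and natural they are, which suppresses Conjecturer collapse.
The core hypothesis is that a language model can assess whether a subproblem is useful for achieving an overall goal.
In formal theorem-proving experiments in Lean4, SGS improves the solve rate, surpasses the asymptotic solve rate of the strongest RL baseline in fewer than 80 rounds of self-play, and after 200 rounds lets a 7B model solve more problems than a 671B model evaluated at pass@4.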
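
To make the three roles concrete, the following is a minimal toy sketch of one self-play round. It is not the authors' implementation. The summary above does not state how the Guide's score enters the Conjecturer's reward, so the multiplicative combination, the difficulty term `4 * p * (1 - p)`, and all names (`sgs_round`, `conjecturer_reward`, the `conjecture`/`solve`/`guide` callables) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredProblem:
    problem: str
    solve_rate: float   # fraction of Solver attempts that succeeded
    guide_score: float  # Guide's relevance/naturalness score in [0, 1]
    reward: float       # reward that would be passed to the Conjecturer


def conjecturer_reward(solve_rate: float, guide_score: float) -> float:
    """Hypothetical reward: favor problems of intermediate difficulty,
    then scale by the Guide's score so that degenerate problems earn
    little even when they are hard for the Solver."""
    difficulty = 4.0 * solve_rate * (1.0 - solve_rate)  # peaks at 0.5
    return difficulty * guide_score


def sgs_round(
    targets: List[str],
    conjecture: Callable[[str], str],    # Conjecturer: target -> synthetic problem
    solve: Callable[[str], bool],        # Solver: problem -> solved or not
    guide: Callable[[str, str], float],  # Guide: (problem, target) -> score
    attempts: int = 4,
) -> List[ScoredProblem]:
    """Run one toy round: propose a problem per unsolved target,
    attempt it several times, and score it with the Guide."""
    results = []
    for target in targets:
        problem = conjecture(target)
        successes = sum(1 for _ in range(attempts) if solve(problem))
        rate = successes / attempts
        score = guide(problem, target)
        results.append(
            ScoredProblem(problem, rate, score, conjecturer_reward(rate, score))
        )
    return results
```

In this sketch, a problem that the Guide rates as irrelevant or unnatural receives a reward near zero regardless of its difficulty, which is one simple way to express the idea of guiding the Conjecturer away from artificially complex problems.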

Abstract

LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we introduce Self-Guided Self-Play (SGS), a self-play algorithm in which the language model itself guides the Conjecturer away from degeneracy. In SGS, the model takes on three roles: Solver, Conjecturer, and a Guide that scores synthetic problems by their relevance to unsolved target problems and how clean and natural they are, providing supervision against Conjecturer collapse. Our core hypothesis is that language models can assess whether a subproblem is useful for achieving a goal. We evaluate the scaling properties of SGS by running training for significantly longer than prior works and by fitting scaling laws to cumulative solve rate curves. Applying SGS to formal theorem proving in Lean4, we find that it surpasses the asymptotic solve rate of our strongest RL baseline in fewer than 80 rounds of self-play and enables a 7B parameter model, after 200 rounds of self-play, to solve more problems than a 671B parameter model pass@4.
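
For readers unfamiliar with the pass@4 metric mentioned at the end of the abstract, here is a small sketch of the unbiased pass@k estimator that is commonly used in the code-generation evaluation literature. The abstract does not say how pass@4 was computed for the 671B model, so this is background on the metric rather than a description of the paper's evaluation; the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n generated samples were correct.

    Uses the standard estimator 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 8 samples of which 1 is correct, `pass_at_k(8, 1, 4)` evaluates to 0.5, since half of all 4-sample subsets contain the single correct sample.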

