Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

arXiv cs.AI / 2026/4/20

Key Points

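This paper presents causal evidence that hallucination in autoregressive language models behaves as an early trajectory commitment governed by asymmetric attractor dynamics.
Using same-prompt bifurcation, in which identical inputs are sampled repeatedly, the authors show that factual and hallucinated generations can diverge as early as the first generated token, and they quantify that divergence by the size of the KL gap.
Activation patching across 28 layers reveals a strong causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts the output in 87.5% of trials, while the reverse injection recovers a hallucinated trajectory in only 33.3%.
Window (multi-step) patching suggests that correcting a hallucination requires sustained intervention across multiple generation steps, whereas corruption can be induced by a single perturbation.
Residual states at step 0, the prompt-encoding step, predict each prompt's hallucination rate, and clustering identifies several regime-like groups, one of which concentrates the bifurcating false-premise prompts.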
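
The same-prompt bifurcation measurement can be illustrated with a small sketch. It assumes the KL quantity is the divergence between the next-token distributions of two sampled runs of the same prompt, which this summary does not spell out. The toy logits, the 3-token vocabulary, and the helper names (`softmax`, `kl_divergence`, `first_divergence_step`) are all hypothetical and are not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps keeps the ratio finite if q underflows to 0."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def first_divergence_step(logits_a, logits_b, threshold=1.0):
    """Return the first step whose KL exceeds the threshold, or None."""
    for step, (la, lb) in enumerate(zip(logits_a, logits_b)):
        if kl_divergence(softmax(la), softmax(lb)) > threshold:
            return step
    return None

# Hypothetical per-step logits for two runs of the SAME prompt.
# Step 0 is identical (same prompt encoding); step 1 follows different
# sampled first tokens, so the distributions split.
run_factual = [[2.0, 1.0, 0.0], [5.0, 0.0, 0.0]]
run_hallucinated = [[2.0, 1.0, 0.0], [0.0, 5.0, 0.0]]

step0_kl = kl_divergence(softmax(run_factual[0]), softmax(run_hallucinated[0]))
step1_kl = kl_divergence(softmax(run_factual[1]), softmax(run_hallucinated[1]))
print(first_divergence_step(run_factual, run_hallucinated))  # prints 1
```

With real model outputs, the per-step logits would come from two sampled generations of one prompt. The threshold of 1.0 mirrors the "KL > 1.0 at step 1" criterion quoted in the abstract.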

Abstract

We present causal evidence that hallucination in autoregressive language models is an early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation, in which we repeatedly sample identical inputs to observe spontaneous divergence, we isolate trajectory dynamics from prompt-level confounds. On Qwen2.5-1.5B across 61 prompts spanning six categories, 27 prompts (44.3%) bifurcate with factual and hallucinated trajectories diverging at the first generated token (KL = 0 at step 0, KL > 1.0 at step 1). Activation patching across 28 layers reveals a pronounced causal asymmetry: injecting a hallucinated activation into a correct trajectory corrupts output in 87.5% of trials (layer 20), while the reverse recovers only 33.3% (layer 24); both exceed the 10.4% baseline (p = 0.025) and 12.5% random-patch control. Window patching shows correction requires sustained multi-step intervention, whereas corruption needs only a single perturbation. Probing the prompt encoding itself, step-0 residual states predict per-prompt hallucination rate at Pearson r = 0.776 at layer 15 (p < 0.001 against a 1000-permutation null); unsupervised clustering identifies five regime-like groups (eta^2 = 0.55) whose saddle-adjacent cluster concentrates 12 of the 13 bifurcating false-premise prompts, indicating that the basin structure is organized around regime commitments fixed at prompt encoding. These findings characterize hallucination as a locally stable attractor basin: entry is probabilistic and rapid, exit demands coordinated intervention across layers and steps, and the relevant basins are selected by clusterable regimes already discernible at step 0.
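
The abstract tests the step-0 probe correlation against a 1000-permutation null. A minimal sketch of that kind of test follows, assuming a two-sided permutation test on Pearson r in which the per-prompt hallucination rates are shuffled. The paper's exact procedure is not given here, and the data and the names `probe_scores` and `halluc_rates` are made up for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def permutation_p_value(xs, ys, n_perm=1000, seed=0):
    """Two-sided p-value: how often shuffled labels match or beat |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: probe predictions vs. observed per-prompt rates.
probe_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95]
halluc_rates = [0.00, 0.15, 0.10, 0.30, 0.50, 0.45, 0.70, 0.65, 0.90, 0.85]

r = pearson_r(probe_scores, halluc_rates)
p = permutation_p_value(probe_scores, halluc_rates, n_perm=1000)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

With the add-one correction, the smallest p-value that 1000 permutations can report is 1/1001, so a result of that kind can only be stated as a bound such as p < 0.001.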
