Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures

arXiv cs.AI / 2026-03-24


Key Points

  • Using the proposed notion of governability, the paper shows experimentally that the assumption that instruction-following language models' errors can be detected and corrected before execution breaks down: silent commitment failure occurs.
  • In an evaluation spanning twelve reasoning domains and six models, two of the three models evaluable for conflict detection committed confidently incorrect output with no warning, while the remaining model emitted a conflict signal detectable roughly 57 tokens before commitment under greedy decoding.
  • Benchmark accuracy does not predict governability, detection capacity and correction capacity vary independently, and identical governance scaffolds produce opposite effects across models.
  • Across architectures the spike ratio differs by up to 52x, while fine-tuning shifts it only slightly, suggesting governability may be fixed at the pretraining stage.
  • A Detection and Correction Matrix (Governable / Monitor Only / Steer Blind / Ungovernable) classifies model-task combinations into four regimes, offering a framework for safety design when operating models as agents (a minimal classification sketch follows this list).
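
To make the matrix concrete, here is a minimal sketch of how model-task combinations might be binned into the four regimes. The `detection_score` / `correction_score` inputs and the 0.5 threshold are hypothetical illustrations, not metrics or values from the paper.

```python
# Hypothetical sketch of the Detection and Correction Matrix.
# The scores and threshold are illustrative assumptions only.

def classify_regime(detection_score: float, correction_score: float,
                    threshold: float = 0.5) -> str:
    """Bin a model-task combination into one of the four regimes."""
    detects = detection_score >= threshold    # errors visible before commitment?
    corrects = correction_score >= threshold  # detected errors fixable?
    if detects and corrects:
        return "Governable"    # monitor and intervene
    if detects:
        return "Monitor Only"  # errors visible but not fixable
    if corrects:
        return "Steer Blind"   # fixable, but no runtime warning signal
    return "Ungovernable"      # silent commitment failure territory

# Example: a model that signals conflicts but resists correction.
print(classify_regime(detection_score=0.8, correction_score=0.2))  # Monitor Only
```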

Abstract

As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only +/-0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.
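
The abstract does not define "spike ratio", so the following is a speculative sketch of one plausible reading: monitor per-token next-token entropy during greedy decoding and flag steps whose entropy spikes well above a running baseline. The function names, the running-median baseline, and the alert threshold are all assumptions for illustration; the paper's actual metric may differ.

```python
# Speculative sketch of a token-level conflict-signal monitor during
# greedy decoding. "Spike ratio" here (per-step entropy over a running
# median) is an assumed definition, not the paper's.
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the next-token distribution."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

def spike_ratios(per_step_logits: list[np.ndarray]) -> np.ndarray:
    """Entropy at each decoding step divided by the running median so far."""
    entropies = np.array([token_entropy(l) for l in per_step_logits])
    ratios = np.ones_like(entropies)
    for t in range(1, len(entropies)):
        baseline = np.median(entropies[:t])
        ratios[t] = entropies[t] / max(baseline, 1e-9)
    return ratios

# Synthetic demo: confident (peaked) steps plus one near-flat "conflict" step.
rng = np.random.default_rng(0)
steps = []
for _ in range(20):
    logits = rng.normal(0.0, 1.0, size=100)
    logits[0] += 10.0                       # peaked: confident prediction
    steps.append(logits)
steps[12] = rng.normal(0.0, 0.05, size=100)  # near-uniform: entropy spike

ratios = spike_ratios(steps)
flagged = np.where(ratios > 3.0)[0]  # hypothetical alert threshold
print("flagged steps:", flagged)     # expect step 12
```

Under this reading, a model exhibiting silent commitment failure would produce a flat ratio trace even on erroneous outputs, while the governable model would show a detectable spike tens of tokens before the erroneous commitment.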
