Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures

arXiv cs.AI / 2026-03-24


Key Points

  • Using the proposed notion of governability, the paper shows experimentally that the assumption that instruction-following language models' errors can be detected and corrected before execution breaks down: silent commitment failure occurs.
  • In an evaluation spanning twelve reasoning domains and six models, two of the three models evaluable for conflict detection committed confidently incorrect output with no warning, while the remaining model emitted a conflict signal detectable roughly 57 tokens before commitment under greedy decoding.
  • Benchmark accuracy does not predict governability, detection capacity and correction capacity vary independently, and identical governance scaffolds produce opposite effects across models.
  • Across architectures the spike ratio differs by up to 52x, while fine-tuning shifts it only slightly, suggesting governability may be fixed at the pretraining stage.
  • A Detection and Correction Matrix (Governable / Monitor Only / Steer Blind / Ungovernable) classifies model-task combinations into four regimes, offering a framework for safety design when operating models as agents (a minimal classification sketch follows this list).
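
To make the matrix concrete, here is a minimal sketch of how model-task combinations might be binned into the four regimes. The `detection_score` / `correction_score` inputs and the 0.5 threshold are hypothetical illustrations, not metrics or values from the paper.

```python
# Hypothetical sketch of the Detection and Correction Matrix.
# The scores and threshold are illustrative assumptions only.

def classify_regime(detection_score: float, correction_score: float,
                    threshold: float = 0.5) -> str:
    """Bin a model-task combination into one of the four regimes."""
    detects = detection_score >= threshold    # errors visible before commitment?
    corrects = correction_score >= threshold  # detected errors fixable?
    if detects and corrects:
        return "Governable"    # monitor and intervene
    if detects:
        return "Monitor Only"  # errors visible but not fixable
    if corrects:
        return "Steer Blind"   # fixable, but no runtime warning signal
    return "Ungovernable"      # silent commitment failure territory

# Example: a model that signals conflicts but resists correction.
print(classify_regime(detection_score=0.8, correction_score=0.2))  # Monitor Only
```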

Abstract

As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only +/-0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.
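
The abstract does not define "spike ratio", so the following is a speculative sketch of one plausible reading: monitor per-token next-token entropy during greedy decoding and flag steps whose entropy spikes well above a running baseline. The function names, the running-median baseline, and the alert threshold are all assumptions for illustration; the paper's actual metric may differ.

```python
# Speculative sketch of a token-level conflict-signal monitor during
# greedy decoding. "Spike ratio" here (per-step entropy over a running
# median) is an assumed definition, not the paper's.
import numpy as np

def token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the next-token distribution."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

def spike_ratios(per_step_logits: list[np.ndarray]) -> np.ndarray:
    """Entropy at each decoding step divided by the running median so far."""
    entropies = np.array([token_entropy(l) for l in per_step_logits])
    ratios = np.ones_like(entropies)
    for t in range(1, len(entropies)):
        baseline = np.median(entropies[:t])
        ratios[t] = entropies[t] / max(baseline, 1e-9)
    return ratios

# Synthetic demo: confident (peaked) steps plus one near-flat "conflict" step.
rng = np.random.default_rng(0)
steps = []
for _ in range(20):
    logits = rng.normal(0.0, 1.0, size=100)
    logits[0] += 10.0                       # peaked: confident prediction
    steps.append(logits)
steps[12] = rng.normal(0.0, 0.05, size=100)  # near-uniform: entropy spike

ratios = spike_ratios(steps)
flagged = np.where(ratios > 3.0)[0]  # hypothetical alert threshold
print("flagged steps:", flagged)     # expect step 12
```

Under this reading, a model exhibiting silent commitment failure would produce a flat ratio trace even on erroneous outputs, while the governable model would show a detectable spike tens of tokens before the erroneous commitment.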
