Silent Commitment Failure in Instruction-Tuned Language Models: Evidence of Governability Divergence Across Architectures

arXiv cs.AI / 3/24/2026

Key Points

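Using the proposed notion of "governability", the paper shows experimentally that a common assumption fails for instruction-following language models: that errors can be detected and corrected before execution. The failure mode is called silent commitment failure.
In an evaluation of six models across twelve reasoning domains, two of the three models evaluable for conflict detection committed confident, incorrect output with no warning, while one model produced a detectable conflict signal 57 tokens before commitment under greedy decoding.
Benchmark accuracy does not predict governability, detection and correction capacity vary independently, and the same governance scaffold can produce opposite effects in different models.
The spike ratio differs by up to 52x between architectures, while fine-tuning changes it only slightly, which suggests governability may be fixed at the pretraining stage.
The Detection and Correction Matrix (Governable / Monitor Only / Steer Blind / Ungovernable) classifies model-task combinations into four regimes and offers a framework for safety design when deploying agents.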

Abstract

As large language models are deployed as autonomous agents with tool execution privileges, a critical assumption underpins their security architecture: that model errors are detectable at runtime. We present empirical evidence that this assumption fails for two of three instruction-following models evaluable for conflict detection. We introduce governability -- the degree to which a model's errors are detectable before output commitment and correctable once detected -- and demonstrate it varies dramatically across models. In six models across twelve reasoning domains, two of three instruction-following models exhibited silent commitment failure: confident, fluent, incorrect output with zero warning signal. The remaining model produced a detectable conflict signal 57 tokens before commitment under greedy decoding. We show benchmark accuracy does not predict governability, correction capacity varies independently of detection, and identical governance scaffolds produce opposite effects across models. A 2x2 experiment shows a 52x difference in spike ratio between architectures but only +/-0.32x variation from fine-tuning, suggesting governability is fixed at pretraining. We propose a Detection and Correction Matrix classifying model-task combinations into four regimes: Governable, Monitor Only, Steer Blind, and Ungovernable.
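
The four regimes of the Detection and Correction Matrix can be illustrated with a small lookup. This is a minimal sketch, not the authors' implementation. The abstract names the regimes but does not spell out which quadrant each one occupies, so the mapping below is an assumption read off the regime names: errors that are both detectable and correctable are "Governable", detectable only is "Monitor Only", correctable only is "Steer Blind", and neither is "Ungovernable". The names `Regime` and `classify_regime` are hypothetical.

```python
from enum import Enum


class Regime(Enum):
    """The four regime names listed in the abstract."""
    GOVERNABLE = "Governable"
    MONITOR_ONLY = "Monitor Only"
    STEER_BLIND = "Steer Blind"
    UNGOVERNABLE = "Ungovernable"


def classify_regime(detectable: bool, correctable: bool) -> Regime:
    """Map a (detectable, correctable) pair to a regime.

    ASSUMPTION: the quadrant assignment is inferred from the regime
    names; the source text does not define it explicitly.
    """
    if detectable and correctable:
        return Regime.GOVERNABLE
    if detectable:
        # Errors can be seen before commitment but not fixed.
        return Regime.MONITOR_ONLY
    if correctable:
        # Errors could be fixed, but nothing signals when to intervene.
        return Regime.STEER_BLIND
    return Regime.UNGOVERNABLE


if __name__ == "__main__":
    # Enumerate all four quadrants of the 2x2 matrix.
    for detectable in (True, False):
        for correctable in (True, False):
            regime = classify_regime(detectable, correctable)
            print(f"detectable={detectable!s:5} correctable={correctable!s:5} -> {regime.value}")
```

Under this reading, a model-task pair showing silent commitment failure (no warning signal before commitment) would fall into either "Steer Blind" or "Ungovernable", depending on whether its errors can be corrected once they are pointed out.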
