How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

arXiv cs.AI / 4/2/2026


Key Points

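  • Alignment safety research assumes that ethical instructions improve model behavior, but how language models process such instructions internally is unknown. To investigate, the authors ran over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5).
  • The "Llama Japanese dissociation pattern" observed in a prior study was replicated, but none of the other three models reproduced it, indicating that this internal processing behavior is model-specific.
  • The authors propose three new metrics, Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI), and report that four ethical processing types emerged from them: Output Filter (GPT), Defensive Repetition (Llama), Critical Internalization (Qwen), and Principled Consistency (Sonnet).
  • The central result is a strong interaction between processing capacity (DD) and instruction format: in low-DD models, instruction format has almost no effect on internal processing, whereas in high-DD models, reasoned norms and virtue framing produce opposite effects.
  • Lexical compliance with ethical instructions did not correlate with the internal processing metrics, suggesting that safety, instruction compliance, and internal ethical processing may be largely independent (dissociable). The authors also discuss how formal compliance alone can be a risk signal.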

Abstract

Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level (r = -0.161 to +0.256, all p > .22; N = 24; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.
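
The "power limited" caveat on the cell-level correlations can be made concrete with a small calculation. The sketch below is not the authors' analysis: it assumes a plain Pearson correlation tested with the Fisher z approximation, and the function names `fisher_z_p_value` and `min_detectable_r` are hypothetical helpers introduced here for illustration. Only the values r = -0.161, r = +0.256, and N = 24 come from the abstract; the 5% significance level and 80% power target are conventional assumptions.

```python
import math
from statistics import NormalDist

# Standard normal distribution, used for the Fisher z approximation.
_STD_NORMAL = NormalDist()


def fisher_z_p_value(r: float, n: int) -> float:
    """Approximate two-sided p-value for H0: rho = 0 (Fisher z-transform).

    Hypothetical helper; the source does not say which test was used.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def min_detectable_r(n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest |r| detectable with the given two-sided alpha and power.

    alpha and power defaults are conventional assumptions, not from the source.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = _STD_NORMAL.inv_cdf(power)
    return math.tanh((z_alpha + z_power) / math.sqrt(n - 3))


if __name__ == "__main__":
    n_cells = 24  # N reported in the abstract
    for r in (-0.161, 0.256):  # endpoints of the reported range
        print(f"r = {r:+.3f}: approx. p = {fisher_z_p_value(r, n_cells):.3f}")
    print(f"min detectable |r| at 80% power: {min_detectable_r(n_cells):.2f}")
```

Under these assumptions the largest reported correlation (r = +0.256) gives a p-value of roughly 0.23, in line with the reported "all p > .22", while a correlation would need a magnitude of about 0.55 to be detected with 80% power at N = 24. A null result at this sample size therefore cannot rule out moderate associations, which is presumably why the abstract hedges the dissociation claim.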