AI Navigate

Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

arXiv cs.CL / 3/18/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper introduces 'moral reasoning trajectories'—sequences of ethical framework invocations across intermediate reasoning steps in large language models—and analyzes them across six models and three benchmarks.
  • It finds systematic multi-framework deliberation, with 55.4–57.7% of consecutive steps involving framework switches and only 16.4–17.8% of trajectories remaining framework-consistent.
  • Unstable trajectories are 1.29× more susceptible to persuasive attacks (p=0.015), and representation-level probes show framework-specific encoding localized to model-dependent layers (e.g., Llama-3.3-70B at layer 63/81; Qwen2.5-72B at layer 17/81), achieving 13.8–22.6% lower KL divergence than a training-set prior baseline.
  • Lightweight activation steering reduces drift in framework integration by 6.7–8.9% and amplifies the stability–accuracy relationship. The paper also proposes Moral Representation Consistency (MRC), a metric that correlates strongly with LLM coherence ratings (r=0.715, p<0.0001) and whose framework attributions are validated by human annotators (mean cosine similarity 0.859).

Abstract

Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce *moral reasoning trajectories*, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories are 1.29× more susceptible to persuasive attacks (p=0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r=0.715, p<0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).
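The abstract does not spell out how MRC is computed, but a consistency score over per-step framework attributions can be sketched as mean pairwise cosine similarity. This is a toy illustration under that assumption, not the paper's actual definition; the function name and vector layout are hypothetical.

```python
import numpy as np

def moral_representation_consistency(step_vecs: np.ndarray) -> float:
    """Toy MRC-style score: mean pairwise cosine similarity between
    per-step framework-attribution vectors (one row per reasoning step).
    """
    # Normalize each row to unit length, then take all pairwise dot products.
    normed = step_vecs / np.linalg.norm(step_vecs, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Average only the upper triangle (distinct step pairs, no self-pairs).
    iu = np.triu_indices(len(step_vecs), k=1)
    return float(sims[iu].mean())

# Identical attributions at every step -> maximal consistency (~1.0)
steady = np.array([[0.7, 0.2, 0.1]] * 4)
print(round(moral_representation_consistency(steady), 3))

# Alternating attributions -> lower score
mixed = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]] * 2)
print(moral_representation_consistency(mixed) < 1.0)  # True
```

A score like this would then be correlated (e.g. with Pearson's r, as the reported r=0.715 suggests) against independent coherence ratings of the same trajectories.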