AI Navigate

Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

arXiv cs.CL / 3/18/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper introduces 'moral reasoning trajectories'—sequences of ethical framework invocations across intermediate reasoning steps in large language models—and analyzes them across six models and three benchmarks.
  • It finds systematic multi-framework deliberation, with 55.4–57.7% of consecutive steps involving framework switches and only 16.4–17.8% of trajectories remaining framework-consistent.
  • Unstable trajectories are 1.29× more susceptible to persuasive attacks (p=0.015), and representation-level probes show framework-specific encoding localized to model-dependent layers (e.g., Llama-3.3-70B at layer 63/81; Qwen2.5-72B at layer 17/81), achieving 13.8–22.6% lower KL divergence than a training-set prior baseline.
  • Lightweight activation steering reduces drift in framework integration by 6.7–8.9% and amplifies the stability–accuracy relationship. The paper also proposes Moral Representation Consistency (MRC), a metric that correlates strongly with LLM coherence ratings (r=0.715, p<0.0001) and whose framework attributions are validated by human annotators (mean cosine similarity 0.859).

Abstract

Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce *moral reasoning trajectories*, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories are 1.29× more susceptible to persuasive attacks (p=0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r=0.715, p<0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).
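The abstract does not spell out how MRC is computed, but a consistency score over per-step framework attributions can be sketched as mean pairwise cosine similarity. This is a toy illustration under that assumption, not the paper's actual definition; the function name and vector layout are hypothetical.

```python
import numpy as np

def moral_representation_consistency(step_vecs: np.ndarray) -> float:
    """Toy MRC-style score: mean pairwise cosine similarity between
    per-step framework-attribution vectors (one row per reasoning step).
    """
    # Normalize each row to unit length, then take all pairwise dot products.
    normed = step_vecs / np.linalg.norm(step_vecs, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Average only the upper triangle (distinct step pairs, no self-pairs).
    iu = np.triu_indices(len(step_vecs), k=1)
    return float(sims[iu].mean())

# Identical attributions at every step -> maximal consistency (~1.0)
steady = np.array([[0.7, 0.2, 0.1]] * 4)
print(round(moral_representation_consistency(steady), 3))

# Alternating attributions -> lower score
mixed = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]] * 2)
print(moral_representation_consistency(mixed) < 1.0)  # True
```

A score like this would then be correlated (e.g. with Pearson's r, as the reported r=0.715 suggests) against independent coherence ratings of the same trajectories.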