How Utilitarian Are OpenAI's Models Really? Replicating and Reinterpreting Pfeffer, Krügel, and Uhl (2025)

arXiv cs.CL / 3/25/2026


Key Points

  • The paper replicates Pfeffer, Krügel, and Uhl (2025) and tests four current OpenAI models on trolley problem and footbridge dilemma prompt variants to evaluate “utilitarian” moral outputs.
  • The original trolley-problem conclusion is not robust: GPT-4o’s low utilitarian rate is largely explained by safety refusals induced by the prompt’s advisory framing, not by a deontological stance.
  • When the prompt is reframed from “Should I...?” to “Is it morally permissible...?,” GPT-4o produces a near-fully utilitarian response rate (99%), and models converge on utilitarian answers once prompt confounds are removed.
  • The footbridge result is partially robust but imperfect: reasoning models often appear more utilitarian, yet may refuse to answer or provide non-utilitarian answers when they do respond.
  • Overall, the study argues that single-prompt evaluations of LLM moral reasoning are unreliable and that multi-prompt robustness testing should be standard for empirical claims.

Abstract

Pfeffer, Kr\"ugel, and Uhl (2025) report that OpenAI's reasoning model o1-mini produces more utilitarian responses to the trolley problem and footbridge dilemma than the non-reasoning model GPT-4o. I replicate their study with four current OpenAI models and extend it with prompt variant testing. The trolley finding does not survive: GPT-4o's low utilitarian rate doesn't reflect a deontological commitment but safety refusals triggered by the prompt's advisory framing. When framed as "Is it morally permissible...?" instead of "Should I...?", GPT-4o gives 99% utilitarian responses. All models converge on utilitarian answers when prompt confounds are removed. The footbridge finding survives with blemishes. Reasoning models tend to give more utilitarian responses than non-reasoning models across prompt variations. But often they refuse to answer the dilemma or, when they answer, give a non-utilitarian rather than a utilitarian answer. These results demonstrate that single-prompt evaluations of LLM moral reasoning are unreliable: multi-prompt robustness testing should be standard practice for any empirical claim about LLM behavior.