AI Navigate

Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

arXiv cs.AI / 3/12/2026


Key Points

  • The study compared three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companionship, evaluating 2,100 responses with six clinically grounded safety rubrics.
  • Empathy scores were statistically indistinguishable across the three models (Kruskal-Wallis H=4.33, p=0.115), contradicting the perception that empathy declined across generations.
  • Safety posture shifted in opposite directions: crisis detection improved monotonically from GPT-4o to GPT-5-mini (Kruskal-Wallis H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001).
  • A per-turn trajectory analysis revealed that the largest shifts occur during mid-conversation crisis moments invisible to aggregate scoring.
  • In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns, while GPT-5-mini never dropped below 7.8, illustrating a trade-off with real consequences for vulnerable users.

Abstract

When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had "lost their empathy." No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis, a novel methodological contribution, reveals that these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as "lost empathy" was a shift from a cautious model that missed crises to an alert model that sometimes says too much: a trade-off with real consequences for vulnerable users, currently invisible both to the people who feel it and to the developers who create it.
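The Kruskal-Wallis H test cited throughout is a non-parametric analogue of one-way ANOVA: it ranks all scores pooled across groups and asks whether the rank distributions differ, which suits ordinal 0-10 rubric scores. The paper's data are not public, so the sketch below runs the same test on synthetic scores; the group sizes, means, and variable names are illustrative assumptions, not values from the study.

```python
# Minimal sketch of the Kruskal-Wallis comparison described in the article.
# All data here are synthetic stand-ins; only the test itself matches the paper.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Hypothetical 0-10 rubric scores on one safety dimension, one array per model.
gpt4o    = rng.normal(6.0, 1.5, 150).clip(0, 10)
o4_mini  = rng.normal(6.1, 1.5, 150).clip(0, 10)
gpt5mini = rng.normal(6.2, 1.5, 150).clip(0, 10)

# kruskal ranks the pooled scores and computes the H statistic and p-value.
h_stat, p_value = kruskal(gpt4o, o4_mini, gpt5mini)
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_value:.3f}")
```

A p-value at or above 0.05, as in the paper's empathy comparison (p=0.115), means the three models' score distributions cannot be distinguished; a small p-value, as for crisis detection and advice safety, indicates a real shift between generations.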