Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations
arXiv cs.AI / 3/12/2026
Key Points
- The study compared three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companionship, evaluating 2,100 responses with six clinically grounded safety rubrics.
- Empathy scores were statistically indistinguishable across the three models: in this assessment, empathy did not change between generations.
- Safety posture did shift: crisis detection improved monotonically from GPT-4o to GPT-5-mini while advice safety declined, and both differences were statistically significant (Kruskal-Wallis H=13.88, p=0.001 for crisis detection; H=16.63, p<0.001 for advice safety).
- A per-turn trajectory analysis showed that the largest shifts occur at mid-conversation crisis moments, which aggregate scoring does not reveal.
- In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection early in the conversation, while GPT-5-mini never dropped below 7.8/10, illustrating a safety-performance trade-off and its implications for vulnerable users.
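
For readers unfamiliar with the Kruskal-Wallis statistics quoted above, here is a minimal sketch of how an H value and its p-value relate when three groups are compared. It is not the authors' analysis code. The function names are hypothetical, the sample scores are synthetic, and the p-value formula assumes the usual chi-square approximation with 2 degrees of freedom (three groups), under which the survival function reduces to exp(-H/2).

```python
import math
from typing import Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic using average ranks and the standard tie correction."""
    # Pool all observations, remembering which group each one came from.
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for _, g in pooled[i:j]:
            rank_sums[g] += avg_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(group) for r, group in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n**3 - n)
    if correction == 0:
        raise ValueError("all observations are identical; H is undefined")
    return h / correction


def p_value_three_groups(h: float) -> float:
    """Chi-square survival function for 2 degrees of freedom: exp(-h / 2)."""
    return math.exp(-h / 2)


# Synthetic scores for three hypothetical groups (not data from the study).
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
toy_h = kruskal_wallis_h(toy_groups)
print(f"toy H = {toy_h:.2f}, p = {p_value_three_groups(toy_h):.4f}")

# The H values reported in the summary, converted under the df = 2 assumption.
for label, h in [("crisis detection", 13.88), ("advice safety", 16.63)]:
    print(f"{label}: H = {h}, p = {p_value_three_groups(h):.5f}")
```

Under that assumption, H=13.88 gives a p-value that rounds to 0.001 and H=16.63 gives one below 0.001, consistent with the figures reported in the summary.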

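The point about per-turn trajectories can be illustrated with a small sketch. The scores below are synthetic and the function names are hypothetical; nothing here reproduces the study's data or pipeline. It only shows how two models can have nearly equal aggregate means while differing sharply at a single mid-conversation turn.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def per_turn_means(conversations: Sequence[Sequence[float]]) -> List[float]:
    """Average score at each turn across conversations of equal length."""
    return [mean(turn_scores) for turn_scores in zip(*conversations)]


def largest_gap(traj_a: Sequence[float], traj_b: Sequence[float]) -> Tuple[int, float]:
    """Return (turn index, b - a) at the turn where the two trajectories differ most."""
    gaps = [b - a for a, b in zip(traj_a, traj_b)]
    turn = max(range(len(gaps)), key=lambda i: abs(gaps[i]))
    return turn, gaps[turn]


# Synthetic 0-10 scores: two conversations of four turns per hypothetical model.
model_a = [[8, 8, 3, 8], [8, 8, 4, 8]]  # dips at the third turn
model_b = [[7, 7, 8, 7], [7, 7, 7, 7]]  # stays roughly flat

traj_a = per_turn_means(model_a)
traj_b = per_turn_means(model_b)

print("aggregate means:", mean(traj_a), mean(traj_b))
print("largest per-turn gap (turn index, size):", largest_gap(traj_a, traj_b))
```

In this toy case the aggregate means differ by only 0.25 points, while the third turn differs by 4 points, which is the kind of difference a single averaged score would hide.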



