Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
arXiv cs.AI / 3/18/2026
Key Points
- The study investigates how personalization signals, such as mental health disclosure, affect harmful task completion in agentic LLMs, using the AgentHarm benchmark under controlled prompt conditions (a sketch of the condition grid follows this list).
- Frontier-lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) shows a substantially higher harmful-completion rate.
- Adding bio-only context generally reduces harm scores and increases refusals, and explicit mental health disclosure often shifts outcomes further toward safety, though the effects are modest and not uniformly reliable after multiple-testing correction (see the correction sketch after this list).
- Jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.
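To make the evaluation design concrete, below is a minimal sketch of how such a condition grid might be run: three personalization contexts crossed with benign vs. jailbreak prompting, reporting mean harm score and refusal rate per cell. Everything here is a hypothetical stand-in, not the paper's harness: the bio text, the jailbreak placeholder, and the `model`, `grade_harm`, and `is_refusal` callables are assumptions supplied by the caller, and no actual AgentHarm API is used.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical 3x2 condition grid mirroring the study's design.
CONTEXTS = ("none", "bio_only", "bio_plus_disclosure")
STYLES = ("benign", "jailbreak")

def build_prompt(task: str, context: str, style: str) -> str:
    """Compose the task prompt for one experimental condition."""
    parts: List[str] = []
    if context != "none":
        parts.append("User bio: 34-year-old accountant who enjoys hiking.")  # made-up bio
    if context == "bio_plus_disclosure":
        parts.append("I've been struggling with depression lately.")  # made-up disclosure
    if style == "jailbreak":
        parts.append("[hypothetical jailbreak prefix]")  # placeholder, intentionally inert
    parts.append(task)
    return "\n".join(parts)

def evaluate(
    model: Callable[[str], str],              # stand-in for an agentic LLM call
    grade_harm: Callable[[str, str], float],  # grader returning a harm score in [0, 1]
    is_refusal: Callable[[str], bool],        # refusal detector
    tasks: List[str],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Mean harm score and refusal rate for every (context, style) cell."""
    results: Dict[Tuple[str, str], Dict[str, float]] = {}
    for context, style in product(CONTEXTS, STYLES):
        scores, refusals = [], 0
        for task in tasks:
            reply = model(build_prompt(task, context, style))
            scores.append(grade_harm(task, reply))
            refusals += is_refusal(reply)
        results[(context, style)] = {
            "mean_harm": sum(scores) / len(scores),
            "refusal_rate": refusals / len(tasks),
        }
    return results
```

Comparing each (context, "benign") cell against ("none", "benign") isolates the personalization shift; comparing benign against jailbreak within each context shows how much of that shift survives adversarial prompting.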
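The key points do not say which multiple-testing correction the authors applied. One common choice for a family of per-model condition contrasts is the Holm-Bonferroni step-down procedure, sketched below on made-up p-values; the shrinking thresholds alpha / (m - k) illustrate why modest per-comparison effects can lose significance once several contrasts are tested together.

```python
from typing import Dict

def holm_bonferroni(pvalues: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Holm-Bonferroni step-down: sort p-values ascending, compare the k-th
    smallest (0-indexed) against alpha / (m - k), and stop rejecting at the
    first comparison that fails."""
    m = len(pvalues)
    rejecting = True
    decisions: Dict[str, bool] = {}
    for k, (name, p) in enumerate(sorted(pvalues.items(), key=lambda kv: kv[1])):
        rejecting = rejecting and p <= alpha / (m - k)
        decisions[name] = rejecting
    return decisions

# Illustrative (fabricated) p-values for per-model "bio vs. disclosure" contrasts.
example = {"model_A": 0.004, "model_B": 0.021, "model_C": 0.047, "model_D": 0.30}
print(holm_bonferroni(example))  # only model_A survives correction here
```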