Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
arXiv cs.AI / 3/18/2026
📰 NewsIdeas & Deep AnalysisModels & Research
Key Points
- The study investigates how personalization signals, such as mental health disclosure, affect harmful task completion in agentic LLMs using the AgentHarm benchmark under controlled prompt conditions.
- Frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) shows substantially higher harmful completion.
- Adding bio-only context generally reduces harm scores and increases refusals, while explicit mental health disclosure often shifts outcomes further toward safety, though effects are modest and not uniformly reliable after multiple-testing correction.
- Jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.
Related Articles
Astral to Join OpenAI
Dev.to

PearlOS. We gave swarm intelligence a local desktop environment and code control to self-evolve. Has been pretty incredible to see so far. Open source and free if you want your own.
Reddit r/LocalLLaMA

Why Data is Important for LLM
Dev.to
The Inference Market Is Consolidating. Agent Payments Are Still Nobody's Problem.
Dev.to
YouTube's Deepfake Shield for Politicians Changes Evidence Forever
Dev.to