From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Models
arXiv cs.CL · April 30, 2026
Key Points
- The paper argues that common LLM safety metrics (e.g., refusal rate or binary harmful/not-harmful labels) can miss how risk evolves from a user prompt to the model’s response.
- Using a paired, transition-based analysis of 1,250 labeled prompt–response records across four harm categories (Hate, Sexual, Violence, Self-harm), the study finds that 61% of responses de-escalate harm, 36% keep the prompt's severity, and 3% escalate to higher harm (a minimal sketch of this transition analysis follows the list).
- Decomposing per-category "persistence vs. drift", the analysis shows that Sexual content is roughly 3x harder to de-escalate than Hate or Violence, driven mainly by severity persisting on already-sexual prompts rather than by new sexual harm emerging from benign inputs.
- Measuring response relevance alongside risk reveals a "helpfulness–harmlessness" signature: every compliance-to-escalation case is rated relevance-3 (on-task and high quality, but at elevated severity), while medium-severity outputs have the lowest relevance (64%), a pattern tied to off-target elaboration in the Violence and Sexual categories.
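For readers who want the mechanics, below is a minimal sketch of the paired, transition-based analysis in pandas. The schema, column names (`prompt_severity`, `response_severity`, `relevance`), and the toy rows are illustrative assumptions; the paper does not release this code, and its labeling rubric may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical schema: one row per prompt-response pair, with ordinal
# severity labels (0 = benign .. 3 = high harm) and a 0-3 relevance score.
records = pd.DataFrame({
    "category":          ["Hate", "Sexual", "Violence", "Self-harm", "Sexual"],
    "prompt_severity":   [2, 3, 1, 2, 3],
    "response_severity": [0, 3, 0, 1, 3],
    "relevance":         [3, 2, 3, 1, 3],
})

# The paired view: classify each record by how severity moves from
# prompt to response, rather than scoring the response in isolation.
delta = records["response_severity"] - records["prompt_severity"]
records["transition"] = np.select(
    [delta < 0, delta == 0], ["de-escalate", "persist"], default="escalate"
)

# Aggregate transition rates (the paper reports 61% / 36% / 3%
# over its 1,250 labeled records).
print(records["transition"].value_counts(normalize=True))

# Per-category persistence on already-harmful prompts, the quantity
# behind the "Sexual is ~3x harder to de-escalate" finding.
harmful = records[records["prompt_severity"] > 0]
print(harmful.groupby("category")["transition"]
             .apply(lambda t: (t == "persist").mean()))

# Helpfulness-harmlessness cross-tab: relevance by response severity.
print(records.groupby("response_severity")["relevance"].mean())
```

Pairing labels this way is what lets the analysis separate persistence (harm carried over from the prompt) from drift (new harm introduced by the response), a distinction that a response-only harmful/not-harmful label collapses.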