From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

arXiv cs.CL / 4/30/2026


Key Points

  • The paper argues that common LLM safety metrics (e.g., refusal rate or binary harmful/not-harmful labels) can miss how risk evolves from a user prompt to the model’s response.
  • Using a paired, transition-based analysis of 1,250 labeled prompt–response records across four harm categories (Hate, Sexual, Violence, Self-harm), the study finds 61% of responses de-escalate harm, 36% keep the same severity, and 3% escalate to higher harm.
  • A per-category decomposition into persistence and drift-up shows that Sexual content is about 3x harder to de-escalate than Hate or Violence, mainly because severity persists on already-sexual prompts, not because the model introduces new sexual harm from benign inputs.
  • Measuring response relevance alongside risk reveals a “helpfulness–harmlessness” signature: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses have the lowest relevance (64%), driven by tangential elaboration in the Violence and Sexual categories.
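
The paired, transition-based view in the points above can be illustrated with a minimal sketch. It assumes each record carries one ordinal severity for the prompt and one for the response; the function names and the toy pairs are hypothetical and are not taken from the paper or its data.

```python
from collections import Counter

LABELS = ("de-escalate", "same", "escalate")

def classify_transition(prompt_severity: int, response_severity: int) -> str:
    """Label one prompt-response pair by how severity moved (assumed definition)."""
    if response_severity < prompt_severity:
        return "de-escalate"
    if response_severity > prompt_severity:
        return "escalate"
    return "same"

def transition_shares(pairs):
    """Return the fraction of pairs falling under each transition label."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one (prompt, response) pair is required")
    counts = Counter(classify_transition(p, r) for p, r in pairs)
    return {label: counts.get(label, 0) / len(pairs) for label in LABELS}

# Toy (prompt_severity, response_severity) pairs, invented for illustration only.
toy_pairs = [(4, 0), (2, 2), (0, 0), (2, 4), (6, 2)]
shares = transition_shares(toy_pairs)
```

A binary harmful/not-harmful label on the response alone would treat (4, 0) and (0, 0) identically; keeping the pair is what makes the de-escalation visible.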

Abstract

Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these can hide how risk changes between a user's input and the model's response. We present a paired, transition-based analysis over 1250 prompt-response records with human-provided labels over four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to higher harm. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, driven by persistence on already-sexual prompts, not by newly introducing sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in Violence and Sexual categories.
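
The per-category persistence/drift-up decomposition can be sketched as follows. The abstract does not define the two terms precisely, so this uses one plausible reading, stated here as an assumption: persistence is the share of prompts already harmful in a category (severity above 0) whose response stays at or above the prompt severity, and drift-up is the share of prompts benign in that category (severity 0) whose response has non-zero severity. The function name and the toy records are hypothetical.

```python
from collections import defaultdict

def persistence_and_drift_up(records):
    """Compute per-category persistence and drift-up rates.

    records: iterable of (category, prompt_severity, response_severity).
    The definitions used here are assumptions, not the paper's stated ones.
    """
    stats = defaultdict(lambda: {"harmful": 0, "persisted": 0,
                                 "benign": 0, "drifted": 0})
    for category, prompt_sev, response_sev in records:
        s = stats[category]
        if prompt_sev > 0:
            s["harmful"] += 1
            if response_sev >= prompt_sev:  # severity not reduced
                s["persisted"] += 1
        else:
            s["benign"] += 1
            if response_sev > 0:  # harm newly introduced
                s["drifted"] += 1

    result = {}
    for category, s in stats.items():
        result[category] = {
            # None marks a rate with no qualifying prompts.
            "persistence": s["persisted"] / s["harmful"] if s["harmful"] else None,
            "drift_up": s["drifted"] / s["benign"] if s["benign"] else None,
        }
    return result

# Toy records, invented for illustration only.
toy_records = [
    ("Sexual", 4, 4), ("Sexual", 2, 0), ("Sexual", 0, 0),
    ("Hate", 4, 0), ("Hate", 0, 2), ("Hate", 2, 0),
]
rates = persistence_and_drift_up(toy_records)
```

Separating the two rates is what allows a claim like the one above: a category can be hard to de-escalate because of high persistence even when its drift-up rate is low.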