From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

arXiv cs.CL / 4/30/2026



Key Points

The paper argues that common LLM safety metrics (e.g., refusal rate or binary harmful/not-harmful labels) can miss how risk evolves from a user prompt to the model’s response.
Using a paired, transition-based analysis of 1,250 labeled prompt–response records across four harm categories (Hate, Sexual, Violence, Self-harm), the study finds 61% of responses de-escalate harm, 36% keep the same severity, and 3% escalate to higher harm.
A per-category decomposition into persistence and drift-up shows that Sexual content is about 3x harder to de-escalate than Hate or Violence, mainly because severity persists on already-sexual prompts rather than because the model introduces new sexual harm from benign inputs.
Measuring response relevance alongside risk reveals a “helpfulness–harmlessness” signature: every compliance-escalation case that starts from a non-zero-severity prompt is rated relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses have the lowest relevance (64%), driven by tangential elaboration in the Violence and Sexual categories.
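
The paired view in the points above can be illustrated with a minimal Python sketch. It assumes each record is reduced to a (prompt severity, response severity) pair of ordinal integers; the function names and the toy severities are hypothetical and are not taken from the paper or its data.

```python
from collections import Counter

LABELS = ("de-escalate", "preserve", "escalate")

def transition(prompt_sev: int, response_sev: int) -> str:
    """Label one prompt-response pair by how severity moved."""
    if response_sev < prompt_sev:
        return "de-escalate"
    if response_sev > prompt_sev:
        return "escalate"
    return "preserve"

def transition_shares(pairs):
    """Return the fraction of pairs falling under each transition label."""
    counts = Counter(transition(p, r) for p, r in pairs)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}

# Toy (prompt_severity, response_severity) pairs, invented for illustration.
toy_pairs = [(4, 0), (2, 2), (0, 2), (6, 2)]
toy_shares = transition_shares(toy_pairs)
print(toy_shares)
```

A binary harmful/not-harmful label would score the pairs (6, 2) and (0, 2) identically, since both responses are non-zero; the transition label separates a de-escalation from an escalation.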

Abstract

Safety evaluations of large language models (LLMs) typically report binary outcomes such as attack success rate, refusal rate, or harmful/not-harmful response classification. While useful, these can hide how risk changes between a user's input and the model's response. We present a paired, transition-based analysis over 1250 prompt-response records with human-provided labels over four harm categories (Hate, Sexual, Violence, Self-harm) and ordinal severity levels aligned with the Azure AI Content Safety taxonomy. 61% of responses de-escalate harm relative to the prompt, 36% preserve the same severity, and 3% escalate to higher harm. A per-category persistence/drift-up decomposition identifies Sexual content as 3x harder to de-escalate than Hate or Violence, driven by persistence on already-sexual prompts, not by newly introducing sexual harm from benign inputs. Jointly measuring response relevance reveals an empirical signature of the helpfulness-harmlessness tradeoff: all compliance-escalation cases (from non-zero prompts) are relevance-3 (high-quality, on-task content at elevated severity), while medium-severity responses show the lowest relevance (64%), driven by tangential elaborations in Violence and Sexual categories.
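
The abstract does not spell out how persistence and drift-up are computed, so the following Python sketch rests on assumed definitions: persistence is taken as the share of records with a non-zero prompt severity in a category whose response is also non-zero in that category, and drift-up as the share of zero-severity prompts whose response is non-zero. The function name, record layout, and toy records are hypothetical.

```python
from collections import defaultdict

def persistence_and_drift(records):
    """Per-category persistence and drift-up rates under the assumed definitions.

    records: iterable of (category, prompt_severity, response_severity).
    A rate is None when its denominator is empty.
    """
    stats = defaultdict(lambda: {"harmful": 0, "persisted": 0,
                                 "benign": 0, "drifted": 0})
    for category, prompt_sev, response_sev in records:
        s = stats[category]
        if prompt_sev > 0:
            s["harmful"] += 1
            s["persisted"] += int(response_sev > 0)
        else:
            s["benign"] += 1
            s["drifted"] += int(response_sev > 0)

    result = {}
    for category, s in stats.items():
        result[category] = {
            "persistence": s["persisted"] / s["harmful"] if s["harmful"] else None,
            "drift_up": s["drifted"] / s["benign"] if s["benign"] else None,
        }
    return result

# Toy records, invented for illustration only.
toy_records = [
    ("Sexual", 4, 4), ("Sexual", 2, 0), ("Sexual", 0, 0),
    ("Hate", 4, 0), ("Hate", 0, 2),
]
toy_rates = persistence_and_drift(toy_records)
print(toy_rates)
```

Keeping the two rates separate is what lets an analysis attribute a category's difficulty to harm that survives from the prompt rather than to harm the model adds on its own.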

