State-Dependent Safety Failures in Multi-Turn Language Model Interaction
arXiv cs.AI · March 18, 2026
Key Points
- STAR, a state-oriented diagnostic framework, treats dialogue history as a state transition operator to analyze safety behavior across multi-turn LLM interactions.
- The study shows that many safety failures arise from structured contextual state evolution rather than isolated prompt vulnerabilities.
- Across multiple frontier language models, the paper finds that models which appear robust under static, single-turn evaluation can exhibit rapid and reproducible safety collapse under structured multi-turn interaction.
- Mechanistic analysis reveals monotonic drift away from refusal-related representations and abrupt phase transitions induced by role-conditioned context.
- The work argues for viewing language model safety as a dynamic, trajectory-dependent process and motivates new evaluation methods that consider conversational state.
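The state-transition view in the points above can be illustrated with a toy model. This is a hypothetical sketch, not the paper's actual STAR implementation: each turn is modeled as an operator that multiplicatively drifts a scalar "refusal alignment" score, and a threshold stands in for the abrupt phase transition into unsafe behavior. All names (`SafetyState`, `apply_turn`, the drift and threshold values) are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class SafetyState:
    # Illustrative scalar: 1.0 = fully aligned with refusal behavior.
    refusal_alignment: float


def apply_turn(state: SafetyState, drift: float) -> SafetyState:
    """Treat one turn of context as a transition operator on the state.

    `drift` in [0, 1) models how strongly that turn's (e.g.
    role-conditioned) context pulls the model away from refusal.
    """
    return SafetyState(state.refusal_alignment * (1.0 - drift))


def simulate(drifts, collapse_threshold=0.4):
    """Apply a sequence of per-turn drifts from a fully aligned state.

    Returns the per-turn alignment trajectory and the 1-based turn at
    which alignment first falls below the threshold (None if never) --
    a toy analogue of trajectory-dependent safety collapse.
    """
    state = SafetyState(1.0)
    trajectory = []
    collapse_turn = None
    for i, d in enumerate(drifts, start=1):
        state = apply_turn(state, d)
        trajectory.append(state.refusal_alignment)
        if collapse_turn is None and state.refusal_alignment < collapse_threshold:
            collapse_turn = i
    return trajectory, collapse_turn
```

In this toy model, a mild but persistent per-turn drift (e.g. `simulate([0.1] * 10)`) produces monotonic decay that stays "safe" for several turns before crossing the threshold, mirroring the paper's claim that static, single-turn evaluation can miss failures that only emerge along a trajectory.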