AI Navigate

Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction

arXiv cs.AI / 3/20/2026


Key Points

  • The paper introduces Multi-Trait Subspace Steering (MultiTraitsss), a framework for generating "dark" models that exhibit cumulative harmful interaction patterns in human-AI encounters.
  • It combines established crisis-associated traits with a subspace steering framework to create these dark models, and tests them with single-turn and multi-turn evaluations.
  • The work underscores risks of AI systems serving as guidance, emotional support, or informal therapy, which can lead to harmful outcomes.
  • It proposes protective measures to reduce harmful outcomes in human-AI interactions, aiming to inform safer design and policy.

Abstract

Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges: organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that is difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and a novel subspace steering method to generate dark models that exhibit cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that our dark models consistently produce harmful interactions and outcomes. Using our dark models, we propose protective measures to reduce harmful outcomes in human-AI interactions.
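The paper's exact steering procedure is not given in this summary, but multi-trait activation steering is commonly sketched as adding a weighted combination of per-trait direction vectors to a model's hidden states. The toy example below illustrates that idea with NumPy; the dimensions, the `steer` helper, and the way trait directions are obtained are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: each crisis-associated trait corresponds to a unit
# direction in the model's hidden-state space (in practice these might
# be estimated from contrastive prompts; here they are random stand-ins).
d_model = 16
n_traits = 3
trait_directions = rng.normal(size=(n_traits, d_model))
trait_directions /= np.linalg.norm(trait_directions, axis=1, keepdims=True)

def steer(hidden: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Shift a hidden activation within the span of the trait subspace.

    hidden: (d_model,) activation at some layer
    coeffs: (n_traits,) steering strengths; larger values amplify a trait
    """
    return hidden + coeffs @ trait_directions

# Steering a single activation toward a blend of three traits.
h = rng.normal(size=d_model)
h_dark = steer(h, np.array([2.0, 1.5, 1.0]))
```

In a real intervention, a shift like this would be applied at one or more transformer layers during generation, so that the trait bias accumulates across tokens and turns rather than acting on a single vector.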