Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

arXiv cs.CL / April 13, 2026


Key Points

  • The paper investigates why LLM safety measures are brittle by using targeted weight pruning to test how harmful behavior is organized internally.
  • It finds that harmful content generation relies on a compact subset of weights that is shared across multiple harm types and is distinct from benign capabilities.
  • Aligned models compress the “harm generation” weights more than unaligned models, suggesting alignment changes harmful representations internally even if surface-level guardrails remain bypassable.
  • The authors connect this compression to “emergent misalignment,” arguing that fine-tuning in one narrow domain can activate compressed harmful-capability weights and generalize into broader misbehavior.
  • Pruning harm-related weights in a narrow domain substantially reduces emergent misalignment, and the capability to generate harmful content appears dissociated from the model’s ability to recognize or explain such content.

Abstract

Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that is general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm-generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if the weights underlying harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm-generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs' harmful-generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
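
To make the methodology concrete, here is a minimal toy sketch of targeted weight pruning as a causal intervention: rank individual weights by an importance score and zero out the top fraction, then observe how behavior changes. The magnitude-times-activation score, the `frac` parameter, and the function name are illustrative assumptions, not the paper's actual criterion.

```python
import numpy as np

def prune_by_importance(W, X, frac=0.05):
    """Zero out the top `frac` fraction of weights in W, ranked by a
    simple importance proxy: |w_ij| * mean|x_j| (weight magnitude
    scaled by the typical magnitude of its input activation).
    Hypothetical scoring rule for illustration only."""
    act = np.abs(X).mean(axis=0)                   # mean |activation| per input dim
    score = np.abs(W) * act[None, :]               # one importance value per weight
    k = int(frac * W.size)                         # number of weights to prune
    top = np.argpartition(score.ravel(), -k)[-k:]  # indices of the top-k scores
    mask = np.ones(W.size, dtype=bool)
    mask[top] = False                              # drop the most "important" weights
    return W * mask.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))    # toy weight matrix
X = rng.normal(size=(32, 16))   # toy calibration activations
Wp = prune_by_importance(W, X, frac=0.1)
print(int((Wp == 0).sum()))     # count of zeroed weights
```

In the paper's setting, the pruned subset targets harm-related weights specifically; the causal claim comes from comparing model behavior (e.g., harmful-generation rates) before and after the ablation.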