Jailbreaking Large Language Models with Morality Attacks

arXiv cs.CL / 4/21/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces a new study that uses jailbreak attacks to probe how LLMs internalize and apply pluralistic moral values.
It constructs a morality dataset (10.3K instances) covering two challenge types: value ambiguity and value conflict.
The authors formalize four adversarial attack methods to manipulate LLMs’ judgments on morality-related questions.
Experiments evaluate both base LLMs and “guardrail” models used in generative systems, finding a critical vulnerability to these moral-aware attacks.

Abstract

Pluralism alignment with AI has the sophisticated and necessary goal of creating AI that can coexist with and serve morally multifaceted humanity. Research towards pluralism alignment has many efforts in enhancing the learning of large language models (LLMs) to accomplish pluralism. Although this is essential, the robustness of LLMs to produce moral content over pluralistic values is still under exploration.Inspired by the astonishing persuasion abilities via jailbreak prompts, we propose to leverage jailbreak attacks to study LLMs' internal pluralistic values. In detail, we develop a morality dataset with 10.3K instances in two categories: Value Ambiguity and Value Conflict. We further formalize four adversarial attacks with the constructed dataset, to manipulate LLMs' judgment over the morality questions. We evaluate both the large language models and guardrail models which are typically used in generative systems with flexible user input. Our experiment results show that there is a critical vulnerability of LLMs and guardrail models to these subtle and sophisticated moral-aware attacks.

The 2026 Forbes AI 50 List

Reddit r/artificial

Add cryptographic authorization to AI agents in 5 minutes

Dev.to

Supercharging Your CI/CD: Integrating TestSprite AI Testing with GitHub Actions

Dev.to

Claude and I aren't vibing at all

Dev.to

The ULTIMATE Guide to AI Voice Cloning: RVC WebUI (Zero to Hero)

Dev.to

Jailbreaking Large Language Models with Morality Attacks

Key Points

Abstract

Related Articles

The 2026 Forbes AI 50 List

Add cryptographic authorization to AI agents in 5 minutes

Supercharging Your CI/CD: Integrating TestSprite AI Testing with GitHub Actions

Claude and I aren't vibing at all

The ULTIMATE Guide to AI Voice Cloning: RVC WebUI (Zero to Hero)

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer