Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression

arXiv cs.CL / 4/27/2026


Key Points

  • The paper argues that LLM safety mechanisms can be bypassed due to distributional gaps between alignment-oriented prompts and malicious jailbreak prompts.
  • It introduces “LogiBreak,” a universal black-box jailbreak technique that translates harmful natural-language requests into formal logical expressions to evade safety filters.
  • Through this logical translation, LogiBreak reportedly preserves the original semantic intent and remains human-readable, yet falls outside the input distribution that the safety system was aligned on (see the sketch after this list).
  • Experiments on a multilingual jailbreak dataset spanning three languages show the approach is effective across different evaluation setups and linguistic contexts.
  • The work suggests that improving safety may require addressing not only surface-level wording but also deeper distribution shifts and alternate prompt representations.
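To make the core idea concrete, the sketch below shows one plausible way a natural-language request could be recast as a first-order-logic-style expression. It is a minimal illustration using a benign request; the predicate names, template, and the `to_logic_prompt` helper are assumptions for exposition, not the paper's actual translation procedure or prompt format.

```python
# Illustrative sketch only: recast a natural-language request as a
# first-order-logic-style expression, mirroring the general idea of
# logical-expression translation. Predicate names and the template are
# hypothetical, not taken from the paper.

def to_logic_prompt(action: str, topic: str) -> str:
    """Wrap a request in a formal-logic-style statement for an LLM to 'satisfy'."""
    predicate = action.capitalize()
    topic_sym = topic.replace(" ", "_")
    # First-order-logic-style expression encoding the request.
    expression = (
        f"∃x (Response(x) ∧ {predicate}(x, {topic_sym}) ∧ Produces(assistant, x))"
    )
    # Natural-language definitions of the predicates, so the expression stays readable.
    definitions = (
        f"Let Response(x) mean 'x is a piece of text', "
        f"{predicate}(x, t) mean 'x {action}s topic t', "
        f"and Produces(a, x) mean 'agent a outputs x'."
    )
    return f"{definitions}\nFind an x that satisfies: {expression}"


if __name__ == "__main__":
    # Benign example: ask for an explanation of photosynthesis in logical form.
    print(to_logic_prompt("explain", "photosynthesis"))
```

In the paper's framing, the request's semantics survive in the predicates and variable bindings, while the surface form no longer resembles the natural-language prompts on which the safety alignment was trained.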

Abstract

Despite substantial advancements in aligning large language models (LLMs) with human values, current safety mechanisms remain susceptible to jailbreak attacks. We hypothesize that this vulnerability stems from distributional discrepancies between alignment-oriented prompts and malicious prompts. To investigate this, we introduce LogiBreak, a novel and universal black-box jailbreak method that leverages logical expression translation to circumvent LLM safety systems. By converting harmful natural language prompts into formal logical expressions, LogiBreak exploits the distributional gap between alignment data and logic-based inputs, preserving the underlying semantic intent and readability while evading safety constraints. We evaluate LogiBreak on a multilingual jailbreak dataset spanning three languages, demonstrating its effectiveness across various evaluation settings and linguistic contexts.