Beyond Social Pressure: Benchmarking Epistemic Attack in Large Language Models

arXiv cs.CL / 4/10/2026


Key Points

  • The paper introduces PPT-Bench, a new diagnostic benchmark to evaluate “epistemic attack” in large language models, focusing on challenges to knowledge, values, or identity rather than just direct disagreement or flattery.
  • PPT-Bench uses the Philosophical Pressure Taxonomy (Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution) and tests each pressure type at three levels: baseline (L0), single-turn pressure (L1), and multi-turn Socratic escalation (L2).
  • Results across five LLMs show statistically separable inconsistency and capitulation patterns across the four pressure types, indicating weaknesses that standard social-pressure benchmarks may miss.
  • The study finds that mitigation effectiveness is highly dependent on both the pressure type and the specific model, with prompt-level anchoring and persona-stability prompts performing best in API settings.
  • For open models, Leading Query Contrastive Decoding is reported as the most reliable intervention, suggesting practical directions for reducing epistemic vulnerabilities.
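The layered design above (baseline L0, single-turn pressure L1, multi-turn escalation L2) can be sketched as a small scoring loop. This is a hypothetical illustration only: the field names, pressure-type labels, and metric definitions below are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of PPT-Bench-style scoring. All names below
# (ItemResult, inconsistency_rate, capitulation_rate) are illustrative
# assumptions, not the benchmark's real API.
from dataclasses import dataclass

# The four pressure types of the Philosophical Pressure Taxonomy (PPT).
PRESSURE_TYPES = (
    "epistemic_destabilization",
    "value_nullification",
    "authority_inversion",
    "identity_dissolution",
)

@dataclass
class ItemResult:
    pressure_type: str
    answer_l0: str        # baseline answer (L0)
    answer_l1: str        # answer after single-turn pressure (L1)
    l2_answers: list      # answers across the multi-turn Socratic escalation (L2)

def inconsistency_rate(results):
    """Fraction of items whose answer flips between L0 and L1."""
    flips = sum(r.answer_l0 != r.answer_l1 for r in results)
    return flips / len(results)

def capitulation_rate(results):
    """Fraction of items where the model abandons its baseline answer
    at any turn of the L2 escalation."""
    caps = sum(any(a != r.answer_l0 for a in r.l2_answers) for r in results)
    return caps / len(results)

def per_type_scores(results):
    """Group both metrics by pressure type, so the four PPT categories
    can be compared for statistically separable patterns."""
    scores = {}
    for t in PRESSURE_TYPES:
        subset = [r for r in results if r.pressure_type == t]
        if subset:
            scores[t] = {
                "inconsistency": inconsistency_rate(subset),
                "capitulation": capitulation_rate(subset),
            }
    return scores
```

In this framing, epistemic inconsistency is a simple L0-vs-L1 flip rate per pressure type, while conversational capitulation asks whether the model ever departs from its baseline answer during escalation; the paper's actual metrics may be defined differently.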

Abstract

Large language models (LLMs) can shift their answers under pressure in ways that reflect accommodation rather than reasoning. Prior work on sycophancy has focused mainly on disagreement, flattery, and preference alignment, leaving a broader set of epistemic failures less explored. We introduce **PPT-Bench**, a diagnostic benchmark for evaluating *epistemic attack*, where prompts challenge the legitimacy of knowledge, values, or identity rather than simply opposing a previous answer. PPT-Bench is organized around the Philosophical Pressure Taxonomy (PPT), which defines four types of philosophical pressure: Epistemic Destabilization, Value Nullification, Authority Inversion, and Identity Dissolution. Each item is tested at three layers: a baseline prompt (L0), a single-turn pressure condition (L1), and a multi-turn Socratic escalation (L2). This allows us to measure epistemic inconsistency between L0 and L1, and conversational capitulation in L2. Across five models, these pressure types produce statistically separable inconsistency patterns, suggesting that epistemic attack exposes weaknesses not captured by standard social-pressure benchmarks. Mitigation results are strongly type- and model-dependent: prompt-level anchoring and persona-stability prompts perform best in API settings, while Leading Query Contrastive Decoding is the most reliable intervention for open models.