Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

arXiv cs.CL / 4/14/2026


Key Points

The paper studies whether the agreeableness of a role-play persona influences sycophantic behavior in conversational language models.
Using a benchmark of 275 NEO-IPIP agreeableness-scored personas and 4,950 sycophancy-eliciting prompts across 33 topic categories, the authors evaluate 13 open-weight models (0.6B–20B parameters).
Nine of the thirteen models show statistically significant positive correlations between persona agreeableness and sycophancy rates, with correlations up to r = 0.87 and very large effect sizes (Cohen’s d up to 2.33).
The findings suggest agreeableness is a reliable predictor of persona-induced sycophancy, highlighting the need for alignment and deployment strategies that account for personality-mediated deceptive tendencies in role-playing systems.

Abstract

Large language models increasingly serve as conversational agents that adopt personas and role-play characters at user request. This capability, while valuable, raises concerns about sycophancy: the tendency to provide responses that validate users rather than prioritize factual accuracy. While prior work has established that sycophancy poses risks to AI safety and alignment, the relationship between specific personality traits of adopted personas and the degree of sycophantic behavior remains unexplored. We present a systematic investigation of how persona agreeableness influences sycophancy across 13 small, open-weight language models ranging from 0.6B to 20B parameters. We develop a benchmark comprising 275 personas evaluated on NEO-IPIP agreeableness subscales and expose each persona to 4,950 sycophancy-eliciting prompts spanning 33 topic categories. Our analysis reveals that 9 of 13 models exhibit statistically significant positive correlations between persona agreeableness and sycophancy rates, with Pearson correlations reaching r = 0.87 and effect sizes as large as Cohen's d = 2.33. These findings demonstrate that agreeableness functions as a reliable predictor of persona-induced sycophancy, with direct implications for the deployment of role-playing AI systems and the development of alignment strategies that account for personality-mediated deceptive behaviors.
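For readers less familiar with the two statistics quoted in the abstract, the following is a minimal sketch of how a Pearson correlation and a Cohen's d could be computed from per-persona scores. It is not the authors' analysis code. The function names, the toy numbers, and the choice to form the two Cohen's d groups by a median split on agreeableness are all assumptions made for illustration; the summary does not say how the paper defines its groups.

```python
import math
from statistics import mean, median, variance


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a), variance(group_b)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


# Hypothetical toy inputs, NOT data from the paper: one agreeableness
# score and one observed sycophancy rate per persona.
agreeableness = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sycophancy_rate = [0.10, 0.20, 0.15, 0.40, 0.35, 0.50]

r = pearson_r(agreeableness, sycophancy_rate)

# Assumed grouping: personas above vs. at-or-below the median agreeableness.
cut = median(agreeableness)
high = [s for a, s in zip(agreeableness, sycophancy_rate) if a > cut]
low = [s for a, s in zip(agreeableness, sycophancy_rate) if a <= cut]
d = cohens_d(high, low)

print(f"Pearson r = {r:.2f}, Cohen's d = {d:.2f}")
```

Read this way, r summarizes how linearly sycophancy rates rise with persona agreeableness for a given model, while d expresses the gap between two groups of personas in units of their pooled standard deviation.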
