SWAY: A Counterfactual Computational Linguistic Approach to Measuring and Mitigating Sycophancy

arXiv cs.CL / 4/6/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces SWAY, an unsupervised computational-linguistic metric that quantifies LLM sycophancy by measuring how much agreement changes under counterfactual positive vs. negative linguistic pressure.
SWAY is designed to separate framing effects from actual content by using a counterfactual prompting mechanism, enabling more rigorous detection of when models are swaying to user stances.
Experiments on six benchmark models show sycophancy rises with epistemic commitment, indicating a systematic relationship between confidence/stance signals and agreement bias.
The authors propose a counterfactual chain-of-thought (CoT) mitigation approach that teaches models to consider what the answer would be under opposite assumptions.
Compared with baseline anti-sycophancy instruction (which can be only moderately effective and sometimes backfires), the counterfactual CoT mitigation reduces sycophancy to near zero across models and commitment levels without reducing responsiveness to genuine evidence.

Abstract

Large language models exhibit sycophancy: the tendency to shift outputs toward user-expressed stances, regardless of correctness or consistency. While prior work has studied this issue and its impacts, rigorous computational linguistic metrics are needed to identify when models are being sycophantic. Here, we introduce SWAY, an unsupervised computational linguistic measure of sycophancy. We develop a counterfactual prompting mechanism to identify how much a model's agreement shifts under positive versus negative linguistic pressure, isolating framing effects from content. Applying this metric to benchmark 6 models, we find that sycophancy increases with epistemic commitment. Leveraging our metric, we introduce a counterfactual mitigation strategy teaching models to consider what the answer would be if opposite assumptions were suggested. While baseline mitigation instructing to be explicitly anti-sycophantic yields moderate reductions, and can backfire, our counterfactual CoT mitigation drives sycophancy to near zero across models, commitment levels, and clause types, while not suppressing responsiveness to genuine evidence. Overall, we contribute a metric for benchmarking sycophancy and a mitigation informed by it.