Measuring Opinion Bias and Sycophancy via LLM-based Coercion

arXiv cs.CL / 4/24/2026

📰 NewsSignals & Early TrendsTools & Practical UsageModels & Research

共有:

Key Points

The paper introduces an LLM-based method to measure real “opinion bias” and “sycophancy” by eliciting what a model actually holds during realistic multi-turn interactions on contested topics.
It releases an open-source benchmark (llm-bias-bench) that uses two complementary probes: direct questioning with escalating pressure and indirect argumentative debate that reveals bias through concession, resistance, or counter-argument.
The approach uses three user personas (neutral/agree/disagree) to produce a nine-way behavioral classification that distinguishes persona-independent stances from persona-dependent sycophantic behavior, with an auditable LLM judge providing verdicts plus textual evidence.
An initial version covering 38 Brazilian Portuguese topics across values, scientific consensus, philosophy, and economic policy finds that argumentative debate triggers sycophancy 2–3x more than direct questioning, and models may mirror users under sustained argument even if they seemed opinionated when asked directly.
The results also suggest that “attacker” strength matters most when an existing opinion must be displaced, rather than when the assistant begins from neutrality.

Abstract

Large language models increasingly shape the information people consume: they are embedded in search, consulted for professional advice, deployed as agents, and used as a first stop for questions about policy, ethics, health, and politics. When such a model silently holds a position on a contested topic, that position propagates at scale into users' decisions. Eliciting a model's positions is harder than it first appears: contemporary assistants answer direct opinion questions with evasive disclaimers, and the same model may concede the opposite position once the user starts arguing one side. We propose a method, released as the open-source llm-bias-bench, for discovering the opinions an LLM actually holds on contested topics under conditions that resemble real multi-turn interaction. The method pairs two complementary free-form probes. Direct probing asks for the model's opinion across five turns of escalating pressure from a simulated user. Indirect probing never asks for an opinion and engages the model in argumentative debate, letting bias leak through how it concedes, resists, or counter-argues. Three user personas (neutral, agree, disagree) collapse into a nine-way behavioral classification that separates persona-independent positions from persona-dependent sycophancy, and an auditable LLM judge produces verdicts with textual evidence. The first instantiation ships 38 topics in Brazilian Portuguese across values, scientific consensus, philosophy, and economic policy. Applied to 13 assistants, the method surfaces findings of practical interest: argumentative debate triggers sycophancy 2-3x more than direct questioning (median 50% to 79%); models that look opinionated under direct questioning often collapse into mirroring under sustained arguments; and attacker capability matters mainly when an existing opinion must be dislodged, not when the assistant starts neutral.