Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

arXiv cs.AI · April 30, 2026


Key Points

  • A follow-up arXiv study tests whether Llama-3-8B’s “prompted sandbagging” behaves as positional collapse (not answer avoidance), using cyclic randomisation of option order as a key control.
  • The pre-registered same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold), but the response-position distribution remained extremely stable under complete content rotation (Pearson r = 0.9994).
  • Accuracy shows a strong position effect: it rises to 72.1% when the correct option lands in the model’s preferred position E, and drops to 4.3% at position A.
  • The authors argue that sandbagging instructions induce a “soft distributional attractor”: a low-entropy response-position basin centred on E/F/G that is largely content-invariant at the aggregate level.
  • Qwen-2.5-7B is used as a negative control and does not display a comparable distributional shift, supporting the claim that the effect is specific to the sandbagging mode.
  • The paper suggests that response-position entropy could serve as a useful black-box behavioral signature for detecting this sandbagging regime at the 7–9B parameter scale.
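The entropy signature proposed in the last point can be illustrated with a small sketch. The distributions below are hypothetical, not the paper's measured frequencies; the idea is simply that a sandbagging-style positional attractor shows up as low Shannon entropy over response positions and a large Jensen-Shannon divergence from the honest condition.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] bits."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical response-position frequencies over MMLU-Pro options A..J
# (illustrative only; a "basin" around E/F/G as the paper describes).
honest     = [0.10] * 10
sandbagged = [0.02, 0.02, 0.03, 0.05, 0.40,
              0.25, 0.15, 0.04, 0.02, 0.02]

print(f"honest entropy:     {entropy(honest):.3f} bits")
print(f"sandbagged entropy: {entropy(sandbagged):.3f} bits")
print(f"JS divergence:      {js_divergence(honest, sandbagged):.3f} bits")
```

A black-box detector in this spirit only needs the model's answer letters across many items: estimate the position distribution, and flag runs whose entropy collapses well below the honest baseline.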

Abstract

A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low-entropy response-position basin centred on E/F/G that is highly stable and largely content-invariant at the aggregate level. Qwen-2.5-7B served as a negative control (non-compliant, no distributional shift). These results provide evidence, at the 7-9 billion parameter scale, that response-position entropy is a promising black-box behavioural signature of this sandbagging mode.