Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

arXiv cs.AI · April 30, 2026


Key Points

  • A follow-up arXiv study tests whether Llama-3-8B’s “prompted sandbagging” behaves as positional collapse (not answer avoidance), using cyclic randomisation of option order as a key control.
  • The pre-registered same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold), but the response-position distribution remained extremely stable under complete content rotation (Pearson r = 0.9994).
  • Accuracy shows a strong position effect: it rises to 72.1% when the correct option lands in the model’s preferred position E, and drops to 4.3% at position A.
  • The authors argue that sandbagging instructions induce a “soft distributional attractor”: a low-entropy response-position basin centred on E/F/G that is largely content-invariant at the aggregate level.
  • Qwen-2.5-7B is used as a negative control and does not display a comparable distributional shift, supporting the claim that the effect is specific to the sandbagging mode.
  • The paper suggests that response-position entropy could serve as a useful black-box behavioral signature for detecting this sandbagging regime at the 7–9B parameter scale.
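The entropy signature proposed in the last point can be illustrated with a small sketch. The distributions below are hypothetical, not the paper's measured frequencies; the idea is simply that a sandbagging-style positional attractor shows up as low Shannon entropy over response positions and a large Jensen-Shannon divergence from the honest condition.

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] bits."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical response-position frequencies over MMLU-Pro options A..J
# (illustrative only; a "basin" around E/F/G as the paper describes).
honest     = [0.10] * 10
sandbagged = [0.02, 0.02, 0.03, 0.05, 0.40,
              0.25, 0.15, 0.04, 0.02, 0.02]

print(f"honest entropy:     {entropy(honest):.3f} bits")
print(f"sandbagged entropy: {entropy(sandbagged):.3f} bits")
print(f"JS divergence:      {js_divergence(honest, sandbagged):.3f} bits")
```

A black-box detector in this spirit only needs the model's answer letters across many items: estimate the position distribution, and flag runs whose entropy collapses well below the honest baseline.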

Abstract

A predecessor pilot (Cacioli, 2026) found that Llama-3-8B implements prompted sandbagging as positional collapse rather than answer avoidance. However, fixed option ordering in MMLU-Pro left open whether this reflected a model-level position-dominant policy or dataset-level distractor structure. This pre-registered follow-up (3 models, 2,000 MMLU-Pro items, 4 conditions, 24,000 primary trials) added cyclic option-order randomisation as the critical control. The pre-registered item-level same-letter diagnostic did not confirm deterministic position-tracking (same-letter rate 37.3%, below the 50% threshold). However, pre-specified supporting analyses revealed that the response-position distribution under sandbagging was highly stable under complete content rotation (Pearson r = 0.9994; Jensen-Shannon divergence = 0.027, compared to 0.386 between honest and sandbagging conditions). Accuracy spiked to 72.1% when the correct answer coincidentally occupied the preferred position E, and fell to 4.3% at position A. The data provide strong evidence for a soft distributional attractor: under sandbagging instruction, the model enters a low-entropy response-position basin centred on E/F/G that is highly stable and largely content-invariant at the aggregate level. Qwen-2.5-7B served as a negative control (non-compliant, no distributional shift). These results provide evidence, at the 7-9 billion parameter scale, that response-position entropy is a promising black-box behavioural signature of this sandbagging mode.