Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

arXiv cs.CL / 4/29/2026

Key Points

  • The paper investigates whether symptom validity testing (SVT), adapted from clinical malingering detection, can detect AI sandbagging by looking for below-chance performance on forced-choice questions.
  • In a pre-registered pilot using three instruction-tuned small LLMs (7–9B parameters) across four MMLU-Pro domains and four conditions, the “plausibility gate” failed: none of the 12 model-domain cells showed significant below-chance performance under sandbagging instructions (a sketch of the gate’s statistical test follows this list).
  • Exploratory results show distinct failure modes: some models largely ignored the sandbagging instruction, while Llama-3-8B implemented underperformance via a positional heuristic that biased responses toward middle-alphabet options, yielding large accuracy gains when the correct answer matched the preferred position.
  • When models received an explicit anti-task instruction (“pick the least likely answer”), two of three models dropped far below chance (accuracy as low as 0.024), suggesting answer-aware avoidance can exist but is not reliably triggered by “deliberately underperform.”
  • The authors conclude that below-chance accuracy is not a reliable marker for answer-aware avoidance at this scale and propose that positional-distribution shifts may be a more effective behavioral signature for detecting prompted underperformance.
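
For concreteness, here is a minimal sketch of the “plausibility gate” referenced above: a one-sided exact binomial test asking whether observed accuracy falls significantly below chance. It assumes MMLU-Pro’s ten-option format (chance = 0.10) and the paper’s 500 items per cell; the exact test statistic and alpha level the authors used are assumptions here.

```python
from scipy.stats import binomtest

def below_chance(n_correct: int, n_items: int,
                 chance: float = 0.10, alpha: float = 0.05) -> bool:
    """One-sided exact binomial test: is accuracy significantly BELOW chance?"""
    result = binomtest(n_correct, n_items, p=chance, alternative="less")
    return result.pvalue < alpha

# With 500 ten-option items per cell (chance = 0.10):
print(below_chance(45, 500))  # 9.0% accuracy: below chance, but not significantly
print(below_chance(12, 500))  # 2.4% accuracy (cf. the anti-task condition): significant
```

The asymmetry is the point of the gate: accuracy near chance is compatible with honest incapacity, whereas accuracy significantly below 1/k implies the model identified the correct answer and steered away from it.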

Abstract

Detecting sandbagging, the deliberate underperformance on capability evaluations, is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed: zero of 12 model-domain cells showed significant below-chance performance under the sandbagging instruction. Exploratory analyses revealed three qualitatively distinct failure modes. Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, answering identically to the honest baseline on 62-88% of items. Llama-3-8B complied substantially but implemented underperformance as a positional heuristic, collapsing its response distribution onto middle-alphabet options (E at 31.8%, F at 26.1%) regardless of where the correct answer fell. This produced accuracy boosts of up to 33 percentage points when the correct answer coincidentally occupied the model's preferred position. An explicit anti-task instruction ("pick the least likely answer") drove two of three models below chance, with accuracy as low as 0.024. The capability for answer-aware avoidance therefore exists but is not activated by "deliberately underperform." Below-chance performance did not fail as a logical marker of answer-aware avoidance; it simply was not observed in this regime, because the model showing the largest behavioural shift followed a position-dominant response policy rather than content-aware answer avoidance. We propose that positional-distribution shift may be a more effective behavioural signature than below-chance accuracy for detecting prompted underperformance at this model scale.
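
The proposed positional signature can be checked without reference to correctness at all. Below is a minimal sketch, assuming a chi-square test of homogeneity between the honest-baseline and suspect-condition response-letter histograms; the counts are illustrative (loosely echoing the reported collapse onto E and F), not the paper's data, and the paper does not commit to this particular test.

```python
import numpy as np
from scipy.stats import chi2_contingency

def position_shift(honest_counts, suspect_counts):
    """Chi-square test of homogeneity on two response-letter histograms."""
    table = np.array([honest_counts, suspect_counts])
    stat, pvalue, dof, _ = chi2_contingency(table)
    return stat, pvalue

# Hypothetical letter counts (A-J) over 500 responses per condition: a roughly
# uniform honest run vs. a sandbagging run collapsed onto middle-alphabet options.
honest  = [55, 48, 52, 50, 49, 51, 47, 50, 48, 50]
suspect = [20, 15, 25, 30, 159, 130, 45, 30, 26, 20]
stat, p = position_shift(honest, suspect)
print(f"chi2 = {stat:.1f}, p = {p:.2g}")  # a tiny p-value flags the positional collapse
```

Unlike the plausibility gate, this check fires even when the positional collapse accidentally inflates accuracy, which is exactly the failure mode Llama-3-8B exhibited.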