Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

arXiv cs.CL / 5/1/2026


Key Points

  • The study investigates whether instruction-tuned LLMs answer multiple-choice questions by engaging with question content or by relying on positional “shortcuts” when adversarially instructed to underperform.
  • Using a six-condition instruction-specificity gradient on 2,000 MMLU-Pro items with Llama-3-8B and Llama-3.1-8B, the authors find that behavior falls into three distinct regimes rather than changing smoothly and monotonically.
  • Vague adversarial instructions mainly reduce accuracy while preserving meaningful content engagement, whereas standard sandbagging and capability-imitation instructions drive a collapse in response-position entropy.
  • The most extreme case is a two-step, answer-aware avoidance instruction, which causes near-total concentration on a single response position and eliminates measurable sensitivity to question content.
  • The findings replicate across both models and four academic domains, and the paper notes that distribution-based (entropy) and content-based (difficulty–accuracy correlation) measures can partially diverge, suggesting distinct dimensions of “validity” in adversarial compliance.
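The distributional screening signal referenced above, response-position entropy, can be sketched in a few lines. This is a minimal illustration with toy data, not the paper's pipeline: a uniform spread over the answer positions maximizes entropy, while positional collapse onto one option drives it toward zero.

```python
from collections import Counter
from math import log2

def position_entropy(responses):
    """Shannon entropy (bits) of the answer-position distribution.

    Uniform use of k positions gives log2(k) bits; collapse onto a
    single position gives 0 bits.
    """
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy runs (hypothetical, not the paper's data):
collapsed = ["C"] * 99 + ["A"]        # near-total concentration on "C"
spread = ["A", "B", "C", "D"] * 25    # uniform over four positions

print(round(position_entropy(collapsed), 3))  # close to 0 bits
print(round(position_entropy(spread), 3))     # 2.0 bits (= log2 4)
```

Under this measure, the paper's reported 99.9% concentration on a single position corresponds to an entropy very near zero, which is what "positional collapse" means operationally.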

Abstract

When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two instruction-tuned LLMs (Llama-3-8B and Llama-3.1-8B) on 2,000 MMLU-Pro items. Distributional screening (response-position entropy) and an independent content-engagement criterion (difficulty-accuracy correlation) jointly characterise each condition. The gradient reveals three regimes rather than a monotonic transition. Vague adversarial instructions produce moderate accuracy reduction with preserved content engagement. Standard sandbagging and capability-imitation instructions produce positional entropy collapse with partial content engagement. A two-step answer-aware avoidance instruction produces extreme positional collapse, with near-total concentration on a single response position (99.9% and 87.4%) and no measurable content sensitivity. This was the only multi-step instruction tested, and it produced the most extreme shortcut. The attractor position matches each model's content-absent null-prompt default. The effect replicates across both models and four academic domains. Distributional collapse and content engagement can co-occur (50% concordance between screening criteria), indicating that entropy-based screening and difficulty-based content assessment capture partially independent dimensions of response validity. Results suggest that instruction complexity can determine whether adversarial compliance uses content-aware or content-blind mechanisms in small instruction-tuned LLMs under greedy decoding.