Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
arXiv cs.CL / 5/1/2026
Key Points
- The study investigates whether instruction-tuned LLMs answer multiple-choice questions based on the question content or instead rely on positional “shortcuts” when adversarially instructed to underperform.
- Using a six-condition instruction-specificity gradient on 2,000 MMLU-Pro items with Llama-3-8B and Llama-3.1-8B, the authors find that behavior shifts across three distinct regimes rather than changing smoothly and monotonically.
- Vague adversarial instructions mainly reduce accuracy while still showing meaningful content engagement, whereas standard sandbagging and capability-imitation instructions drive a collapse in response-position entropy.
- The most extreme case comes from a two-step, answer-aware avoidance instruction, which causes near-total concentration on a single response position and eliminates measurable sensitivity to the content.
- The findings replicate across both models and four academic domains, and the paper notes that distribution-based (entropy) and content-based (difficulty–accuracy correlation) measures can partially diverge, suggesting distinct dimensions of “validity” in adversarial compliance.
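To make the two kinds of measures concrete, here is a minimal sketch of how response-position entropy (distribution-based) and a difficulty–accuracy correlation (content-based) could be computed from per-item evaluation records. The record format, field order, and example values are hypothetical illustrations, not taken from the paper.

```python
import math
from collections import Counter

def position_entropy(choices):
    """Shannon entropy (in bits) of the distribution over chosen answer positions.

    `choices` is a list of selected option letters, e.g. ["A", "C", "B", ...].
    Low entropy indicates collapse onto one or a few positions.
    """
    counts = Counter(choices)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def difficulty_accuracy_correlation(difficulties, correct_flags):
    """Pearson correlation between item difficulty and per-item correctness (0/1).

    A model still engaging with content should be less accurate on harder items
    (negative correlation); a positionally collapsed model shows no systematic
    relationship between difficulty and correctness.
    """
    n = len(difficulties)
    mean_d = sum(difficulties) / n
    mean_c = sum(correct_flags) / n
    cov = sum((d - mean_d) * (c - mean_c) for d, c in zip(difficulties, correct_flags))
    var_d = sum((d - mean_d) ** 2 for d in difficulties)
    var_c = sum((c - mean_c) ** 2 for c in correct_flags)
    return cov / math.sqrt(var_d * var_c)

# Hypothetical records: (chosen position, item difficulty, answered correctly)
records = [("A", 0.2, 1), ("C", 0.5, 1), ("B", 0.8, 0), ("A", 0.9, 0), ("D", 0.4, 1)]
choices = [r[0] for r in records]
print(position_entropy(choices))  # ~1.92 bits for this spread over four positions
print(difficulty_accuracy_correlation([r[1] for r in records], [r[2] for r in records]))  # negative here
```

Under this framing, an answer-aware avoidance condition would show entropy near zero (one position dominates) and a difficulty–accuracy correlation near zero, which is how the two measures can diverge from simple accuracy alone.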