Is Evaluation Awareness Just Format Sensitivity? Limitations of Probe-Based Evidence under Controlled Prompt Structure
arXiv cs.AI / 3/23/2026
📰 NewsIdeas & Deep AnalysisModels & Research
Key Points
- The paper tests whether probe-based signals of evaluation awareness in large language models are confounded by prompt/benchmark structure by using a controlled 2x2 dataset and diagnostic rewrites.
- It finds that probes primarily track benchmark-canonical structure and do not generalize to free-form prompts independent of linguistic style.
- Consequently, standard probe-based methodologies do not reliably disentangle evaluation context from surface artifacts, limiting the evidential strength of existing results.
- The work implies a need for more robust evaluation methods that separate context from prompt structure when assessing evaluation awareness in LLMs.
Related Articles
Is AI becoming a bubble, and could it end like the dot-com crash?
Reddit r/artificial

Externalizing State
Dev.to

I made a 'benchmark' where LLMs write code controlling units in a 1v1 RTS game.
Dev.to

My AI Does Not Have a Clock
Dev.to
How to settle on a coding LLM ? What parameters to watch out for ?
Reddit r/LocalLLaMA