Is my model perplexed for the right reason? Contrasting LLMs' Benchmark Behavior with Token-Level Perplexity
arXiv cs.CL / 4/1/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that standard LLM benchmark scores don’t reveal whether models’ correct answers come from the intended underlying linguistic mechanisms, risking confirmation bias.
- It proposes an interpretability framework using token-level perplexity distributions over minimal sentence pairs that differ by a few “pivotal” tokens.
- The approach is designed to support hypothesis-driven analysis while avoiding unstable feature-attribution methods.
- Experiments on controlled linguistic benchmarks with multiple open-weight LLMs find that the linguistically pivotal tokens do influence model behavior, but they do not fully account for the observed perplexity shifts.
- The results suggest LLMs rely on additional heuristics beyond the expected linguistic cues, motivating further investigation into hidden factors driving benchmark performance.
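The core measurement behind these points can be sketched with a minimal-pair surprisal comparison. The sketch below substitutes a tiny add-one-smoothed bigram model for the open-weight LLMs the paper actually evaluates (the corpus, sentences, and helper names here are illustrative assumptions, not the paper's setup); the idea is the same: score each token's negative log-probability in two sentences that differ only at a pivotal position, then contrast the distributions.

```python
import math
from collections import Counter

# Toy corpus and bigram LM: an illustrative stand-in for an LLM's
# token-level probabilities, not the paper's actual models.
corpus = ("the cat is on the mat . the cats are on the mat . "
          "the dog is here .").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def token_nll(prev, tok):
    # Add-one smoothed bigram negative log-probability (token surprisal).
    return -math.log((bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab))

def per_token_surprisal(sentence):
    toks = sentence.split()
    # Surprisal for every token given its predecessor.
    return [(tok, token_nll(prev, tok)) for prev, tok in zip(toks, toks[1:])]

# Minimal pair differing at the pivotal verb (subject-verb agreement).
good = per_token_surprisal("the cat is on the mat")
bad = per_token_surprisal("the cat are on the mat")

# Contrast surprisal token by token; if the model relied only on the
# intended cue, the shift would concentrate at the pivotal position.
for (tok_g, s_g), (tok_b, s_b) in zip(good, bad):
    marker = "<- pivotal" if tok_g != tok_b else ""
    print(f"{tok_g:>4}/{tok_b:<4} {s_g:6.3f} {s_b:6.3f} {marker}")
```

Even in this toy setting the surprisal difference is not confined to the pivotal token: the token after the verb also shifts because its conditioning context changed, which mirrors the paper's observation that pivotal tokens alone do not fully explain sentence-level perplexity differences.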