Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
arXiv cs.AI / 2026/3/24
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key points
- The paper argues that LLM public benchmarks often function like “silicon bureaucracy,” relying on the fragile assumption that benchmark scores faithfully measure generalization rather than test-taking competence.
- It proposes an audit framework to assess contamination sensitivity and score confidence by applying systematic deletion, rewriting, and perturbations to benchmark items before evaluation.
- Using a router-worker experimental setup that contrasts clean-control and noisy conditions, the authors find that models can achieve heterogeneous, above-baseline gains even when benchmark items are noised.
- These gains suggest that models may reassemble benchmark-related cues from perturbed items, reactivating memorized (contaminated) content; as a result, two similar scores can carry very different levels of score confidence.
- The paper concludes that benchmarks need not be abandoned, but should be supplemented with explicit contamination and confidence audits to improve evaluation reliability.
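The audit idea in the second bullet can be illustrated with a toy sketch: perturb each benchmark item, then compare accuracy on original vs. perturbed items. A large gap suggests the score depends on surface form, a possible sign of contamination. This is a minimal illustration under assumed names (`perturb`, `contamination_sensitivity`, and the toy models are all hypothetical), not the authors' actual framework.

```python
import random

def perturb(item: str, rng: random.Random) -> str:
    # Crude stand-in for the paper's deletion/rewriting perturbations:
    # swap one adjacent pair of words, changing surface form but not content.
    words = item.split()
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def contamination_sensitivity(items, answers, model, rng=None):
    # Accuracy gap between clean and perturbed items.
    # A large positive gap means the model's score is sensitive to
    # surface form, which is consistent with contamination.
    rng = rng or random.Random(0)
    clean = sum(model(q) == a for q, a in zip(items, answers)) / len(items)
    noisy = sum(model(perturb(q, rng)) == a
                for q, a in zip(items, answers)) / len(items)
    return clean - noisy

# Toy demo: a "memorizer" that only recognizes exact benchmark strings
# collapses under perturbation; a keyword-based "robust" model does not.
items = ["what is two plus two", "capital of france is what"]
answers = ["4", "paris"]
memorizer = {q: a for q, a in zip(items, answers)}.get
robust = lambda q: "4" if "two" in q else "paris"

print(contamination_sensitivity(items, answers, memorizer))  # 1.0 (fragile)
print(contamination_sensitivity(items, answers, robust))     # 0.0 (robust)
```

The point of the sketch is that both toy models score 100% on the clean items, yet only one score survives perturbation — the "similar scores, different confidence" phenomenon the summary describes.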

