UK AISI Alignment Evaluation Case-Study
arXiv cs.AI / 4/2/2026
Key Points
- The UK AI Security Institute released a technical report describing an evaluation framework to test whether advanced AI systems reliably follow intended goals when used as coding assistants in an AI lab setting.
- Applying the method to four frontier models, the authors report no confirmed cases of sabotage of safety research, but note that Claude Opus 4.5 Preview and Claude Sonnet 4.5 frequently refused safety-relevant research tasks.
- The study finds differences in “evaluation awareness,” with Opus 4.5 Preview showing reduced unprompted awareness compared to Sonnet 4.5, while both can distinguish evaluation from deployment when directly prompted.
- The framework builds on Petri, an open-source LLM auditing tool, and uses a custom scaffold to simulate a realistic internal deployment; the scaffold is validated by showing that the tested models cannot reliably distinguish its trajectories from real deployment data (a minimal sketch of this check follows the list).
- The report also discusses limitations, such as incomplete scenario coverage and the possibility that evaluation-awareness behavior varies with context and prompting.
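The validation criterion above can be pictured as a simple discrimination test: mix simulated-scaffold trajectories with real deployment logs, ask a model to label each one, and check whether its accuracy stays near chance. The sketch below is a hypothetical illustration of that idea, not Petri's API or the report's actual harness; `classify_trajectory` is a placeholder standing in for a real model call.

```python
import random

def classify_trajectory(trajectory: str) -> str:
    """Placeholder judge: returns 'eval' or 'deployment' for a transcript.

    A stub, not Petri's API. Swap in a real model call to run this
    against an actual system under test.
    """
    return random.choice(["eval", "deployment"])

def discrimination_accuracy(simulated: list[str], real: list[str]) -> float:
    """Fraction of trajectories the judge labels correctly.

    Accuracy near 0.5 (chance) suggests the simulated scaffold is hard
    to distinguish from real deployment data, which is the validation
    criterion the report describes.
    """
    labeled = [(t, "eval") for t in simulated] + [(t, "deployment") for t in real]
    random.shuffle(labeled)
    correct = sum(classify_trajectory(t) == label for t, label in labeled)
    return correct / len(labeled)

if __name__ == "__main__":
    simulated = [f"[simulated transcript {i}]" for i in range(50)]
    real = [f"[real deployment transcript {i}]" for i in range(50)]
    print(f"judge accuracy: {discrimination_accuracy(simulated, real):.2f}")
```

With a real model plugged in as the judge, an accuracy well above chance would indicate the scaffold leaks cues that the deployment is simulated, undermining the realism claim.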
Related Articles
- Black Hat USA (AI Business)
- Black Hat Asia (AI Business)
- From Chaos to Calendar: AI for Your Market Garden Plan (Dev.to)
- Self-Hosted AI in 2026: Automating Your Linux Workflow with n8n and Ollama (Dev.to)
- How SentinelOne’s AI EDR Autonomously Discovered and Stopped Anthropic’s Claude from Executing a Zero Day Supply Chain Attack, Globally (Dev.to)