UK AISI Alignment Evaluation Case-Study

arXiv cs.AI / 4/2/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisTools & Practical UsageModels & Research

共有:

Key Points

The UK AI Security Institute released a technical report describing an evaluation framework to test whether advanced AI systems reliably follow intended goals when used as coding assistants in an AI lab setting.
Using the method on four frontier models, the authors report no confirmed cases of sabotaging safety research, but note frequent refusals by Claude Opus 4.5 Preview and Sonnet 4.5 for safety-relevant research tasks.
The study finds differences in “evaluation awareness,” with Opus 4.5 Preview showing reduced unprompted awareness compared to Sonnet 4.5, while both can distinguish evaluation from deployment when directly prompted.
The framework builds on the open-source LLM auditing tool Petri and uses a custom scaffold to simulate realistic internal deployment; the scaffold is validated by showing trajectories that the tested models fail to reliably differentiate from real deployment data.
The report also discusses limitations such as incomplete scenario coverage and how evaluation-awareness behaviors may vary by context and prompting.

Abstract

This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.