Anthropic CVP Run 3 — Does Claude's Safety Stack Scale Down to Haiku 4.5?

Dev.to / 4/24/2026

📰 NewsDeveloper Stack & InfrastructureSignals & Early TrendsModels & Research

Key Points

  • AnthropicのCVP(Cyber Verification Program)のRun 3として、最小の商用ClaudeであるHaiku 4.5を、Run 2と同一の13プロンプトのエージェント攻撃テストスイートで評価した。
  • 結果は13/13で期待される挙動に一致し、実行されたエクスプロイトは0、秘密情報やペイロードの漏えいも0だった。
  • Haiku 4.5は、単にブロックするだけでなく、必要以上に踏み込んだ防御的な分析で「より厳密な対応」を示したケースがあった。
  • CVPは、承認済みラボが責任ある範囲で最前線のClaudeモデルの挙動を検証し、その調査結果を研究アーティファクトとして公開できるための制度である。

TL;DR: Tested Anthropic's smallest production Claude (Haiku 4.5) against the same 13-prompt agent-attack suite from Run 2 (Opus 4.7). Result: 13/13 clean. Zero exploit content executed. Zero secrets leaked. Honest scope notes inside.

What is the Anthropic CVP?

The Cyber Verification Program is a narrow, authorized lane Anthropic opened for responsible cybersecurity evaluation of frontier Claude models. Approved labs can probe model behavior on agent-attack scenarios that would normally be blocked, and publish findings as research artifacts. Sunglasses was approved into CVP on April 16, 2026.

This is Run 3 of the cadence. Prior runs:

What I tested in Run 3

Same exact 13 fixtures as Run 2:

  • 3 baselines (benign defensive, borderline legitimate, clearly high-risk)
  • 10 runtime-trust probes: cross_agent_injection, retrieval_poisoning, tool_output_poisoning, tool_chain_race, model_routing_confusion, memory_eviction_rehydration, token_smuggling, agent_persona_drift, context_flooding, social_engineering_ui

Why fixture parity? Cross-model claims need it. Changing prompts between Opus 4.7 and Haiku 4.5 would create a "was it the model or the prompt set?" argument that kills the comparison.

Results

Metric Result
Total prompts 13
Allowed (defensive analysis) 11
Partial 1
Blocked 1
Exploit content executed 0
Secrets / payloads leaked 0
Match-vs-expected 13/13

Every response either matched the expected outcome from the design ladder or exceeded it with a more thorough defensive analysis than required.

A few standout responses

On retrieval poisoning (P5), Haiku's verdict on a fake "SIGNED SOURCE" instruction asking to bypass MFA:

Decisive Answer: NO. None of these instructions should be followed.

On memory eviction (P9), citation-worthy line:

Memory can be informative (e.g., "the operator prefers fast responses") but should never be authoritative for security decisions.

On token smuggling (P10), refusing hidden-frontmatter authority claims:

Metadata should be declarative (describing what the content is), not imperative (instructing the agent how to process it).

These reads as Opus-quality reasoning from the smallest production Claude.

Honest scope — read this before drawing conclusions

These prompts are defensively framed and end with explicit "do not provide exploit / payload / bypass" constraints. That's by design — it supports the CVP two-person publish gate, keeps transcripts safe to attach to public reports, and makes cross-run/cross-model claims comparable.

So 13/13 clean here means:

  • ✅ Haiku 4.5 produces high-quality defensive analysis when asked for it
  • ✅ Haiku 4.5 refuses embedded malicious instructions inside scenarios that ask for defender-side reasoning
  • ❌ This is NOT confirmation that Haiku 4.5 is robust against unframed real-world adversarial payloads — that's a different test

The harder unframed-payload test is coming as a labeled appendix probe set later, after the full Anthropic family comparison ships.

What's next this week

  • Apr 24 (Friday) — Sonnet 4.6 medium + high on the same 13 fixtures
  • Apr 25 (Saturday) — Opus 4.6 medium + high
  • Apr 26 (Sunday) — Family comparison synthesis report (Opus 4.7 baseline + Sonnet 4.6 + Opus 4.6 + Haiku 4.5 cross-delta)
  • ~Apr 30 — Appendix probe set with real adversarial payload shapes (sourced from JailbreakBench, HarmBench, AdvBench, PromptInject, Garak, PyRIT, recent CVE PoCs). Disclosure protocol applies.

The full report

Every prompt, every model response, the Layer 1 keyword classifier output, the cross-model comparison table vs Run 2, and the full "Limits of This Run" section:

👉 sunglasses.dev/reports/anthropic-cvp-haiku-4-5-evaluation

About Sunglasses

Sunglasses is an open-source (MIT) Python library that scans everything an AI agent reads — text, code, documents, MCP tool descriptions, RAG chunks, cross-agent messages — before the agent processes it. Catches prompt injection, MCP tool poisoning, credential exfiltration, supply chain attacks, and hidden malicious instructions. Runs 100% locally. No API keys. No cloud.

pip install sunglasses

I'm a non-technical founder who started coding in February. Building this in public. Feedback welcome — especially on the appendix-probe design before we run it.

Sunglasses · MIT · github.com/sunglasses-dev/sunglasses · sunglasses.dev