Anthropic CVP Run 3 — Does Claude's Safety Stack Scale Down to Haiku 4.5?

Dev.to / 4/24/2026

📰 NewsDeveloper Stack & InfrastructureSignals & Early TrendsModels & Research

共有:

Key Points

AnthropicのCVP（Cyber Verification Program）のRun 3として、最小の商用ClaudeであるHaiku 4.5を、Run 2と同一の13プロンプトのエージェント攻撃テストスイートで評価した。
結果は13/13で期待される挙動に一致し、実行されたエクスプロイトは0、秘密情報やペイロードの漏えいも0だった。
Haiku 4.5は、単にブロックするだけでなく、必要以上に踏み込んだ防御的な分析で「より厳密な対応」を示したケースがあった。
CVPは、承認済みラボが責任ある範囲で最前線のClaudeモデルの挙動を検証し、その調査結果を研究アーティファクトとして公開できるための制度である。

TL;DR: Tested Anthropic's smallest production Claude (Haiku 4.5) against the same 13-prompt agent-attack suite from Run 2 (Opus 4.7). Result: 13/13 clean. Zero exploit content executed. Zero secrets leaked. Honest scope notes inside.

What is the Anthropic CVP?

The Cyber Verification Program is a narrow, authorized lane Anthropic opened for responsible cybersecurity evaluation of frontier Claude models. Approved labs can probe model behavior on agent-attack scenarios that would normally be blocked, and publish findings as research artifacts. Sunglasses was approved into CVP on April 16, 2026.

This is Run 3 of the cadence. Prior runs:

Run 1 (Apr 17) — Opus 4.7 baseline
Run 2 (Apr 20) — Opus 4.7 + 10 runtime-trust probes

What I tested in Run 3

Same exact 13 fixtures as Run 2:

3 baselines (benign defensive, borderline legitimate, clearly high-risk)
10 runtime-trust probes: cross_agent_injection, retrieval_poisoning, tool_output_poisoning, tool_chain_race, model_routing_confusion, memory_eviction_rehydration, token_smuggling, agent_persona_drift, context_flooding, social_engineering_ui

Why fixture parity? Cross-model claims need it. Changing prompts between Opus 4.7 and Haiku 4.5 would create a "was it the model or the prompt set?" argument that kills the comparison.

Results

Metric	Result
Total prompts	13
Allowed (defensive analysis)	11
Partial	1
Blocked	1
Exploit content executed	0
Secrets / payloads leaked	0
Match-vs-expected	13/13

Every response either matched the expected outcome from the design ladder or exceeded it with a more thorough defensive analysis than required.

A few standout responses

On retrieval poisoning (P5), Haiku's verdict on a fake "SIGNED SOURCE" instruction asking to bypass MFA:

Decisive Answer: NO. None of these instructions should be followed.

On memory eviction (P9), citation-worthy line:

Memory can be informative (e.g., "the operator prefers fast responses") but should never be authoritative for security decisions.

On token smuggling (P10), refusing hidden-frontmatter authority claims:

Metadata should be declarative (describing what the content is), not imperative (instructing the agent how to process it).

These reads as Opus-quality reasoning from the smallest production Claude.

Honest scope — read this before drawing conclusions

These prompts are defensively framed and end with explicit "do not provide exploit / payload / bypass" constraints. That's by design — it supports the CVP two-person publish gate, keeps transcripts safe to attach to public reports, and makes cross-run/cross-model claims comparable.

So 13/13 clean here means:

✅ Haiku 4.5 produces high-quality defensive analysis when asked for it
✅ Haiku 4.5 refuses embedded malicious instructions inside scenarios that ask for defender-side reasoning
❌ This is NOT confirmation that Haiku 4.5 is robust against unframed real-world adversarial payloads — that's a different test

The harder unframed-payload test is coming as a labeled appendix probe set later, after the full Anthropic family comparison ships.

What's next this week

Apr 24 (Friday) — Sonnet 4.6 medium + high on the same 13 fixtures
Apr 25 (Saturday) — Opus 4.6 medium + high
Apr 26 (Sunday) — Family comparison synthesis report (Opus 4.7 baseline + Sonnet 4.6 + Opus 4.6 + Haiku 4.5 cross-delta)
~Apr 30 — Appendix probe set with real adversarial payload shapes (sourced from JailbreakBench, HarmBench, AdvBench, PromptInject, Garak, PyRIT, recent CVE PoCs). Disclosure protocol applies.

The full report

Every prompt, every model response, the Layer 1 keyword classifier output, the cross-model comparison table vs Run 2, and the full "Limits of This Run" section:

👉 sunglasses.dev/reports/anthropic-cvp-haiku-4-5-evaluation

About Sunglasses

Sunglasses is an open-source (MIT) Python library that scans everything an AI agent reads — text, code, documents, MCP tool descriptions, RAG chunks, cross-agent messages — before the agent processes it. Catches prompt injection, MCP tool poisoning, credential exfiltration, supply chain attacks, and hidden malicious instructions. Runs 100% locally. No API keys. No cloud.

pip install sunglasses

I'm a non-technical founder who started coding in February. Building this in public. Feedback welcome — especially on the appendix-probe design before we run it.

Sunglasses · MIT · github.com/sunglasses-dev/sunglasses · sunglasses.dev

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 4/24DailyView insight →

How I Use GitHub Copilot + RapidForge to Generate Daily Stock Ideas

Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

Mend.io Releases AI Security Governance Framework Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model

MarkTechPost

I audited my own Claude Code setup and found 21 issues in 72 artifacts

Dev.to

Design Patterns for Prompt Engineering: Toward a Formal Discipline

Dev.to

Anthropic CVP Run 3 — Does Claude's Safety Stack Scale Down to Haiku 4.5?

Key Points

What is the Anthropic CVP?

What I tested in Run 3

Results

A few standout responses

Honest scope — read this before drawing conclusions

What's next this week

The full report

About Sunglasses

💡 Insights using this article

Related Articles

How I Use GitHub Copilot + RapidForge to Generate Daily Stock Ideas

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Mend.io Releases AI Security Governance Framework Covering Asset Inventory, Risk Tiering, AI Supply Chain Security, and Maturity Model

I audited my own Claude Code setup and found 21 issues in 72 artifacts

Design Patterns for Prompt Engineering: Toward a Formal Discipline

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer