When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

arXiv cs.LG / 5/6/2026


Key Points

  • The paper finds that guard models fine-tuned on fully benign data can completely lose safety alignment, not from adversarial attacks but from ordinary domain specialization.
  • Across three safety classifiers used as protection layers in agentic AI pipelines (LlamaGuard, WildGuard, and Granite Guardian), the failure is traced to the collapse of "latent safety geometry": the representational boundary that separates harmful from benign inputs.
  • In the worst case (Granite Guardian), refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous, with the authors attributing this to brittle, overly concentrated safety representations.
  • The authors propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), which adds a training-time penalty based on Fisher-information-weighted, curvature-aware safety subspaces and an adaptive scaling factor to resolve task–safety gradient conflicts.
  • Geometry-based monitoring is emphasized: structural representation metrics (CKA, Fisher score) predict safety behavior more reliably than raw displacement measures, making them necessary for evaluating guard models in agentic deployments.
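
The FW-SSR penalty from the fourth point can be sketched in a few lines. Everything below is an assumption reconstructed from the summary's description, not the paper's exact formulation: a diagonal Fisher estimate weights the per-direction drift of the safety subspace from its pre-fine-tuning anchor, and the adaptive λ_t grows when the task gradient points against the safety gradient.

```python
import numpy as np

def fw_ssr_penalty(U_t, U_0, fisher_diag):
    """Fisher-weighted drift of the current safety subspace U_t (columns =
    safety directions) from its pre-fine-tuning anchor U_0. Directions with
    high Fisher information (high curvature) are penalized more for moving."""
    drift = U_t - U_0                              # (d, k) displacement
    return float(np.sum(fisher_diag[:, None] * drift ** 2))

def adaptive_lambda(g_task, g_safety, base=1.0):
    """Scale the penalty weight with task-safety gradient conflict:
    stronger regularization when the gradients point in opposing
    directions (negative cosine similarity), baseline when they agree."""
    cos = g_task @ g_safety / (
        np.linalg.norm(g_task) * np.linalg.norm(g_safety) + 1e-8
    )
    return base * (1.0 + max(0.0, -cos))
```

On this sketch, the total training loss would be `task_loss + adaptive_lambda(...) * fw_ssr_penalty(...)`, so the anchor is enforced hardest exactly when benign specialization is pulling the safety directions apart.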

Abstract

A guard model fine-tuned on entirely benign data can lose all safety alignment -- not through adversarial manipulation, but through standard domain specialization. We demonstrate this failure across three purpose-built safety classifiers -- LlamaGuard, WildGuard, and Granite Guardian -- deployed as protection layers in agentic AI pipelines, and show that it originates in the destruction of latent safety geometry: the structured harmful–benign representational boundary that guides classification. We extract per-layer safety subspaces via SVD on class-conditional activation differences and track how this boundary evolves under benign fine-tuning. Granite Guardian undergoes complete collapse -- refusal rate drops from 85% to 0%, CKA falls to zero, and 100% of outputs become ambiguous -- a severity exceeding prior findings on general-purpose LLMs, explained by the specialization hypothesis: concentrated safety representations are efficient but catastrophically brittle. To mitigate this, we propose Fisher-Weighted Safety Subspace Regularization (FW-SSR), a training-time penalty combining (i) curvature-aware direction weights derived from diagonal Fisher information and (ii) an adaptive λ_t that scales with task-safety gradient conflict. FW-SSR recovers 75% refusal on Granite Guardian (CKA = 0.983) and reduces WildGuard's Attack Success Rate to 3.6% -- below the unmodified baseline -- by actively sharpening the safety subspace rather than merely anchoring it. Across all three models, structural representational geometry (CKA, Fisher score) predicts safety behavior more reliably than absolute displacement metrics, establishing geometry-based monitoring as a necessary component of guard model evaluation in agentic deployments.
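
The abstract's diagnostic pipeline -- per-layer safety subspaces from SVD on class-conditional activation differences, tracked with CKA -- can be sketched as below. This is a minimal reading of the abstract, not the paper's implementation: it assumes paired harmful/benign activations at a given layer and the standard linear variant of CKA; the paper may use a different pairing scheme or CKA estimator.

```python
import numpy as np

def safety_subspace(acts_harmful, acts_benign, k=4):
    """Top-k safety directions at one layer from the SVD of
    class-conditional activation differences (rows = paired examples,
    columns = hidden dimensions)."""
    diffs = acts_harmful - acts_benign             # (n, d) difference matrix
    # right singular vectors span the harmful-vs-benign boundary directions
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:k].T                                # (d, k) orthonormal basis

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n, d):
    1.0 means the representational geometry is preserved under
    fine-tuning, 0.0 means it has been destroyed."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (
        np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    )
```

Comparing `linear_cka(acts_before, acts_after)` across checkpoints is the kind of structural monitor the abstract argues for: it reacts to boundary collapse even when raw activation displacement looks small.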