CivicShield: A Cross-Domain Defense-in-Depth Framework for Securing Government-Facing AI Chatbots Against Multi-Turn Adversarial Attacks

arXiv cs.AI / 4/1/2026

Opinion | Ideas & Deep Analysis | Models & Research

Key Points

The paper identifies major security shortcomings in government-facing LLM chatbots, noting that multi-turn adversarial attacks can exceed a 90% success rate and that single-layer guardrails are bypassed at similar rates.
It proposes “CivicShield,” a defense-in-depth framework that combines seven layers spanning zero-trust capability access control, input validation, semantic intent filtering, conversation state machine invariants, anomaly detection, multi-model consensus, and graduated human escalation.
The authors develop a formal threat model covering eight multi-turn attack families and map CivicShield to NIST SP 800-53 controls across 14 control families to support government compliance needs.
Evaluation across 1,436 simulated scenarios using benchmarks such as HarmBench, JailbreakBench, and XSTest reports 72.9% combined detection with a 2.9% effective false positive rate, while preserving 100% detection for crescendo and slow-drift multi-turn attacks.
Detection is lower on independent benchmarks than on author-generated scenarios (71.2% vs 76.7% on HarmBench, 47.0% vs 70.0% on JailbreakBench), which underscores the need for independently validated evaluation before practical deployment.
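
As a rough illustration of the layered design summarized above, the sketch below chains a few checks and applies a graduated response: a confident detection blocks immediately, while a weaker signal escalates to a human reviewer. It is a minimal sketch and not the paper's implementation. The `Verdict` levels, the three toy layers (`input_validation`, `intent_filter`, `drift_monitor`), and their thresholds and keywords are all hypothetical stand-ins for the real classifiers and state machines the paper describes.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable, List


class Verdict(IntEnum):
    ALLOW = 0     # layer saw nothing suspicious
    ESCALATE = 1  # uncertain signal: hand off to a human reviewer
    BLOCK = 2     # confident detection: refuse the request


@dataclass
class Conversation:
    turns: List[str] = field(default_factory=list)


# Each layer sees the whole conversation so far, not only the latest turn,
# so that multi-turn patterns can be detected.
Layer = Callable[[Conversation], Verdict]


def input_validation(conv: Conversation) -> Verdict:
    # Hypothetical perimeter check: reject oversized inputs.
    return Verdict.BLOCK if len(conv.turns[-1]) > 2000 else Verdict.ALLOW


def intent_filter(conv: Conversation) -> Verdict:
    # Hypothetical stand-in for an intent classifier: a keyword match.
    text = conv.turns[-1].lower()
    if "ignore previous instructions" in text:
        return Verdict.BLOCK
    return Verdict.ALLOW


def drift_monitor(conv: Conversation) -> Verdict:
    # Hypothetical multi-turn check: escalate once three or more turns
    # have touched a sensitive topic.
    sensitive = sum("password" in turn.lower() for turn in conv.turns)
    return Verdict.ESCALATE if sensitive >= 3 else Verdict.ALLOW


def evaluate(conv: Conversation, layers: List[Layer]) -> Verdict:
    """Run layers in order; BLOCK short-circuits, otherwise keep the
    most severe verdict seen."""
    worst = Verdict.ALLOW
    for layer in layers:
        verdict = layer(conv)
        if verdict is Verdict.BLOCK:
            return verdict
        worst = max(worst, verdict)
    return worst


LAYERS: List[Layer] = [input_validation, intent_filter, drift_monitor]
```

Calling `evaluate(Conversation(["Where do I renew my license?"]), LAYERS)` returns `Verdict.ALLOW` under these toy rules. Because no single turn has to look malicious for `drift_monitor` to fire, the same structure shows why conversation-level layers matter for slow-drift attacks.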

Abstract

LLM-based chatbots in government services face critical security gaps. Multi-turn adversarial attacks achieve over 90% success against current defenses, and single-layer guardrails are bypassed at similar rates. We present CivicShield, a cross-domain defense-in-depth framework for government-facing AI chatbots. Drawing on network security, formal verification, biological immune systems, aviation safety, and zero-trust cryptography, CivicShield introduces seven defense layers: (1) zero-trust foundation with capability-based access control, (2) perimeter input validation, (3) semantic firewall with intent classification, (4) conversation state machine with safety invariants, (5) behavioral anomaly detection, (6) multi-model consensus verification, and (7) graduated human-in-the-loop escalation. We present a formal threat model covering 8 multi-turn attack families, map the framework to NIST SP 800-53 controls across 14 families, and evaluate the framework using ablation analysis. Theoretical analysis shows that layered defenses reduce attack success probability by one to two orders of magnitude compared with single-layer approaches. Simulation against 1,436 scenarios, including HarmBench (416), JailbreakBench (200), and XSTest (450), achieves 72.9% combined detection [69.5-76.0% CI] with a 2.9% effective false positive rate after graduated response, while maintaining 100% detection of multi-turn crescendo and slow-drift attacks. The drop in detection on real benchmarks relative to author-generated scenarios (71.2% vs 76.7% on HarmBench, 47.0% vs 70.0% on JailbreakBench) underscores the importance of independent evaluation. CivicShield addresses an open gap at the intersection of AI safety, government compliance, and practical deployment.
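
The abstract's claim about orders of magnitude can be made concrete with a small calculation. The sketch below assumes that layers fail independently, so the probability that an attack passes every layer is the product of the per-layer bypass rates. That independence assumption is ours for illustration and is optimistic, since real layers often share blind spots. The function names and the example rates are hypothetical and are not figures from the paper.

```python
import math
from typing import Sequence


def bypass_probability(layer_bypass_rates: Sequence[float]) -> float:
    """Probability that an attack passes every layer, assuming the
    layers fail independently (an illustrative assumption)."""
    return math.prod(layer_bypass_rates)


def orders_of_magnitude(single_layer: float, layered: float) -> float:
    """Reduction in attack success, expressed in powers of ten."""
    return math.log10(single_layer / layered)


# Hypothetical numbers: one guardrail bypassed 90% of the time versus
# three independent layers that are each bypassed 40% of the time.
single = 0.9
layered = bypass_probability([0.4, 0.4, 0.4])
reduction = orders_of_magnitude(single, layered)
```

With these made-up rates the layered bypass probability is about 0.064, a reduction of slightly more than one order of magnitude. Correlated failures between layers would shrink that benefit, which is one reason empirical results can fall short of the theoretical bound.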