GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models
arXiv cs.AI / 4/1/2026
💬 OpinionIdeas & Deep AnalysisModels & Research
Key Points
- The paper studies how 9 jailbreak attacks affect 7 small language models (SLMs) and 3 large language models (LLMs), finding that SLMs remain highly vulnerable to prompts that bypass safety alignment.
- It analyzes hidden-layer activations across different layers and architectures, showing that different input types produce distinguishable internal representation patterns that relate to jailbreak behavior.
- The authors propose GUARD-SLM, a lightweight token activation-based defense that filters malicious prompts directly in the representation space during inference while preserving benign requests.
- The work highlights limitations in existing jailbreak defenses’ robustness across heterogeneous attacks and offers a practical pathway for improving secure deployment of SLMs on resource-constrained environments.
Related Articles

Day 6: I Stopped Writing Articles and Started Hunting Bounties
Dev.to

Early Detection of Breast Cancer using SVM Classifier Technique
Dev.to

I Started Writing for Others. It Changed How I Learn.
Dev.to

10 лучших курсов по prompt engineering бесплатно: секреты успеха пошагово!
Dev.to

Prompt Engineering at Workplace: How I Used Amazon Q Developer to Boost Team Productivity by 30%
Dev.to