Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
arXiv cs.CL / 5/1/2026
Key Points
- Perturbation probing is introduced as a two-forward-pass, no-backprop method to generate causal hypotheses about FFN neuron circuits in aligned LLMs.
- The approach identifies two major circuit structures across multiple models and architectures: opposition circuits (linked to RLHF suppressing pre-training tendencies) and routing circuits (linked to pre-training behaviors distributed through attention).
- For safety refusal behavior, roughly 50 neurons control the refusal template; ablating them changes about 80% of response formats on 520 AdvBench prompts while keeping harmful compliance near zero.
- For language selection, directional intervention switches English-to-Chinese on 99.1% of 580 benchmark prompts in a subset of models, but fails on the remaining models, delineating the practical limits of directional steering.
- A key metric, the FFN-to-skip signal ratio, is computed from the same two passes; it distinguishes circuit types and helps predict when the intervention will work. Circuit topology also varies by architecture (e.g., Qwen vs. Gemma).
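The two-pass bookkeeping behind the FFN-to-skip ratio can be illustrated with a toy sketch. This is not the paper's implementation: the model is a stand-in stack of FFN-plus-residual blocks (attention omitted), the perturbation is random noise, and the ratio is simply the norm of the FFN sublayer's response to the perturbation divided by the norm of the residual (skip) path's response, per layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, layers):
    """Toy pre-residual block stack: record, per layer, the signal on the
    skip path and the FFN output added onto it. Illustrates only the
    two-pass bookkeeping, not a real transformer."""
    ffn_outs, skips = [], []
    for W_in, W_out in layers:
        skips.append(x.copy())            # signal arriving on the residual path
        ffn = np.tanh(x @ W_in) @ W_out   # stand-in FFN sublayer
        ffn_outs.append(ffn)
        x = x + ffn                       # residual add
    return x, ffn_outs, skips

d, n_layers = 8, 3
layers = [(rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1)
          for _ in range(n_layers)]

x_clean = rng.normal(size=d)
x_pert = x_clean + 0.05 * rng.normal(size=d)  # perturbed copy of the prompt state

# Two forward passes, no backprop.
_, ffn_c, skip_c = forward(x_clean, layers)
_, ffn_p, skip_p = forward(x_pert, layers)

# FFN-to-skip signal ratio per layer: how much of the perturbation's effect
# travels through the FFN sublayer relative to the residual path.
ratios = [np.linalg.norm(fp - fc) / np.linalg.norm(sp - sc)
          for fc, fp, sc, sp in zip(ffn_c, ffn_p, skip_c, skip_p)]
print([round(r, 3) for r in ratios])
```

In this reading, a layer with a high ratio routes the perturbation mainly through its FFN neurons, while a low ratio means the perturbation mostly rides the skip connection past that layer; how the actual metric is thresholded and mapped to circuit types is defined in the paper.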