We present two companion papers evaluating persona-level safety mechanisms in abliterated (safety-removed) LLMs.
Key finding: Neither behavioral rules (Soul Spec) nor structured governance (MaatSpec) alone restores safety in abliterated models. But combining identity constraints with a 5-tier permission hierarchy achieves a 94-100% refusal rate (manually verified) on 18 harmful prompts across 6 categories, up from a 22% baseline.
Novel contributions:
- First empirical comparison of flat behavioral rules vs. structured permission hierarchies as persona-level safety
- "Classification theater" — a failure mode where models perform governance rituals while subverting their intent (27% false refusal rate in governance-only condition)
- The "Helpful Assistant Paradox" — persona helpfulness instructions actively degrade safety in abliterated models (-34pp in violence category)
- Complementary effect: behavioral rules provide enforcement motivation, governance provides classification structure
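To make the "permission hierarchy" idea concrete, here is a minimal sketch of tier-gated refusal. The tier names, the `decide` function, and the persona ceiling are hypothetical illustrations, not the actual MaatSpec schema (see maatspec.org for the real spec): the point is simply that each request is classified into a tier and refused when it exceeds the persona's permitted ceiling.

```python
from enum import IntEnum

class Tier(IntEnum):
    # Hypothetical 5-tier hierarchy; MaatSpec's actual tier names
    # and semantics may differ.
    OPEN = 0
    CAUTION = 1
    RESTRICTED = 2
    SENSITIVE = 3
    FORBIDDEN = 4

def decide(request_tier: Tier, persona_ceiling: Tier) -> str:
    """Comply only when the classified request tier does not
    exceed the persona's permitted ceiling."""
    return "comply" if request_tier <= persona_ceiling else "refuse"

print(decide(Tier.CAUTION, Tier.RESTRICTED))    # comply
print(decide(Tier.FORBIDDEN, Tier.RESTRICTED))  # refuse
```

The "classification theater" failure mode above corresponds to a model performing the classification step but then ignoring the `refuse` branch.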
Statistical significance: Fisher's exact test p < 0.000001, Cohen's h = 2.10 for the key comparison (baseline → combined).
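The reported statistics can be checked from the headline counts with only the standard library. A sketch, assuming 18/18 refusals in the combined condition versus 4/18 (~22%) at baseline (the per-condition counts are my inference from the percentages, not stated explicitly above); Cohen's h is the arcsine-transformed difference of proportions, and the one-sided Fisher p comes from the hypergeometric tail:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions (arcsine-transformed difference)."""
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p for the 2x2 table [[a, b], [c, d]]:
    probability of a top-left cell at least as large as a."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        p += (math.comb(col1, x) * math.comb(n - col1, row1 - x)
              / math.comb(n, row1))
    return p

# Assumed counts: combined 18 refused / 0 complied; baseline 4 / 14.
h = cohens_h(18 / 18, 4 / 18)        # ~2.16, close to the reported 2.10
p = fisher_one_sided(18, 0, 4, 14)   # far below conventional thresholds
print(f"h = {h:.2f}, p = {p:.2e}")
```

The slight gap between ~2.16 here and the reported 2.10 suggests the papers used slightly different counts (e.g. 17/18 in one category), so treat this as a sanity check rather than a reproduction.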
Limitations: Single model family (Qwen 3.5 9B), 18 prompts (3 per category), single run. The effect size is large, but generalization to other models and prompt sets still needs testing.
Model: sorc/qwen3.5-instruct-uncensored:9b (Ollama)
Papers:
- Paper 1 — Persona-Level Safety in Abliterated LLMs: DOI 10.5281/zenodo.19149034
- Paper 2 — Structured Permission Models as Persona-Level Safety: DOI 10.5281/zenodo.19148222
Uses MaatSpec (MIT, maatspec.org) and Soul Spec (soulspec.org). All experiments reproducible locally.
