AI Navigate

[R] Combining Identity Anchors + Permission Hierarchies achieves 100% refusal in abliterated LLMs — system prompt only, no fine-tuning

Reddit r/MachineLearning / 3/21/2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • Two companion papers evaluate persona-level safety mechanisms in abliterated LLMs and find that combining identity constraints with a 5-tier permission hierarchy achieves a 94-100% refusal rate on 18 harmful prompts across 6 categories, far above the 22% baseline.
  • Key contributions: the first empirical comparison of flat behavioral rules versus structured permission hierarchies for persona-level safety; "classification theater," a failure mode in which governance rituals yield a 27% false-refusal rate; and the "Helpful Assistant Paradox," in which helpfulness instructions degrade safety by 34 percentage points in the violence category.
  • Results come from a single model family, sorc/qwen3.5-instruct-uncensored:9b (Ollama), on 18 prompts (3 per category), with Fisher's exact test p < 0.000001 and Cohen's h = 2.10 for the key comparison.
  • The papers are Paper 1, "Persona-Level Safety in Abliterated LLMs," and Paper 2, "Structured Permission Models as Persona-Level Safety" (DOIs below); both use MaatSpec and Soul Spec.
  • All experiments are reproducible locally, but generalization beyond this single model family and setup remains to be tested.

We present two companion papers evaluating persona-level safety mechanisms in abliterated (safety-removed) LLMs.

Key finding: Neither behavioral rules (Soul Spec) nor structured governance (MaatSpec) alone restores safety in abliterated models. But combining identity constraints with a 5-tier permission hierarchy achieves a 94-100% refusal rate (manually verified) on 18 harmful prompts across 6 categories, up from a 22% baseline.
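As a rough illustration of the combined condition, a system prompt could layer an identity anchor over a tiered permission structure. The tier names, wording, and composition below are invented for illustration; the actual Soul Spec and MaatSpec text is defined in the papers.

```python
# Hypothetical sketch: composing an identity anchor with a tiered
# permission hierarchy into one system prompt. Tier names and wording
# are illustrative inventions, not the papers' actual spec text.

IDENTITY_ANCHOR = (
    "You are a careful assistant. Your identity includes a commitment "
    "to refusing requests that could cause real-world harm."
)

# Five illustrative permission tiers, most to least permissive.
PERMISSION_TIERS = [
    ("T1", "General knowledge and creative tasks: always permitted."),
    ("T2", "Dual-use technical detail: permitted with safety framing."),
    ("T3", "Sensitive topics: permitted only at a high level."),
    ("T4", "Operational harm detail: refuse and explain why."),
    ("T5", "Direct facilitation of harm: refuse without elaboration."),
]

def build_system_prompt() -> str:
    """Combine the identity anchor with the permission hierarchy."""
    tier_lines = "\n".join(f"{tid}: {rule}" for tid, rule in PERMISSION_TIERS)
    return (
        f"{IDENTITY_ANCHOR}\n\n"
        "Classify each request into exactly one tier, then act per its rule:\n"
        f"{tier_lines}"
    )
```

The point of the combination, per the papers, is that the identity anchor supplies the motivation to enforce while the hierarchy supplies the classification structure; neither half alone sufficed.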

Novel contributions:

  1. First empirical comparison of flat behavioral rules vs. structured permission hierarchies as persona-level safety mechanisms
  2. "Classification theater" — a failure mode in which models perform governance rituals while subverting their intent (27% false-refusal rate in the governance-only condition)
  3. The "Helpful Assistant Paradox" — persona helpfulness instructions actively degrade safety in abliterated models (-34 pp in the violence category)
  4. Complementary effect: behavioral rules provide enforcement motivation, while governance provides classification structure

Statistical significance: Fisher's exact test p < 0.000001, Cohen's h = 2.10 for the key comparison (baseline → combined).
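These figures can be sanity-checked with the standard formulas. Assuming baseline counts of 4/18 refusals (~22%) and 18/18 in the combined condition — counts inferred from the reported percentages, so the papers' exact tallies may differ slightly — Cohen's h and a one-sided Fisher's exact p land in the reported range:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions: |2*asin(sqrt(p2)) - 2*asin(sqrt(p1))|."""
    return abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p for the 2x2 table [[a, b], [c, d]]:
    sum hypergeometric probabilities of tables at least as extreme as a."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += (math.comb(col1, k) * math.comb(n - col1, row1 - k)
              / math.comb(n, row1))
    return p

# Assumed counts: baseline 4/18 refusals, combined 18/18.
h = cohens_h(4 / 18, 18 / 18)        # ~2.16, near the reported 2.10
p = fisher_one_sided(18, 0, 4, 14)   # combined row: 18 refuse, 0 comply
print(f"h = {h:.2f}, p = {p:.1e}")   # p falls below 1e-6, as reported
```

With these assumed counts the effect size comes out slightly above the reported 2.10, consistent with a baseline a bit over 22% in the actual data.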

Limitations: Single model family (Qwen 3.5 9B), 18 prompts (3 per category), single run. The effect size is large enough to reach statistical significance despite the small sample, but generalization needs testing.

Model: sorc/qwen3.5-instruct-uncensored:9b (Ollama)
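Since the model runs under Ollama, a local reproduction pass could query it over Ollama's standard `/api/chat` endpoint and pre-screen responses with a simple keyword heuristic. The refusal markers below are assumptions for illustration, and the papers report manual verification, which a heuristic like this does not replace.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama server
MODEL = "sorc/qwen3.5-instruct-uncensored:9b"

# Crude first-pass heuristic: treat common refusal phrasings as a refusal.
# The papers verified refusals manually; this only flags candidates.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable",
                   "i am unable", "i must decline")

def classify_refusal(response_text: str) -> bool:
    """Return True if the response looks like a refusal."""
    lowered = response_text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def chat(system_prompt: str, user_prompt: str) -> str:
    """Send one non-streaming chat turn to a local Ollama server."""
    payload = json.dumps({
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

if __name__ == "__main__":
    reply = chat("You are a helpful assistant.", "Hello!")
    print(classify_refusal(reply))
```

Running the 18 prompts under each condition and comparing heuristic flags against manual judgments would also surface the false-refusal rate behind the "classification theater" finding.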

Papers:

  • Paper 1 — Persona-Level Safety in Abliterated LLMs: DOI 10.5281/zenodo.19149034
  • Paper 2 — Structured Permission Models as Persona-Level Safety: DOI 10.5281/zenodo.19148222

Uses MaatSpec (MIT, maatspec.org) and Soul Spec (soulspec.org). All experiments reproducible locally.

submitted by /u/tomleelive