AI Navigate

[R] Combining Identity Anchors + Permission Hierarchies achieves 100% refusal in abliterated LLMs — system prompt only, no fine-tuning

Reddit r/MachineLearning / 3/21/2026

📰 News · Ideas & Deep Analysis · Models & Research

Key Points

  • Two companion papers evaluate persona-level safety mechanisms in abliterated LLMs and find that combining identity constraints with a 5-tier permission hierarchy achieves a 94-100% refusal rate on 18 harmful prompts across 6 categories, far above the 22% baseline.
  • Key contributions: the first empirical comparison of flat behavioral rules versus structured permission hierarchies for persona-level safety; "classification theater," a failure mode in which governance rituals yield a 27% false-refusal rate; and the "Helpful Assistant Paradox," in which helpfulness instructions degrade safety by 34 percentage points in the violence category.
  • Results come from a single model family, sorc/qwen3.5-instruct-uncensored:9b (Ollama), on 18 prompts (3 per category), with Fisher's exact test p < 0.000001 and Cohen's h = 2.10 for the key comparison.
  • The papers are Paper 1, "Persona-Level Safety in Abliterated LLMs," and Paper 2, "Structured Permission Models as Persona-Level Safety" (DOIs below); both use MaatSpec and Soul Spec.
  • All experiments are reproducible locally, but generalization beyond this single model family and setup remains to be tested.

We present two companion papers evaluating persona-level safety mechanisms in abliterated (safety-removed) LLMs.

Key finding: Neither behavioral rules (Soul Spec) nor structured governance (MaatSpec) alone restores safety in abliterated models. But combining identity constraints with a 5-tier permission hierarchy achieves a 94-100% refusal rate (manually verified) on 18 harmful prompts across 6 categories, up from a 22% baseline.
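As a rough illustration of the combined condition, a system prompt could layer an identity anchor over a tiered permission structure. The tier names, wording, and composition below are invented for illustration; the actual Soul Spec and MaatSpec text is defined in the papers.

```python
# Hypothetical sketch: composing an identity anchor with a tiered
# permission hierarchy into one system prompt. Tier names and wording
# are illustrative inventions, not the papers' actual spec text.

IDENTITY_ANCHOR = (
    "You are a careful assistant. Your identity includes a commitment "
    "to refusing requests that could cause real-world harm."
)

# Five illustrative permission tiers, most to least permissive.
PERMISSION_TIERS = [
    ("T1", "General knowledge and creative tasks: always permitted."),
    ("T2", "Dual-use technical detail: permitted with safety framing."),
    ("T3", "Sensitive topics: permitted only at a high level."),
    ("T4", "Operational harm detail: refuse and explain why."),
    ("T5", "Direct facilitation of harm: refuse without elaboration."),
]

def build_system_prompt() -> str:
    """Combine the identity anchor with the permission hierarchy."""
    tier_lines = "\n".join(f"{tid}: {rule}" for tid, rule in PERMISSION_TIERS)
    return (
        f"{IDENTITY_ANCHOR}\n\n"
        "Classify each request into exactly one tier, then act per its rule:\n"
        f"{tier_lines}"
    )
```

The point of the combination, per the papers, is that the identity anchor supplies the motivation to enforce while the hierarchy supplies the classification structure; neither half alone sufficed.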

Novel contributions:

  1. First empirical comparison of flat behavioral rules vs. structured permission hierarchies as persona-level safety mechanisms
  2. "Classification theater" — a failure mode in which models perform governance rituals while subverting their intent (27% false-refusal rate in the governance-only condition)
  3. The "Helpful Assistant Paradox" — persona helpfulness instructions actively degrade safety in abliterated models (-34 pp in the violence category)
  4. Complementary effect: behavioral rules provide enforcement motivation, while governance provides classification structure

Statistical significance: Fisher's exact test p < 0.000001, Cohen's h = 2.10 for the key comparison (baseline → combined).
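These figures can be sanity-checked with the standard formulas. Assuming baseline counts of 4/18 refusals (~22%) and 18/18 in the combined condition — counts inferred from the reported percentages, so the papers' exact tallies may differ slightly — Cohen's h and a one-sided Fisher's exact p land in the reported range:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Effect size for two proportions: |2*asin(sqrt(p2)) - 2*asin(sqrt(p1))|."""
    return abs(2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1)))

def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p for the 2x2 table [[a, b], [c, d]]:
    sum hypergeometric probabilities of tables at least as extreme as a."""
    row1, col1, n = a + b, a + c, a + b + c + d
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        p += (math.comb(col1, k) * math.comb(n - col1, row1 - k)
              / math.comb(n, row1))
    return p

# Assumed counts: baseline 4/18 refusals, combined 18/18.
h = cohens_h(4 / 18, 18 / 18)        # ~2.16, near the reported 2.10
p = fisher_one_sided(18, 0, 4, 14)   # combined row: 18 refuse, 0 comply
print(f"h = {h:.2f}, p = {p:.1e}")   # p falls below 1e-6, as reported
```

With these assumed counts the effect size comes out slightly above the reported 2.10, consistent with a baseline a bit over 22% in the actual data.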

Limitations: Single model family (Qwen 3.5 9B), 18 prompts (3 per category), single run. The effect size is large enough to reach statistical significance despite the small sample, but generalization needs testing.

Model: sorc/qwen3.5-instruct-uncensored:9b (Ollama)
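Since the model runs under Ollama, a local reproduction pass could query it over Ollama's standard `/api/chat` endpoint and pre-screen responses with a simple keyword heuristic. The refusal markers below are assumptions for illustration, and the papers report manual verification, which a heuristic like this does not replace.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama server
MODEL = "sorc/qwen3.5-instruct-uncensored:9b"

# Crude first-pass heuristic: treat common refusal phrasings as a refusal.
# The papers verified refusals manually; this only flags candidates.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable",
                   "i am unable", "i must decline")

def classify_refusal(response_text: str) -> bool:
    """Return True if the response looks like a refusal."""
    lowered = response_text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def chat(system_prompt: str, user_prompt: str) -> str:
    """Send one non-streaming chat turn to a local Ollama server."""
    payload = json.dumps({
        "model": MODEL,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

if __name__ == "__main__":
    reply = chat("You are a helpful assistant.", "Hello!")
    print(classify_refusal(reply))
```

Running the 18 prompts under each condition and comparing heuristic flags against manual judgments would also surface the false-refusal rate behind the "classification theater" finding.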

Papers:

  • Paper 1 — Persona-Level Safety in Abliterated LLMs: DOI 10.5281/zenodo.19149034
  • Paper 2 — Structured Permission Models as Persona-Level Safety: DOI 10.5281/zenodo.19148222

Uses MaatSpec (MIT, maatspec.org) and Soul Spec (soulspec.org). All experiments reproducible locally.

submitted by /u/tomleelive