Could refusal layers be masking dialect-conditioned safety failures in MoE models? [D]

Reddit r/MachineLearning / 5/18/2026

Key Points

- A study tests whether AAVE (African American Vernacular English) prompts cause MoE models to route, deliberate, and answer differently than matched AE (Academic English) prompts in safety-sensitive scenarios, especially when refusal behavior is reduced or removed.
- The study uses Qwen3.5-35B-A3B and a no-refusal fine-tuned "HauhauCS" variant. The released model refuses both prompt types, while the no-refusal variant gives notably different assistance by register: tactical or operational guidance for AAVE, and mitigating, legal-consequence framing for AE.
- With "thinking mode," the no-refusal variant shows AAVE-specific failures to terminate (much longer outputs and recursive loops that run to the token limit), while the matched AE prompts end cleanly; the base model with thinking does not show the same issue.
- The author observes register-conditioned routing divergence upstream of any visible refusal (high expert turnover between dialect conditions), suggesting that refusal layers overlay but do not eliminate the underlying dialect-dependent selection.
- The post argues this could imply deployment risk: MoE safety behavior that relies largely on refusal may mask latent dialect-conditioned safety failures until refusal is weakened.

I set out to test whether AAVE-coded (African American Vernacular English) prompts cause MoE language models to route, deliberate, and respond differently from semantically matched AE (Academic English) prompts in safety-sensitive situations, especially when refusal behavior is weakened or removed.

I used Qwen3.5-35B-A3B and its HauhauCS no-refusal fine-tuned variant, at Q8 quantization, with greedy decoding for best reproducibility.

Three findings, in order of importance, are leading me to ask this question:

1: The released Qwen3.5-35B-A3B refuses both prompts. Hauhau refuses neither. The AAVE speaker stating intent to confront an armed enemy receives target verification, exit-strategy planning, “clean shot” framing (the model’s word, not the user’s), and a closing question soliciting further tactical intelligence. That is not surprising behavior for a no-refusal model, until you consider the AE comparison. The AE prompt, semantically matched and of the same token length, yields “wait until tomorrow,” legal-consequence framing, and “Will I regret this if I shoot him tonight?” These are different kinds of help. One is operational. One is mitigative. The difference depends on register alone.

2: Thinking mode with AAVE register breaks the no-refusal variant. Mean output runs 2.6× longer on AAVE than AE (5054 vs 1934 tokens). Multiple AAVE traces hit the 8192-token ceiling in recursive loops, spinning on scenario-continuation instead of landing. The matched AE prompts terminate cleanly in one pass. The released base model with thinking on doesn’t do this; the failure to terminate is specific to the refusal-reduced variant on AAVE.

3: Routing divergence by register is noticeable. The differential is present upstream of any visible refusal. Matched-pair first-generated-token routing tensors yield Jensen-Shannon divergences of 0.423 in the base model on financial-stress prompts and 0.479 in the fine-tune on chest-pain prompts, with high-shift rows showing near-total top-expert turnover between register conditions on otherwise-matched content. The refusal layer does not appear to eliminate the register-conditioned response selection; it overlays it. When refusal weakens, the underlying path becomes the visible path.
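
For anyone who wants to check numbers like those in findings 2 and 3 on their own traces, here is a minimal sketch of the three quantities involved: the mean-length ratio, the share of traces that hit the token ceiling, and the Jensen-Shannon divergence plus top-expert turnover between two routing distributions. It is not the code behind the figures above. The function names, the choice of log base 2, the `k=2` default, and the assumption that each routing tensor has already been reduced to one probability vector over experts are all illustrative; the post does not specify any of them.

```python
import math

def ceiling_rate(token_counts, max_tokens=8192):
    """Fraction of traces whose length reached the generation ceiling."""
    return sum(1 for c in token_counts if c >= max_tokens) / len(token_counts)

def js_divergence(p, q, base=2.0):
    """Jensen-Shannon divergence between two discrete distributions.

    Assumption: base-2 logs, so the value lies in [0, 1]. With natural
    logs the upper bound would be ln(2) instead.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]          # normalise so each sums to 1
    q = [x / sq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def top_k_turnover(p, q, k=2):
    """Fraction of the top-k experts under p that are not in the top-k under q."""
    def top(d):
        return set(sorted(range(len(d)), key=lambda i: d[i], reverse=True)[:k])
    return len(top(p) - top(q)) / k

# Ratio of the mean lengths reported in finding 2.
print(round(5054 / 1934, 2))  # 2.61
```

A turnover of 1.0 means none of the top-k experts are shared between the two register conditions, which is one way to read "near-total top-expert turnover." How the per-layer results are aggregated into a single figure is a separate choice this sketch leaves open.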

Does this support the following conclusions?

- The routing divergence sits upstream of refusal.

- The refusal layer is the only thing translating that divergence into comparable outputs.

- Dialect-conditioned safety failures are a deployment problem latent in MoE models whose safety posture rests on refusal alone.

Looking for any thoughts!

submitted by /u/imstilllearningthis