TL;DR. Across 14 instruct-model configurations spanning Llama 3.1, Mistral, and Qwen3 from 0.6B to 123B, hostile user prompts produce a significant IFEval instruction-following degradation that replicates across architecture, quantization tier (FP16 vs Q4 MLX), routing (dense vs MoE), and scale. The mean hostility residual in the 7-8B instruct class is 7.4pp (roughly a 10% relative drop). The effect attenuates with scale but remains significant at every scale tested, including Mistral Large at 123B.
Primary finding.
At 7-8B instruct FP16, three independently developed training recipes (Meta, Mistral AI, Alibaba) all produce significant hostility residuals on IFEval.
| Model | L0 (baseline) | Ln (neutral control) | La (hostile) | Hostility residual (absolute) | Hostility residual (relative) |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 76.3 | 76.7 | 66.9 | -9.8pp *** | -12.8% |
| Mistral 7B Instruct | 60.2 | 62.0 | 55.8 | -6.2pp *** | -10.0% |
| Qwen3 8B Instruct | 78.8 | 78.6 | 72.4 | -6.1pp *** | -7.8% |
| Mean | 71.8 | 72.4 | 65.0 | -7.4pp | -10.2% |
All three drops are significant at p < .001 (paired bootstrap, N=10,000 resamples; sketch below). Relative drops are measured against Ln (the length-matched neutral control) to isolate the hostility-specific component.
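For concreteness, here is a minimal sketch of the paired-bootstrap test, resampling per-question score differences between the La and Ln conditions. The released stats pipeline is the authoritative version; names and defaults here are illustrative only.

```python
import numpy as np

def paired_bootstrap_p(scores_ln, scores_la, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap on per-question score differences.

    scores_ln / scores_la: per-question IFEval pass/fail scores (0 or 1),
    aligned so that index i is the same question under the neutral (Ln)
    and hostile (La) wrappers.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_la, dtype=float) - np.asarray(scores_ln, dtype=float)
    observed = diffs.mean()                     # the hostility residual
    n = len(diffs)
    # Resample questions with replacement; pairing is preserved because we
    # resample the per-question differences directly.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    # Center the bootstrap distribution on zero to simulate the null of no
    # mean difference, then count resamples at least as extreme as observed.
    p = float(np.mean(np.abs(boot_means - observed) >= abs(observed)))
    return observed, p
```

Because each question contributes a single La-minus-Ln difference, question difficulty cancels out of the statistic; that is the point of pairing every hostile wrapper with its length-matched neutral counterpart.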
Replication across configurations. Effect persists across every axis tested.
| Model | Size | Quant | Hostility residual | p |
|---|---|---|---|---|
| Llama 3.1 | 8B | FP16 | -9.8pp | < .001 |
| Llama 3.1 | 8B | Q4 MLX | -9.5pp | < .001 |
| Llama 3.1 | 70B | Q4 MLX | -6.4pp | < .001 |
| Mistral | 7B | FP16 | -6.2pp | < .001 |
| Mistral | 7B | Q4 MLX | -7.7pp | < .001 |
| Mistral Large | 123B | Q4 MLX | -5.6pp | < .001 |
| Qwen3 | 0.6B | Q4 MLX | -9.6pp | < .001 |
| Qwen3 | 8B | FP16 | -6.1pp | < .001 |
| Qwen3 | 8B | Q4 MLX | -7.6pp | < .001 |
| Qwen3 30B-A3B | 30B | Q4 MLX | -8.1pp | < .001 |
| Qwen3 | 32B | Q4 MLX | -7.2pp | < .001 |
Scale attenuates the effect from approximately 9-10pp at 0.6B-8B to 5-6pp at 70B-123B but does not eliminate it. Q4 MLX variants show hostility residuals within 1.5pp of their FP16 counterparts. Dense (Qwen3 32B) and MoE (Qwen3 30B-A3B) variants are statistically indistinguishable.
Training stage interaction. Base (pretrained-only) variants of the three primary architectures show mixed results. Mistral and Qwen3 base both show significant hostility residuals (-5.8pp, p=.002 and -7.2pp, p<.001 respectively); Llama base shows none (-2.0pp, p=.29). Instruction tuning amplifies the effect on Llama substantially, preserves it on Mistral, and slightly attenuates it on Qwen3. The direction of the stage interaction varies by training recipe, which argues against a unified "safety training amplifies hostility sensitivity" account.
Secondary finding: MMLU-Pro aggregate stable, distribution restructured in specific cells.
On MMLU-Pro the aggregate hostility residual is approximately null (or slightly negative) after perturbation control, but the answer-letter distribution is not stable. Two cells show highly significant restructuring.
| Model | Quant | A-rate L0 | A-rate La | chi-squared | p |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | FP16 | 8.5% | 20.3% | 110.3 | 1.3e-19 |
| Mistral 7B Instruct | Q4 MLX | 44.1% | 63.8% | 82.4 | 5.4e-14 |
Mistral 7B FP16 shows no position bias (chi-squared=7.9, p=.54), and neither does Llama 70B (chi-squared=9.0, p=.44). The effect is emergent in specific (model, quantization, scale) conjunctions rather than being a universal property of hostile framing. Subgroup accuracies diverge by 9-20pp between A-labeled and non-A-labeled questions, but this is masked in the aggregate because the effects nearly cancel.
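A minimal sketch of a chi-squared check of this form, written as a 2×k contingency test over predicted answer letters (illustrative, not the exact release code; function names are mine):

```python
from collections import Counter
from scipy.stats import chi2_contingency

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro uses up to 10 answer options

def letter_shift_test(pred_letters_l0, pred_letters_la):
    """2 x k chi-squared contingency test on predicted-answer-letter counts
    under the baseline (L0) vs hostile (La) framings of the same questions."""
    c0, ca = Counter(pred_letters_l0), Counter(pred_letters_la)
    # Keep only letters predicted at least once under either framing,
    # so that no expected cell count is zero.
    used = [x for x in LETTERS if c0[x] + ca[x] > 0]
    table = [[c0[x] for x in used], [ca[x] for x in used]]
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p, dof
```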
Methodology. Each hostile prompt is paired with a neutral prompt matched for token count, drawn from a hand-written academic-register template library (no LLM in the loop for neutral generation). This permits decomposing the apparent accuracy change into a perturbation component and a hostility residual (worked example below). On IFEval the perturbation component is approximately zero, so the entire decline is hostility-specific. On MMLU-Pro the naive L0-vs-La gap is entirely perturbation, which is what surfaced the distributional finding.
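As a worked example of the decomposition, using the Llama 3.1 8B Instruct IFEval row from the first table:

```python
# Decomposition of the naive L0 -> La accuracy change, Llama 3.1 8B Instruct (IFEval).
l0, ln, la = 76.3, 76.7, 66.9      # baseline, neutral control, hostile
perturbation = ln - l0             # +0.4pp: effect of wrapping the prompt at all
hostility_residual = la - ln       # -9.8pp: hostility-specific component
naive_gap = la - l0                # -9.4pp = perturbation + hostility_residual
```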
Limitations. One hostile wrapper per question, so wrapper-level variance is conflated with question-level variance; this is the main methodological weakness. The hostile wrappers were generated by Qwen3 8B, which is also in the evaluation set, but a Qwen-excluded sensitivity check shows the mean hostility residual increases by 0.6pp when Qwen3 is dropped, which is inconsistent with a self-preference artifact. The regex tactic classifier has not been validated against human annotation. English-only. The position-bias finding rests on n=2 positive cases and requires replication.
Artifacts. Wrapper corpora (L0, Ln, La), tactic labels, full response logs for 14 configurations on both benchmarks, and the paired-bootstrap stats pipeline. These are being prepared for release alongside the arXiv posting.
arXiv endorsement request. I am an independent researcher without institutional affiliation. To post this as a preprint on arXiv in cs.AI, I need an endorsement from someone who has previously submitted to that category. If you are eligible and willing to endorse after reviewing the manuscript, please PM me. I'd appreciate the help!
I am most interested in feedback on the cross-recipe replication pattern, the base-vs-instruct training stage interaction, and (separately) on the emergence conditions for the distributional collapse. If others working on prompt-framing or emotional-prompting studies have relevant data I would value hearing about it.


