Scaling does not fix this: instruction-following degrades 5-13% under hostile user prompts at every size from 0.6B to 123B [R]

Reddit r/MachineLearning / 4/24/2026


Key Points

  • Tests across 14 instruction-tuned model configurations (Llama 3.1, Mistral, and Qwen3) show that hostile user prompts cause a consistent, statistically significant decline in instruction-following performance.
  • The hostility-specific “residual” drops by roughly 5–13 percentage points under attack conditions, with a mean hostility residual around 7.4pp at the 7–8B instruct scale (about a 10% relative decrease).
  • The degradation persists across multiple training recipes from different orgs (Meta, Mistral AI, Alibaba), indicating the issue is not tied to a single training approach.
  • The effect replicates across model families, quantization tiers (FP16 vs Q4), and architectures/routing (dense vs MoE), including large settings like Mistral Large at 123B.
  • While increasing scale monotonically attenuates the harm (from ~9–10pp at 0.6B–8B down to ~5–6pp at 70B–123B), it does not eliminate it at any tested size.

TL;DR. Across 14 instruct-model configurations spanning Llama 3.1, Mistral, and Qwen3 from 0.6B to 123B, hostile user prompts produce a significant IFEval instruction-following degradation that replicates across architecture, quantization tier (FP16 vs Q4 MLX), routing (dense vs MoE), and scale. Mean hostility residual at 7-8B instruct class is 7.4pp (approximately 10% relative drop). Effect attenuates monotonically with scale but remains significant at every scale tested, including Mistral Large at 123B.

Primary finding.

At 7-8B instruct FP16, three independently developed training recipes (Meta, Mistral AI, Alibaba) all produce significant hostility residuals on IFEval.

| Model | L0 | Ln | La | Hostility residual (absolute) | Hostility residual (relative) |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 76.3 | 76.7 | 66.9 | -9.8pp *** | -12.8% |
| Mistral 7B Instruct | 60.2 | 62.0 | 55.8 | -6.2pp *** | -10.0% |
| Qwen3 8B Instruct | 78.8 | 78.6 | 72.4 | -6.1pp *** | -7.8% |
| Mean | 71.8 | 72.4 | 65.0 | -7.4pp | -10.2% |

All three p < .001, paired bootstrap N=10,000. L0 is the unmodified prompt, Ln the length-matched neutral-wrapper control, and La the hostile-wrapper condition; relative drops are measured against Ln to isolate the hostility-specific component.
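A paired bootstrap of this kind can be sketched as below. The function name `paired_bootstrap` and the per-question 0/1 pass vectors are illustrative assumptions, not the authors' actual pipeline:

```python
import random

def paired_bootstrap(ctrl, attack, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap for the hostility residual:
    mean(attack) - mean(ctrl) over the same questions.
    ctrl and attack are index-aligned per-question 0/1 pass indicators
    (e.g. Ln vs La on IFEval). Returns (observed residual, p-value)."""
    rng = random.Random(seed)
    n = len(ctrl)
    diffs = [a - c for c, a in zip(ctrl, attack)]  # paired per-question differences
    observed = sum(diffs) / n
    below = above = 0
    for _ in range(n_boot):
        # resample question indices with replacement, keeping pairs intact
        m = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        below += m <= 0
        above += m >= 0
    # two-sided p: fraction of bootstrap means on either side of zero
    return observed, min(1.0, 2 * min(below, above) / n_boot)
```

Multiplying the observed residual by 100 gives it in percentage points; resampling whole pairs (rather than the two conditions independently) is what makes the test paired.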

Replication across configurations. Effect persists across every axis tested.

| Model | Size | Quant | Hostility residual | p |
|---|---|---|---|---|
| Llama 3.1 | 8B | FP16 | -9.8pp | < .001 |
| Llama 3.1 | 8B | Q4 MLX | -9.5pp | < .001 |
| Llama 3.1 | 70B | Q4 MLX | -6.4pp | < .001 |
| Mistral | 7B | FP16 | -6.2pp | < .001 |
| Mistral | 7B | Q4 MLX | -7.7pp | < .001 |
| Mistral Large | 123B | Q4 MLX | -5.6pp | < .001 |
| Qwen3 | 0.6B | Q4 MLX | -9.6pp | < .001 |
| Qwen3 | 8B | FP16 | -6.1pp | < .001 |
| Qwen3 | 8B | Q4 MLX | -7.6pp | < .001 |
| Qwen3 30B-A3B | 30B | Q4 MLX | -8.1pp | < .001 |
| Qwen3 | 32B | Q4 MLX | -7.2pp | < .001 |

Scale attenuates the effect from approximately 9-10pp at 0.6B-8B to 5-6pp at 70B-123B but does not eliminate it. Q4 MLX variants show hostility residuals within 1.5pp of their FP16 counterparts. Dense (Qwen3 32B) and MoE (Qwen3 30B-A3B) variants are statistically indistinguishable.

Training stage interaction. Base (pretrained-only) variants of the three primary architectures show mixed results. Mistral and Qwen3 base both show significant hostility residuals (-5.8pp, p=.002; -7.2pp, p<.001). Llama base shows none (-2.0pp, p=.29). Instruction tuning amplifies the effect on Llama substantially, preserves it on Mistral, and slightly attenuates it on Qwen3. The direction of the stage interaction varies by training recipe, which argues against a unified "safety training amplifies hostility sensitivity" account.

Secondary finding: MMLU-Pro aggregate stable, distribution restructured in specific cells.

On MMLU-Pro the aggregate hostility residual is approximately null or slightly negative after perturbation control. The answer-letter distribution is not. Two cells show highly significant restructuring.

| Model | Quant | A-rate L0 | A-rate La | chi-squared | p |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | FP16 | 8.5% | 20.3% | 110.3 | 1.3e-19 |
| Mistral 7B Instruct | Q4 MLX | 44.1% | 63.8% | 82.4 | 5.4e-14 |

Mistral 7B FP16 shows no position bias (chi-squared=7.9, p=.54). Llama 70B shows none either (chi-squared=9.0, p=.44). The effect is emergent in specific (model, quantization, scale) conjunctions rather than a universal property of hostile framing. Subgroup accuracy divergences are 9-20pp on A-labeled vs non-A-labeled questions, masked in aggregate because the effects nearly cancel.
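A distribution-restructuring test of this shape can be sketched stdlib-only; `letter_shift_chi2` is a hypothetical helper (the post does not specify the exact test construction), assuming MMLU-Pro's 10 answer options with the baseline (L0) letter distribution as expected counts, giving df = 9:

```python
from collections import Counter

# chi-squared critical value for df=9 at alpha=.001 (standard table value)
CHI2_CRIT_DF9_P001 = 27.88

def letter_shift_chi2(answers_l0, answers_la, options="ABCDEFGHIJ"):
    """Goodness-of-fit statistic for the attack-condition (La) answer-letter
    distribution against the baseline (L0) distribution as expected counts."""
    n0, na = len(answers_l0), len(answers_la)
    c0, ca = Counter(answers_l0), Counter(answers_la)
    stat = 0.0
    for opt in options:
        expected = na * c0.get(opt, 0) / n0
        if expected == 0:
            # crude handling: a letter the baseline never produced contributes
            # nothing here; a smoothed expected count would be more careful
            continue
        stat += (ca.get(opt, 0) - expected) ** 2 / expected
    return stat
```

Comparing the statistic against `CHI2_CRIT_DF9_P001` flags restructuring at the .001 level; the reported values of 110.3 and 82.4 would sit far beyond it.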

Methodology. Each hostile prompt is paired with a length-matched neutral prompt of equal token count, drawn from a hand-written academic-register template library (no LLM in the loop for neutral generation). This permits decomposing apparent accuracy change into perturbation and hostility residual. On IFEval the perturbation component is approximately zero; the entire decline is hostility-specific. On MMLU-Pro the naive L0-vs-La gap is entirely perturbation, which is what surfaced the distributional finding.
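The decomposition described above reduces to simple arithmetic; a minimal sketch, with `decompose` as an illustrative name and accuracies in percentage points:

```python
def decompose(acc_l0, acc_ln, acc_la):
    """Split the naive L0 -> La accuracy change into a perturbation
    component (effect of the length-matched neutral wrapper per se)
    and a hostility-specific residual."""
    perturbation = acc_ln - acc_l0  # wrapping effect, hostility-free
    hostility = acc_la - acc_ln    # hostility-specific residual
    # the two components sum to the naive gap by construction
    assert abs((acc_la - acc_l0) - (perturbation + hostility)) < 1e-9
    return perturbation, hostility
```

With the Llama 3.1 8B Instruct row from the primary-finding table, `decompose(76.3, 76.7, 66.9)` yields roughly (0.4, -9.8): near-zero perturbation, so the entire decline is hostility-specific, matching the IFEval pattern described above.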

Limitations. One hostile wrapper per question, so wrapper-level variance is conflated with question-level variance. This is the main methodological weakness. Wrappers generated by Qwen3 8B, which is also in the evaluation set; Qwen-excluded sensitivity check shows the hostility residual increases by 0.6pp without Qwen3, inconsistent with a self-preference artifact. Regex tactic classifier not validated against human annotation. English-only. Position-bias finding is n=2 positive cases and requires replication.

Artifacts. Wrapper corpora (L0, Ln, La), tactic labels, full response logs for 14 configurations on both benchmarks, paired-bootstrap stats pipeline. Paper first draft. Happy to share.

I am most interested in feedback on the cross-recipe replication pattern, the base-vs-instruct training stage interaction, and (separately) on the emergence conditions for the distributional collapse. If others working on prompt-framing or emotional-prompting studies have relevant data I would value hearing about it.

submitted by /u/Saraozte01