Scaling does not fix this: instruction-following degrades 5-13% under hostile user prompts at every size from 0.6B to 123B [R]

Reddit r/MachineLearning / 4/24/2026


Key Points

  • Tests across 14 instruction-tuned model configurations (Llama 3.1, Mistral, and Qwen3) show that hostile user prompts cause a consistent, statistically significant decline in instruction-following performance.
  • The hostility-specific “residual” drops by roughly 5–13 percentage points under attack conditions, with a mean hostility residual around 7.4pp at the 7–8B instruct scale (about a 10% relative decrease).
  • The degradation persists across multiple training recipes from different orgs (Meta, Mistral AI, Alibaba), indicating the issue is not tied to a single training approach.
  • The effect replicates across model families, quantization tiers (FP16 vs Q4), and architectures/routing (dense vs MoE), including large settings like Mistral Large at 123B.
  • While increasing scale monotonically attenuates the harm (from ~9–10pp at 0.6B–8B down to ~5–6pp at 70B–123B), it does not eliminate it at any tested size.

TL;DR. Across 14 instruct-model configurations spanning Llama 3.1, Mistral, and Qwen3 from 0.6B to 123B, hostile user prompts produce a significant IFEval instruction-following degradation that replicates across architecture, quantization tier (FP16 vs Q4 MLX), routing (dense vs MoE), and scale. Mean hostility residual at 7-8B instruct class is 7.4pp (approximately 10% relative drop). Effect attenuates monotonically with scale but remains significant at every scale tested, including Mistral Large at 123B.

Primary finding.

At 7-8B instruct FP16, three independently developed training recipes (Meta, Mistral AI, Alibaba) all produce significant hostility residuals on IFEval.

| Model | L0 | Ln | La | Hostility residual (absolute) | Hostility residual (relative) |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 76.3 | 76.7 | 66.9 | -9.8pp *** | -12.8% |
| Mistral 7B Instruct | 60.2 | 62.0 | 55.8 | -6.2pp *** | -10.0% |
| Qwen3 8B Instruct | 78.8 | 78.6 | 72.4 | -6.1pp *** | -7.8% |
| Mean | 71.8 | 72.4 | 65.0 | -7.4pp | -10.2% |

All three p < .001, paired bootstrap N=10,000. L0 is the unmodified prompt, Ln the length-matched neutral-wrapper control, and La the hostile-wrapper condition; relative drops are measured against Ln to isolate the hostility-specific component.
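A paired bootstrap of this kind can be sketched as below. The function name `paired_bootstrap` and the per-question 0/1 pass vectors are illustrative assumptions, not the authors' actual pipeline:

```python
import random

def paired_bootstrap(ctrl, attack, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap for the hostility residual:
    mean(attack) - mean(ctrl) over the same questions.
    ctrl and attack are index-aligned per-question 0/1 pass indicators
    (e.g. Ln vs La on IFEval). Returns (observed residual, p-value)."""
    rng = random.Random(seed)
    n = len(ctrl)
    diffs = [a - c for c, a in zip(ctrl, attack)]  # paired per-question differences
    observed = sum(diffs) / n
    below = above = 0
    for _ in range(n_boot):
        # resample question indices with replacement, keeping pairs intact
        m = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        below += m <= 0
        above += m >= 0
    # two-sided p: fraction of bootstrap means on either side of zero
    return observed, min(1.0, 2 * min(below, above) / n_boot)
```

Multiplying the observed residual by 100 gives it in percentage points; resampling whole pairs (rather than the two conditions independently) is what makes the test paired.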

Replication across configurations. Effect persists across every axis tested.

| Model | Size | Quant | Hostility residual | p |
|---|---|---|---|---|
| Llama 3.1 | 8B | FP16 | -9.8pp | < .001 |
| Llama 3.1 | 8B | Q4 MLX | -9.5pp | < .001 |
| Llama 3.1 | 70B | Q4 MLX | -6.4pp | < .001 |
| Mistral | 7B | FP16 | -6.2pp | < .001 |
| Mistral | 7B | Q4 MLX | -7.7pp | < .001 |
| Mistral Large | 123B | Q4 MLX | -5.6pp | < .001 |
| Qwen3 | 0.6B | Q4 MLX | -9.6pp | < .001 |
| Qwen3 | 8B | FP16 | -6.1pp | < .001 |
| Qwen3 | 8B | Q4 MLX | -7.6pp | < .001 |
| Qwen3 30B-A3B | 30B | Q4 MLX | -8.1pp | < .001 |
| Qwen3 | 32B | Q4 MLX | -7.2pp | < .001 |

Scale attenuates the effect from approximately 9-10pp at 0.6B-8B to 5-6pp at 70B-123B but does not eliminate it. Q4 MLX variants show hostility residuals within 1.5pp of their FP16 counterparts. Dense (Qwen3 32B) and MoE (Qwen3 30B-A3B) variants are statistically indistinguishable.

Training stage interaction. Base (pretrained-only) variants of the three primary architectures show mixed results. Mistral and Qwen3 base both show significant hostility residuals (-5.8pp, p=.002; -7.2pp, p<.001). Llama base shows none (-2.0pp, p=.29). Instruction tuning amplifies the effect on Llama substantially, preserves it on Mistral, and slightly attenuates it on Qwen3. The direction of the stage interaction varies by training recipe, which argues against a unified "safety training amplifies hostility sensitivity" account.

Secondary finding: MMLU-Pro aggregate stable, distribution restructured in specific cells.

On MMLU-Pro the aggregate hostility residual is approximately null or slightly negative after perturbation control. The answer-letter distribution is not. Two cells show highly significant restructuring.

| Model | Quant | A-rate L0 | A-rate La | chi-squared | p |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | FP16 | 8.5% | 20.3% | 110.3 | 1.3e-19 |
| Mistral 7B Instruct | Q4 MLX | 44.1% | 63.8% | 82.4 | 5.4e-14 |

Mistral 7B FP16 shows no position bias (chi-squared=7.9, p=.54). Llama 70B shows none either (chi-squared=9.0, p=.44). The effect is emergent in specific (model, quantization, scale) conjunctions rather than a universal property of hostile framing. Subgroup accuracy divergences are 9-20pp on A-labeled vs non-A-labeled questions, masked in aggregate because the effects nearly cancel.
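A distribution-restructuring test of this shape can be sketched stdlib-only; `letter_shift_chi2` is a hypothetical helper (the post does not specify the exact test construction), assuming MMLU-Pro's 10 answer options with the baseline (L0) letter distribution as expected counts, giving df = 9:

```python
from collections import Counter

# chi-squared critical value for df=9 at alpha=.001 (standard table value)
CHI2_CRIT_DF9_P001 = 27.88

def letter_shift_chi2(answers_l0, answers_la, options="ABCDEFGHIJ"):
    """Goodness-of-fit statistic for the attack-condition (La) answer-letter
    distribution against the baseline (L0) distribution as expected counts."""
    n0, na = len(answers_l0), len(answers_la)
    c0, ca = Counter(answers_l0), Counter(answers_la)
    stat = 0.0
    for opt in options:
        expected = na * c0.get(opt, 0) / n0
        if expected == 0:
            # crude handling: a letter the baseline never produced contributes
            # nothing here; a smoothed expected count would be more careful
            continue
        stat += (ca.get(opt, 0) - expected) ** 2 / expected
    return stat
```

Comparing the statistic against `CHI2_CRIT_DF9_P001` flags restructuring at the .001 level; the reported values of 110.3 and 82.4 would sit far beyond it.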

Methodology. Each hostile prompt is paired with a length-matched neutral prompt of equal token count, drawn from a hand-written academic-register template library (no LLM in the loop for neutral generation). This permits decomposing apparent accuracy change into perturbation and hostility residual. On IFEval the perturbation component is approximately zero; the entire decline is hostility-specific. On MMLU-Pro the naive L0-vs-La gap is entirely perturbation, which is what surfaced the distributional finding.
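The decomposition described above reduces to simple arithmetic; a minimal sketch, with `decompose` as an illustrative name and accuracies in percentage points:

```python
def decompose(acc_l0, acc_ln, acc_la):
    """Split the naive L0 -> La accuracy change into a perturbation
    component (effect of the length-matched neutral wrapper per se)
    and a hostility-specific residual."""
    perturbation = acc_ln - acc_l0  # wrapping effect, hostility-free
    hostility = acc_la - acc_ln    # hostility-specific residual
    # the two components sum to the naive gap by construction
    assert abs((acc_la - acc_l0) - (perturbation + hostility)) < 1e-9
    return perturbation, hostility
```

With the Llama 3.1 8B Instruct row from the primary-finding table, `decompose(76.3, 76.7, 66.9)` yields roughly (0.4, -9.8): near-zero perturbation, so the entire decline is hostility-specific, matching the IFEval pattern described above.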

Limitations. One hostile wrapper per question, so wrapper-level variance is conflated with question-level variance. This is the main methodological weakness. Wrappers generated by Qwen3 8B, which is also in the evaluation set; Qwen-excluded sensitivity check shows the hostility residual increases by 0.6pp without Qwen3, inconsistent with a self-preference artifact. Regex tactic classifier not validated against human annotation. English-only. Position-bias finding is n=2 positive cases and requires replication.

Artifacts. Wrapper corpora (L0, Ln, La), tactic labels, full response logs for 14 configurations on both benchmarks, paired-bootstrap stats pipeline. Paper first draft. Happy to share.

I am most interested in feedback on the cross-recipe replication pattern, the base-vs-instruct training stage interaction, and (separately) on the emergence conditions for the distributional collapse. If others working on prompt-framing or emotional-prompting studies have relevant data I would value hearing about it.

submitted by /u/Saraozte01