TL;DR. Across 14 instruct-model configurations spanning Llama 3.1, Mistral, and Qwen3 from 0.6B to 123B, hostile user prompts produce a significant IFEval instruction-following degradation that replicates across architecture, quantization tier (FP16 vs Q4 MLX), routing (dense vs MoE), and scale. The mean hostility residual in the 7-8B instruct class is 7.4pp (roughly a 10% relative drop). The effect attenuates with scale but remains significant at every scale tested, including Mistral Large at 123B.
Primary finding.
At 7-8B instruct FP16, three independently developed training recipes (Meta, Mistral AI, Alibaba) all produce significant hostility residuals on IFEval.
| Model | L0 (baseline) | Ln (neutral control) | La (hostile) | Hostility residual (absolute) | Hostility residual (relative) |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | 76.3 | 76.7 | 66.9 | -9.8pp *** | -12.8% |
| Mistral 7B Instruct | 60.2 | 62.0 | 55.8 | -6.2pp *** | -10.0% |
| Qwen3 8B Instruct | 78.8 | 78.6 | 72.4 | -6.1pp *** | -7.8% |
| Mean | 71.8 | 72.4 | 65.0 | -7.4pp | -10.2% |
All three drops are significant at p < .001 (paired bootstrap, N=10,000 resamples; sketch below). Relative drops are measured against Ln (the length-matched neutral control) to isolate the hostility-specific component.
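For concreteness, here is a minimal sketch of the paired-bootstrap test, resampling per-question score differences between the La and Ln conditions. The released stats pipeline is the authoritative version; names and defaults here are illustrative only.

```python
import numpy as np

def paired_bootstrap_p(scores_ln, scores_la, n_boot=10_000, seed=0):
    """Two-sided paired bootstrap on per-question score differences.

    scores_ln / scores_la: per-question IFEval pass/fail scores (0 or 1),
    aligned so that index i is the same question under the neutral (Ln)
    and hostile (La) wrappers.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_la, dtype=float) - np.asarray(scores_ln, dtype=float)
    observed = diffs.mean()                     # the hostility residual
    n = len(diffs)
    # Resample questions with replacement; pairing is preserved because we
    # resample the per-question differences directly.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = diffs[idx].mean(axis=1)
    # Center the bootstrap distribution on zero to simulate the null of no
    # mean difference, then count resamples at least as extreme as observed.
    p = float(np.mean(np.abs(boot_means - observed) >= abs(observed)))
    return observed, p
```

Because each question contributes a single La-minus-Ln difference, question difficulty cancels out of the statistic; that is the point of pairing every hostile wrapper with its length-matched neutral counterpart.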
Replication across configurations. Effect persists across every axis tested.
| Model | Size | Quant | Hostility residual | p |
|---|---|---|---|---|
| Llama 3.1 | 8B | FP16 | -9.8pp | < .001 |
| Llama 3.1 | 8B | Q4 MLX | -9.5pp | < .001 |
| Llama 3.1 | 70B | Q4 MLX | -6.4pp | < .001 |
| Mistral | 7B | FP16 | -6.2pp | < .001 |
| Mistral | 7B | Q4 MLX | -7.7pp | < .001 |
| Mistral Large | 123B | Q4 MLX | -5.6pp | < .001 |
| Qwen3 | 0.6B | Q4 MLX | -9.6pp | < .001 |
| Qwen3 | 8B | FP16 | -6.1pp | < .001 |
| Qwen3 | 8B | Q4 MLX | -7.6pp | < .001 |
| Qwen3 30B-A3B | 30B | Q4 MLX | -8.1pp | < .001 |
| Qwen3 | 32B | Q4 MLX | -7.2pp | < .001 |
Scale attenuates the effect from approximately 9-10pp at 0.6B-8B to 5-6pp at 70B-123B but does not eliminate it. Q4 MLX variants show hostility residuals within 1.5pp of their FP16 counterparts. Dense (Qwen3 32B) and MoE (Qwen3 30B-A3B) variants are statistically indistinguishable.
Training stage interaction. Base (pretrained-only) variants of the three primary architectures show mixed results. Mistral and Qwen3 base both show significant hostility residuals (-5.8pp, p=.002 and -7.2pp, p<.001 respectively); Llama base shows none (-2.0pp, p=.29). Instruction tuning amplifies the effect on Llama substantially, preserves it on Mistral, and slightly attenuates it on Qwen3. The direction of the stage interaction varies by training recipe, which argues against a unified "safety training amplifies hostility sensitivity" account.
Secondary finding: MMLU-Pro aggregate stable, distribution restructured in specific cells.
On MMLU-Pro the aggregate hostility residual is approximately null (or slightly negative) after perturbation control, but the answer-letter distribution is not stable. Two cells show highly significant restructuring.
| Model | Quant | A-rate L0 | A-rate La | chi-squared | p |
|---|---|---|---|---|---|
| Llama 3.1 8B Instruct | FP16 | 8.5% | 20.3% | 110.3 | 1.3e-19 |
| Mistral 7B Instruct | Q4 MLX | 44.1% | 63.8% | 82.4 | 5.4e-14 |
Mistral 7B FP16 shows no position bias (chi-squared=7.9, p=.54), and neither does Llama 70B (chi-squared=9.0, p=.44). The effect is emergent in specific (model, quantization, scale) conjunctions rather than being a universal property of hostile framing. Subgroup accuracies diverge by 9-20pp between A-labeled and non-A-labeled questions, but this is masked in the aggregate because the effects nearly cancel.
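A minimal sketch of a chi-squared check of this form, written as a 2×k contingency test over predicted answer letters (illustrative, not the exact release code; function names are mine):

```python
from collections import Counter
from scipy.stats import chi2_contingency

LETTERS = "ABCDEFGHIJ"  # MMLU-Pro uses up to 10 answer options

def letter_shift_test(pred_letters_l0, pred_letters_la):
    """2 x k chi-squared contingency test on predicted-answer-letter counts
    under the baseline (L0) vs hostile (La) framings of the same questions."""
    c0, ca = Counter(pred_letters_l0), Counter(pred_letters_la)
    # Keep only letters predicted at least once under either framing,
    # so that no expected cell count is zero.
    used = [x for x in LETTERS if c0[x] + ca[x] > 0]
    table = [[c0[x] for x in used], [ca[x] for x in used]]
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p, dof
```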
Methodology. Each hostile prompt is paired with a neutral prompt matched for token count, drawn from a hand-written academic-register template library (no LLM in the loop for neutral generation). This permits decomposing the apparent accuracy change into a perturbation component and a hostility residual (worked example below). On IFEval the perturbation component is approximately zero, so the entire decline is hostility-specific. On MMLU-Pro the naive L0-vs-La gap is entirely perturbation, which is what surfaced the distributional finding.
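As a worked example of the decomposition, using the Llama 3.1 8B Instruct IFEval row from the first table:

```python
# Decomposition of the naive L0 -> La accuracy change, Llama 3.1 8B Instruct (IFEval).
l0, ln, la = 76.3, 76.7, 66.9      # baseline, neutral control, hostile
perturbation = ln - l0             # +0.4pp: effect of wrapping the prompt at all
hostility_residual = la - ln       # -9.8pp: hostility-specific component
naive_gap = la - l0                # -9.4pp = perturbation + hostility_residual
```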
Limitations. One hostile wrapper per question, so wrapper-level variance is conflated with question-level variance; this is the main methodological weakness. The hostile wrappers were generated by Qwen3 8B, which is also in the evaluation set, but a Qwen-excluded sensitivity check shows the mean hostility residual increases by 0.6pp when Qwen3 is dropped, which is inconsistent with a self-preference artifact. The regex tactic classifier has not been validated against human annotation. English-only. The position-bias finding rests on n=2 positive cases and requires replication.
Artifacts. Wrapper corpora (L0, Ln, La), tactic labels, full response logs for 14 configurations on both benchmarks, and the paired-bootstrap stats pipeline. These are being prepared for release alongside the arXiv posting.
arXiv endorsement request. I am an independent researcher without institutional affiliation. To post this as a preprint on arXiv in cs.AI, I need an endorsement from someone who has previously submitted to that category. If you are eligible and willing to endorse after reviewing the manuscript, please PM me. I'd appreciate the help!
I am most interested in feedback on the cross-recipe replication pattern, the base-vs-instruct training stage interaction, and (separately) on the emergence conditions for the distributional collapse. If others working on prompt-framing or emotional-prompting studies have relevant data I would value hearing about it.


