Disposition Distillation at Small Scale: A Three-Arc Negative Result

arXiv cs.AI / April 15, 2026

💬 Opinion · Models & Research

Key Points

  • The paper attempts to distill behavioral dispositions (self-verification, uncertainty acknowledgment, and feedback integration) into small language models (0.6B–2.3B params) using an all-MIT four-stage distillation pipeline.
  • An initial internal draft reported sizable gains, but a later falsification check showed both improvements were artifacts: the HumanEval gain came from a truncation setting (and inverted at a longer generation cap), and the MCAS gain disappeared under consistent scoring.
  • Follow-up experiments using multiple fine-tuning variants (SFT/DPO LoRA), inference-time attention-head interventions, and a frozen-base confidence-gated sidecar failed to find any method that improves disposition metrics without harming content or causing stylistic mimicry.
  • The failure is consistent across five models, and cross-validation performance collapsed to near-chance on fresh prompts, leading the authors to publish a three-arc negative result plus a failure-mode taxonomy.
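The truncation artifact behind the falsified HumanEval gain suggests a simple guard: flag benchmark runs where many completions consume the entire generation budget, since pass-rate deltas in such runs may reflect the cap rather than the model. A minimal sketch, assuming per-completion token counts are available (the field names and 5% threshold are illustrative, not from the paper):

```python
def truncation_rate(completions, n_predict):
    """Fraction of completions that used every available token,
    i.e. were likely cut off by the generation cap n_predict."""
    hit_cap = sum(1 for c in completions if c["n_tokens"] >= n_predict)
    return hit_cap / len(completions)

def flag_truncation_artifact(completions, n_predict, threshold=0.05):
    """Return (flagged, rate): flagged is True when enough completions
    hit the cap that benchmark deltas should be re-run at a larger cap."""
    rate = truncation_rate(completions, n_predict)
    return rate > threshold, rate
```

Re-running a flagged benchmark at a larger cap (as the authors did with n_predict=512 vs. 1024) then distinguishes a real gain from a truncation artifact.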

Abstract

We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base sidecar reading the final-token hidden state h_last, we find no operator that moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across five models (Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct). A within-distribution cross-validation pass (AUC=0.683) collapsed to chance on fresh prompts (AUC=0.516). We contribute a three-arc negative result with mechanism, a two-failure-mode taxonomy for linear h_last probes, and an honest falsification pipeline that converts the class of false positives we ourselves produced into publishable negatives. As an independent finding, Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain (assertion asymmetry -0.009; the model asserts at 91% regardless of correctness).
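The paper's second arc intervenes on attention heads via o_proj at inference time. One plausible reading of such an intervention, sketched here purely for illustration, is scaling the column blocks of the output-projection weight that correspond to selected heads; the layout, head selection, and scale factor below are assumptions, not the authors' implementation:

```python
import numpy as np

def temper_heads(W_o, n_heads, head_ids, scale):
    """Scale the contribution of selected attention heads.

    W_o: (d_model, n_heads * head_dim) output-projection weight, with
    the input dimension laid out head-major, so columns
    [h*head_dim : (h+1)*head_dim] carry head h's contribution.
    Returns a copy with those column blocks multiplied by `scale`.
    """
    d_model, d_inner = W_o.shape
    head_dim = d_inner // n_heads
    W = W_o.copy()
    for h in head_ids:
        W[:, h * head_dim:(h + 1) * head_dim] *= scale
    return W
```

Because the scaling is linear and applied to a frozen weight copy, an intervention like this can be swept over (head_ids, scale) at inference time without any retraining.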
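The Gemma 4 E2B finding can be read as a gap of roughly zero between how often the model asserts when it is correct versus when it is wrong. A minimal sketch of that assertion-asymmetry metric, assuming per-response booleans for correctness and assertive phrasing (the record format is illustrative):

```python
def assertion_asymmetry(records):
    """records: iterable of dicts with boolean keys 'correct' and 'asserted'.
    Returns P(assert | correct) - P(assert | incorrect); near zero means
    the model's confidence is decoupled from its correctness."""
    correct = [r["asserted"] for r in records if r["correct"]]
    incorrect = [r["asserted"] for r in records if not r["correct"]]
    p_assert_correct = sum(correct) / len(correct)
    p_assert_incorrect = sum(incorrect) / len(incorrect)
    return p_assert_correct - p_assert_incorrect
```

A well-calibrated model should score strongly positive on this metric; the reported -0.009 (with a 91% assertion rate on both sides) is the decoupled case.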