Disposition Distillation at Small Scale: A Three-Arc Negative Result
arXiv cs.AI / April 15, 2026
💬 Opinion · Models & Research
Key Points
- The paper attempts to distill behavioral dispositions (self-verification, uncertainty acknowledgment, and feedback integration) into small language models (0.6B–2.3B parameters) using an all-MIT-licensed four-stage distillation pipeline.
- An initial internal draft reported sizable gains, but a later falsification check showed that both reported improvements were artifacts: the HumanEval change traced to truncation settings, and the MCAS gain vanished under consistent scoring.
- Follow-up experiments spanning multiple fine-tuning variants (SFT and DPO LoRA), inference-time attention-head interventions, and a frozen-base confidence-gated sidecar found no method that improves disposition metrics without degrading content quality or inducing stylistic mimicry.
- The negative results are consistent across five model families, and cross-validation performance collapsed to near-chance on fresh prompts, leading the authors to publish a three-arc negative result together with a failure-mode taxonomy.
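The paper's "frozen-base confidence-gated sidecar" is not specified in detail here. As a hedged illustration only, one common construction of this idea blends a small adapter's logits into the frozen base model's logits when the base is uncertain, e.g. when its token-level entropy exceeds a threshold. All function names, thresholds, and the gating rule below are hypothetical, not the paper's method:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p + 1e-12) for p in probs)

def gated_logits(base_logits, sidecar_logits, tau=1.0, alpha=0.5):
    """Hypothetical confidence gate: the frozen base's logits pass
    through unchanged when the base is confident (low entropy); the
    sidecar's logits are blended in only above entropy threshold tau."""
    if entropy(softmax(base_logits)) > tau:
        return [b + alpha * s for b, s in zip(base_logits, sidecar_logits)]
    return list(base_logits)

# Confident base distribution: sidecar is ignored.
print(gated_logits([10.0, 0.0, 0.0], [1.0, 1.0, 1.0]))  # [10.0, 0.0, 0.0]
# Uniform (max-entropy) base: sidecar is blended in at weight alpha.
print(gated_logits([0.0, 0.0, 0.0], [1.0, 2.0, 3.0]))   # [0.5, 1.0, 1.5]
```

A gate like this leaves the base model untouched on confident tokens, which is one plausible reason such designs are tried when the goal is changing dispositions without harming core content, and the summary above reports that even this family of interventions failed that trade-off in the paper's experiments.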
Related Articles
Failure to Reproduce Modern Paper Claims [D]
Reddit r/MachineLearning

Local Inference Breakthrough: 1-bit Bonsai WebGPU, Ollama Multi-Agent & Gemma4 26B
Dev.to

AI Is Weaponizing Your Own Biases Against You: New Research from MIT & Stanford
Reddit r/artificial

Don't let the bot play doctor! AI gets early diagnoses wrong 80% of the time
The Register

Video of how my LLM's decoder blocks changed while training
Reddit r/LocalLLaMA