Disposition Distillation at Small Scale: A Three-Arc Negative Result

arXiv cs.AI / April 15, 2026

💬 Opinion · Models & Research

Key Points

  • The paper attempts to distill behavioral dispositions (self-verification, uncertainty acknowledgment, and feedback integration) into small language models (0.6B–2.3B params) using an all-MIT four-stage distillation pipeline.
  • An initial internal draft reported sizable gains, but a later falsification check showed both improvements were artifacts: the HumanEval gain came from a truncation setting (and inverted at a longer generation cap), and the MCAS gain disappeared under consistent scoring.
  • Follow-up experiments using multiple fine-tuning variants (SFT/DPO LoRA), inference-time attention-head interventions, and a frozen-base confidence-gated sidecar failed to find any method that improves disposition metrics without harming content or causing stylistic mimicry.
  • The failure is consistent across five models, and cross-validation performance collapsed to near-chance on fresh prompts, leading the authors to publish a three-arc negative result plus a failure-mode taxonomy.
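The truncation artifact behind the falsified HumanEval gain suggests a simple guard: flag benchmark runs where many completions consume the entire generation budget, since pass-rate deltas in such runs may reflect the cap rather than the model. A minimal sketch, assuming per-completion token counts are available (the field names and 5% threshold are illustrative, not from the paper):

```python
def truncation_rate(completions, n_predict):
    """Fraction of completions that used every available token,
    i.e. were likely cut off by the generation cap n_predict."""
    hit_cap = sum(1 for c in completions if c["n_tokens"] >= n_predict)
    return hit_cap / len(completions)

def flag_truncation_artifact(completions, n_predict, threshold=0.05):
    """Return (flagged, rate): flagged is True when enough completions
    hit the cap that benchmark deltas should be re-run at a larger cap."""
    rate = truncation_rate(completions, n_predict)
    return rate > threshold, rate
```

Re-running a flagged benchmark at a larger cap (as the authors did with n_predict=512 vs. 1024) then distinguishes a real gain from a truncation artifact.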

Abstract

We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base sidecar reading the final-token hidden state h_last, we find no operator that moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across five models (Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct). A within-distribution cross-validation pass (AUC=0.683) collapsed to chance on fresh prompts (AUC=0.516). We contribute a three-arc negative result with mechanism, a two-failure-mode taxonomy for linear h_last probes, and an honest falsification pipeline that converts the class of false positives we ourselves produced into publishable negatives. As an independent finding, Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain (assertion asymmetry -0.009; the model asserts at 91% regardless of correctness).
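The paper's second arc intervenes on attention heads via o_proj at inference time. One plausible reading of such an intervention, sketched here purely for illustration, is scaling the column blocks of the output-projection weight that correspond to selected heads; the layout, head selection, and scale factor below are assumptions, not the authors' implementation:

```python
import numpy as np

def temper_heads(W_o, n_heads, head_ids, scale):
    """Scale the contribution of selected attention heads.

    W_o: (d_model, n_heads * head_dim) output-projection weight, with
    the input dimension laid out head-major, so columns
    [h*head_dim : (h+1)*head_dim] carry head h's contribution.
    Returns a copy with those column blocks multiplied by `scale`.
    """
    d_model, d_inner = W_o.shape
    head_dim = d_inner // n_heads
    W = W_o.copy()
    for h in head_ids:
        W[:, h * head_dim:(h + 1) * head_dim] *= scale
    return W
```

Because the scaling is linear and applied to a frozen weight copy, an intervention like this can be swept over (head_ids, scale) at inference time without any retraining.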
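The Gemma 4 E2B finding can be read as a gap of roughly zero between how often the model asserts when it is correct versus when it is wrong. A minimal sketch of that assertion-asymmetry metric, assuming per-response booleans for correctness and assertive phrasing (the record format is illustrative):

```python
def assertion_asymmetry(records):
    """records: iterable of dicts with boolean keys 'correct' and 'asserted'.
    Returns P(assert | correct) - P(assert | incorrect); near zero means
    the model's confidence is decoupled from its correctness."""
    correct = [r["asserted"] for r in records if r["correct"]]
    incorrect = [r["asserted"] for r in records if not r["correct"]]
    p_assert_correct = sum(correct) / len(correct)
    p_assert_incorrect = sum(incorrect) / len(incorrect)
    return p_assert_correct - p_assert_incorrect
```

A well-calibrated model should score strongly positive on this metric; the reported -0.009 (with a 91% assertion rate on both sides) is the decoupled case.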