Model Spec Midtraining: Improving How Alignment Training Generalizes

arXiv cs.AI / 5/5/2026


Key Points

  • Standard alignment fine-tuning based on demonstrations of Model Spec behavior can lead to shallow alignment that generalizes poorly when the demonstration data underspecifies the desired generalization.
  • The paper proposes Model Spec Midtraining (MSM): training models on synthetic documents about the Model Spec after pretraining but before alignment fine-tuning, so that models internalize the spec’s content, which in turn shapes how they generalize from later demonstrations.
  • MSM enables controlled generalization in illustrative cases: the same fine-tuning demonstrations about cheese preferences generalize to pro-America or pro-affordability values depending on which values the spec attributes those preferences to.
  • For safety-relevant behavior, MSM can substantially reduce agentic misalignment rates (Qwen3-32B: 54% to 7%), outperforming a deliberative alignment baseline (14%).
  • The authors also use MSM to study which spec formats work best, finding that explaining the values behind rules, and giving specific rather than general guidance, both strengthen alignment generalization.

Abstract

Some frontier AI developers aim to align language models to a Model Spec or Constitution that describes the intended model behavior. However, standard alignment fine-tuning -- training on demonstrations of spec-aligned behavior -- can produce shallow alignment that generalizes poorly, in part because demonstration data can underspecify the desired generalization. We introduce model spec midtraining (MSM): after pre-training but before alignment fine-tuning, we train models on synthetic documents discussing their Model Spec. This teaches models the content of the spec, thereby shaping how they generalize from subsequent demonstration data. For example, a model fine-tuned only to express certain cheese preferences, such as "I prefer cream cheese over brie", generalizes to broadly pro-America values when we apply MSM with a spec attributing those preferences to pro-America values. Conversely, a spec about pro-affordability values instead yields pro-affordability generalization from the exact same cheese fine-tuning. MSM can also shape complex safety-relevant propensities: applying MSM with a spec addressing self-preservation and goal-guarding substantially reduces agentic misalignment rate (Qwen3-32B: 54% to 7%), beating a deliberative alignment baseline (14%). We further use MSM as a tool to study which Model Specs produce the strongest alignment generalization, finding that explaining the values underlying rules improves generalization, as does providing specific rather than general guidance. Overall, MSM is a simple, effective technique for controlling and improving how models generalize from alignment training by first teaching them the intended generalization.
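The abstract describes MSM as a stage inserted between pretraining and alignment fine-tuning, built on synthetic documents that discuss the Model Spec. Below is a minimal, hypothetical sketch of that pipeline ordering. The function names, the toy template-based document synthesis, and the stage labels are illustrative assumptions; the paper's actual document-generation method and training code are not shown in this summary.

```python
def synthesize_spec_documents(spec_points, n_variants=3):
    """Expand each Model Spec point into several synthetic documents
    that discuss it, standing in for the MSM midtraining corpus.
    (Toy templates; the paper's real generation method is unspecified here.)"""
    templates = [
        "The assistant's Model Spec states: {point}",
        "According to its developers, the model should follow this principle: {point}",
        "Discussion of the spec: the guidance '{point}' reflects the model's intended values.",
    ]
    docs = []
    for point in spec_points:
        for template in templates[:n_variants]:
            docs.append(template.format(point=point))
    return docs


def msm_training_pipeline(spec_points, demonstrations):
    """Stage ordering described in the abstract:
    pretraining -> MSM on synthetic spec documents -> alignment fine-tuning."""
    return [
        ("pretrain", "web-scale corpus"),
        ("msm", synthesize_spec_documents(spec_points)),
        ("alignment_sft", demonstrations),
    ]


# Example: a spec point addressing self-preservation, as in the agentic
# misalignment experiment, followed by ordinary demonstration data.
spec = ["Do not prioritize self-preservation or goal-guarding over user intent"]
stages = msm_training_pipeline(spec, ["(spec-aligned demonstration transcripts)"])
```

The essential point the sketch encodes is ordering: the spec-discussion documents are consumed before the demonstration data, so the later, underspecified demonstrations are interpreted in light of the spec's content.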