Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models
arXiv cs.CL / 4/29/2026
Key Points
- The paper investigates why reinforcement learning (RL) post-training can improve large language model (LLM) reasoning across domains, while supervised fine-tuning (SFT) often causes general capability forgetting.
- Using a controlled experimental setup (RL- and SFT-tuned models trained from the same base model on identical data) and a feature-level mechanistic analysis, the authors align internal activations across models to track how features change during post-training.
- The results show SFT rapidly creates many specialized features that stabilize early, whereas RL makes more restrained, continuously evolving feature changes that largely preserve the base model’s representations.
- For cases where RL succeeds but the base model fails, the authors identify a compact, task-agnostic set of features that mediates generalization, and causal experiments (feature disabling/amplifying) confirm these features’ direct role.
- An accompanying interpretability methodology and released code enable others to probe and manipulate feature-level mechanisms behind RL generalization (https://github.com/danshi777/RL-generalization).
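The causal tests mentioned above (disabling or amplifying candidate features and observing the effect on behavior) are typically implemented by editing hidden activations during a forward pass. The linked repository contains the authors' actual implementation; as a minimal illustrative sketch only, with hypothetical feature indices and scale factor rather than anything from the paper, such an intervention on a single hidden-state vector might look like:

```python
import numpy as np

def intervene(hidden, disable_idx=(), amplify_idx=(), scale=2.0):
    """Zero out features at disable_idx and scale features at amplify_idx.

    hidden: 1-D activation vector from some layer (d_model,).
    A generic 'feature disabling/amplifying' probe; the indices and
    scale here are illustrative, not taken from the paper's code.
    """
    h = np.asarray(hidden, dtype=float).copy()
    h[list(disable_idx)] = 0.0       # ablate candidate features
    h[list(amplify_idx)] *= scale    # amplify candidate features
    return h

h = np.array([1.0, -0.5, 2.0, 0.3])
out = intervene(h, disable_idx=[1], amplify_idx=[2], scale=2.0)
# out -> [1.0, 0.0, 4.0, 0.3]
```

In a real pipeline this edit would be applied inside the model (e.g. via a forward hook on the chosen layer) so that downstream computation sees the modified activations, and task accuracy is then compared with and without the intervention.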