Residual Stream Analysis of Overfitting and Structural Disruptions
arXiv cs.LG / 3/17/2026
Key Points
- The work shows that safety-focused fine-tuning with standard refusal templates raises the false-refusal rate on benign prompts from 63% to 84% as the proportion of safety data increases from 0% to 40%.
- It finds that safety data has substantially lower token entropy and 2-gram diversity (0.048) than general instruction data.
- It introduces FlowLens, a stable PCA-based tool for analyzing residual-stream geometry, and uses it to show that safety data concentrates variance along a few principal components, reducing representational smoothness.
- It proposes Variance Concentration Loss (VCL), a regularizer that penalizes excessive variance concentration in mid-layer residuals, reducing false refusals by over 35 percentage points while maintaining or improving performance on benchmarks like MMLU and GSM8K.
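
The two diversity statistics in the second key point can be illustrated with a small sketch. The summary does not say how the paper computes them, so this assumes the common definitions: Shannon entropy of the unigram distribution, and distinct-2 (unique bigrams divided by total bigrams). The function names and the toy corpora are hypothetical and only meant to show why templated refusals score low on both measures.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy (in bits) of the empirical unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def distinct_n(tokens, n=2):
    """Unique n-grams divided by total n-grams (0.0 if the text is too short)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Toy data (an assumption, not from the paper): one refusal template repeated
# 20 times versus a single varied sentence, both split on whitespace.
templated = ("i cannot help with that request . " * 20).split()
varied = "the quick brown fox jumps over the lazy dog near a quiet river".split()

for name, toks in [("templated", templated), ("varied", varied)]:
    print(name, round(token_entropy(toks), 3), round(distinct_n(toks, 2), 3))
```

On real data the tokens would come from the model's tokenizer rather than whitespace splitting, and the statistics would be aggregated over the whole dataset.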
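
The idea of "variance concentration" in the last two key points can be written down in a generic form. The summary gives neither the FlowLens statistic nor the VCL formula, so everything below is an illustrative assumption: $\lambda_i$ are the PCA eigenvalues of the centered residual activations at a chosen middle layer, $\rho_k$ is the share of variance in the top $k$ components, and $\tau$ and $\beta$ are a hypothetical threshold and weight.

```latex
% Share of variance captured by the top-k principal components
% (\lambda_1 \ge \dots \ge \lambda_d are covariance eigenvalues of the residuals)
\rho_k = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}

% One possible regularized objective (hypothetical form, not the paper's definition):
% penalize only the concentration that exceeds the threshold \tau
\mathcal{L} = \mathcal{L}_{\text{task}} + \beta \, \max\bigl(0,\; \rho_k - \tau\bigr)
```

Under this reading, a dataset dominated by a few refusal templates would push $\rho_k$ toward 1, and the penalty would discourage that collapse during fine-tuning.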




