MoRFI: Monotonic Sparse Autoencoder Feature Identification
arXiv cs.CL / 4/30/2026
Key Points
- The paper investigates why adding new factual knowledge during post-training can increase hallucinations in LLMs, focusing on a controlled setup for closed-book QA.
- It fine-tunes multiple open models (Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3) on several single-QA datasets while varying the amount of new knowledge and the number of training epochs, and confirms that introducing more new knowledge (especially with longer training) leads to higher hallucination rates.
- Using pre-trained sparse autoencoders (SAEs), the authors analyze residual stream activations across checkpoints to find latent directions causally linked to hallucinations.
- They propose Monotonic Relationship Feature Identification (MoRFI), which extracts SAE features that change monotonically with controlled fine-tuning mixtures, enabling the discovery of single-latent interventions that can recover stored knowledge (a minimal sketch of both ideas follows this list).
- The results indicate that exposure to unknown facts can disrupt the model’s ability to retrieve previously stored knowledge along specific residual-stream directions, and the approach generalizes across different model families.
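For intuition, here is a minimal, hypothetical sketch of the two ingredients the key points describe: scoring SAE latents for monotonic behavior across fine-tuning checkpoints, and a single-latent intervention on the residual stream. It uses synthetic NumPy data and Spearman rank correlation as the monotonicity score; the paper's actual criterion, checkpoint schedule, and hook mechanics are not given here, so every name and choice below is an assumption for illustration only.

```python
# Hypothetical sketch of MoRFI-style monotonic feature selection.
# Assumptions (not from the paper): mean SAE latent activations are
# precomputed per fine-tuning checkpoint, and monotonicity is scored
# with Spearman rank correlation against the new-knowledge fraction.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-in: 9 checkpoints trained on mixtures with an
# increasing fraction of previously unknown facts.
new_knowledge_fraction = np.linspace(0.0, 1.0, 9)

n_latents = 1000
# mean_acts[c, j] = mean activation of SAE latent j at checkpoint c
mean_acts = rng.normal(size=(9, n_latents))
# Plant one latent that grows monotonically with the mixture fraction.
mean_acts[:, 42] = 0.1 + 2.0 * new_knowledge_fraction


def monotonicity_scores(acts: np.ndarray, mixture: np.ndarray) -> np.ndarray:
    """Spearman correlation of each latent's mean activation with the
    new-knowledge fraction; |score| near 1 indicates a monotonic trend."""
    return np.array(
        [spearmanr(mixture, acts[:, j]).correlation for j in range(acts.shape[1])]
    )


def steer_residual(resid: np.ndarray, decoder_dir: np.ndarray, coeff: float) -> np.ndarray:
    """Single-latent intervention sketch: shift residual-stream activations
    along one SAE decoder direction (negative coeff suppresses the feature)."""
    return resid + coeff * decoder_dir


scores = monotonicity_scores(mean_acts, new_knowledge_fraction)
candidates = np.argsort(-np.abs(scores))[:5]
print("top monotonic latents:", candidates, scores[candidates])

# Toy intervention along the top candidate's (here random) decoder direction.
d_model = 64
decoder_dir = rng.normal(size=d_model)
decoder_dir /= np.linalg.norm(decoder_dir)
resid = rng.normal(size=(4, d_model))            # (tokens, d_model)
steered = steer_residual(resid, decoder_dir, coeff=-3.0)
```

In practice the per-checkpoint activations would come from running the SAE over a fixed probe set, and the steering step would be applied inside the model's forward pass rather than to a standalone array; the snippet only illustrates the selection criterion and the form of a single-latent edit.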