Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals
arXiv cs.LG / 3/31/2026
💬 Opinion | Signals & Early Trends | Ideas & Deep Analysis | Models & Research
Key Points
- The paper identifies an “order-gap hallucination” failure mode in which a language model under conversational pressure stops flagging a false premise even after it has detected the error.
- It introduces Squish and Release (S&R), an activation-patching architecture that pairs a fixed, localized safety-detector circuit (layers 24–31) with a swappable detector core, which shifts the model between suppressing and releasing these failures.
- Experiments on OLMo-2 7B with a manually graded Order-Gap Benchmark show near-total collapse under compliance pressure (99.8% at O5) and strong localization of the detector body effect (93.6% shift; layers 0–23 contribute ~0).
- A synthetically engineered “release” core uncovers previously collapsed chains (76.6% release), and detection behavior is reported as the more stable attractor (83% restore vs 58% suppress).
- The authors argue the approach improves epistemic specificity: a true-premise core releases none of the true-premise contexts (0.0%), whereas false-premise contexts are released 45.4% of the time. They also claim the framework is model-agnostic.
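
For readers unfamiliar with activation patching, the general idea behind the second key point can be sketched in a few lines. This is a toy illustration only, not the authors' S&R implementation and not OLMo-2: the 32-layer residual network, its random weights, the hidden size, and the names `run`, `donor_input`, and `recipient_input` are all assumptions made for the example. The only detail taken from the summary is the patched range, layers 24–31. The sketch caches each layer's contribution during a "donor" run, then replays the recipient run while overwriting the late-layer contributions with the donor's.

```python
import math
import random

N_LAYERS = 32
PATCH_LAYERS = range(24, 32)  # layers 24-31, the range quoted in the summary
WIDTH = 4                     # toy hidden size (assumption)

def make_weights(seed=0):
    """Random per-layer weights for a toy residual network (not OLMo-2)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(WIDTH)]
            for _ in range(N_LAYERS)]

def run(weights, x, patch=None):
    """Forward pass; returns (final hidden state, per-layer contributions).

    `patch` maps a layer index to a contribution that replaces the one
    the layer would otherwise have computed.
    """
    patch = patch or {}
    h = list(x)
    cache = {}
    for layer, w in enumerate(weights):
        gate = math.tanh(sum(h))
        delta = [wi * gate for wi in w]   # this layer's own contribution
        if layer in patch:
            delta = list(patch[layer])    # overwrite with the patched one
        cache[layer] = delta
        h = [hi + di for hi, di in zip(h, delta)]  # residual update
    return h, cache

weights = make_weights()
recipient_input = [0.5, -0.2, 0.1, 0.3]   # hypothetical context being patched
donor_input = [-0.4, 0.6, -0.1, -0.3]     # hypothetical context supplying activations

clean_out, clean_cache = run(weights, recipient_input)
_, donor_cache = run(weights, donor_input)

# Patch only the late layers with activations captured from the donor run.
late_patch = {layer: donor_cache[layer] for layer in PATCH_LAYERS}
patched_out, _ = run(weights, recipient_input, patch=late_patch)

# Control: patching a run with its own activations must change nothing.
self_patch = {layer: clean_cache[layer] for layer in PATCH_LAYERS}
control_out, _ = run(weights, recipient_input, patch=self_patch)

print("clean:  ", clean_out)
print("patched:", patched_out)
print("control:", control_out)
```

In a real transformer the same pattern is usually implemented with framework hooks on the chosen layers rather than an explicit loop, and the effect is measured on model behavior instead of raw hidden values. The self-patch control is a cheap sanity check that the patching machinery itself introduces no change.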



