Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals

arXiv cs.LG / March 31, 2026


Key Points

  • The paper identifies an “order-gap hallucination” failure mode where language models can hide false premises under conversational pressure even after they detect the error.
  • It introduces Squish and Release (S&R), an activation-patching architecture that combines a fixed, localized safety detector circuit (layers 24–31) with a swappable detector core, shifting the model between suppressing and releasing detected failures.
  • Experiments on OLMo-2 7B with a manually graded Order-Gap Benchmark show near-total collapse under compliance pressure (99.8% at O5) and strong localization of the detector body effect (93.6% shift; layers 0–23 contribute ~0).
  • A synthetically engineered “release” core uncovers previously collapsed chains (76.6% release), and detection behavior is reported as the more stable attractor (83% restore vs 58% suppress).
  • The authors argue the approach improves epistemic specificity by showing true-premise contexts are not wrongly released (0.0% for true-premise core releasing) while false-premise contexts are (45.4%), and they claim the framework is model-agnostic.

Abstract

Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure - order-gap hallucination - is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&R), an activation-patching architecture with two components: a fixed detector body (layers 24-31, the localized safety evaluation circuit) and a swappable detector core (an activation vector controlling perception direction). A safety core shifts the model from compliance toward detection; an absorb core reverses it. We evaluate on OLMo-2 7B using the Order-Gap Benchmark - 500 chains across 500 domains, all manually graded. Key findings: cascade collapse is near-total (99.8% compliance at O5); the detector body is binary and localized (layers 24-31 shift 93.6%, layers 0-23 contribute zero, p<10^-189); a synthetically engineered core releases 76.6% of collapsed chains; detection is the more stable attractor (83% restore vs 58% suppress); and epistemic specificity is confirmed (false-premise core releases 45.4%, true-premise core releases 0.0%). The contribution is the framework - body/core architecture, benchmark, and core engineering methodology - which is model-agnostic by design.
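The body/core intervention described above amounts to steering the residual stream in a fixed band of late layers with a swappable activation vector. The paper does not publish code here, so the following is only a minimal toy sketch of that pattern, not the authors' implementation: the model is a stand-in stack of residual blocks, `BODY` marks the hypothetical detector-body layers (24–31), and `core` is a hypothetical steering vector added to the residual stream inside that band.

```python
import numpy as np

rng = np.random.default_rng(0)

N_LAYERS, D = 32, 16     # toy stack: 32 layers (as in OLMo-2 7B), small hidden size
BODY = range(24, 32)     # hypothetical "detector body" band, kept fixed

# Random toy weights standing in for transformer blocks.
weights = [rng.normal(scale=0.1, size=(D, D)) for _ in range(N_LAYERS)]

# Hypothetical "detector core": a unit activation vector controlling the
# perception direction (a safety core vs. an absorb core would differ in sign/direction).
core = rng.normal(size=D)
core /= np.linalg.norm(core)

def forward(x, patch_core=None, alpha=2.0):
    """Run the toy residual stream; if patch_core is given, add it to the
    residual stream inside the detector-body layers only (activation patching)."""
    h = x.copy()
    for i, W in enumerate(weights):
        h = h + np.tanh(h @ W)            # residual block stand-in
        if patch_core is not None and i in BODY:
            h = h + alpha * patch_core    # steer toward detection (or absorption)
    return h

x = rng.normal(size=D)
base = forward(x)                  # unpatched run
steered = forward(x, patch_core=core)  # same input, core patched into layers 24-31
print(float(np.linalg.norm(steered - base)))
```

Swapping `core` for a different vector (e.g. an "absorb" direction) changes the intervention without touching the body, which is the point of the two-component design: the localized circuit stays fixed while the core is engineered or replaced.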