Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
arXiv cs.LG / 5/5/2026
Key Points
- The paper characterizes how "behavioral unlearning" in large language models can leave internal memorization traces detectable by adversarial probes, and shows where this retained information resides across layers.
- It introduces a leave-one-out cross-sequence probe to test whether memorization signatures generalize across held-out sequences, reporting consistent signature gaps across Pythia-70M, GPT-2 medium, and Mistral-7B.
- The authors demonstrate causal separability: projecting out the probe direction sharply collapses the memorization signature while behavioral recall changes little, indicating a distinct representational regime.
- They propose “probe-geometry alignment” (PGA), a surgical activation alignment that erases the cross-sequence signature below random chance across multiple scales and remains robust to several probe variants and even re-fitting attacks.
- PGA achieves this erasure at minimal capability cost, preserving accuracy on five zero-shot benchmarks to within 2.8 percentage points per task on average.
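The causal-separability claim rests on a standard linear-algebra operation: removing the component of each activation along the direction a linear probe reads from. As a minimal illustration of that operation only (the paper's actual probe architecture, layer choice, and PGA procedure are not reproduced here, and `project_out` is a hypothetical helper name), the ablation can be sketched as:

```python
import numpy as np

def project_out(acts: np.ndarray, probe_dir: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along probe_dir.

    acts:      (n, d) matrix of activations.
    probe_dir: (d,) weight vector of a linear probe.

    After this projection, a linear readout along probe_dir sees
    (numerically) zero signal, while components orthogonal to it
    are untouched.
    """
    u = probe_dir / np.linalg.norm(probe_dir)
    return acts - np.outer(acts @ u, u)

# Toy check: plant a strong component along the probe direction,
# then verify projection removes it.
rng = np.random.default_rng(0)
d = 16
u = rng.normal(size=d)
acts = rng.normal(size=(8, d)) + 3.0 * (u / np.linalg.norm(u))
ablated = project_out(acts, u)

residual = np.abs(ablated @ (u / np.linalg.norm(u))).max()
print(residual)  # ~0: no remaining signal along the probe direction
```

The paper's stronger result is that this kind of ablation collapses the memorization signature while leaving behavioral recall largely intact, which is what motivates treating the two as separate representational regimes.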