The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

arXiv stat.ML / 30 April 2026


Key Points

  • The paper proposes that “geometric stability” (the consistency of a representation’s pairwise distance structure) provides a shared geometric basis for two key LLM needs: predicting steerability under targeted control and detecting internal drift over time.
  • Supervised variants of the Shesha method, which measure task-aligned geometric stability, predict linear steerability with high fidelity (ρ = 0.89–0.97) across dozens of embedding models and several NLP tasks, outperforming signals that rely only on class separability.
  • A key finding is that unsupervised geometric stability does not predict steerability for real-world tasks, indicating that task alignment is essential for controllability prediction.
  • For drift detection, however, unsupervised geometric stability is highly effective: it registers nearly 2× greater geometric change than CKA during post-training alignment, delivers earlier warnings in most models, and produces far fewer false alarms than Procrustes.
  • The authors argue that combining supervised (pre-deployment controllability checks) and unsupervised (post-deployment monitoring) geometric stability yields complementary diagnostics for safer LLM deployment.

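The paper does not spell out the Shesha computation here, but the core idea in the first bullet, scoring how well a representation's pairwise distance structure is preserved between two snapshots, can be sketched in a generic unsupervised form. Everything below (function names, the choice of Euclidean distance, and Pearson correlation over the upper triangle) is an illustrative assumption, not the paper's actual method:

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X (n_samples, dim)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding

def geometric_stability(X_before, X_after):
    """Correlate the upper-triangular pairwise distances of two snapshots.

    Returns a value in [-1, 1]; values near 1 mean the distance
    structure (the representation's geometry) is preserved.
    """
    iu = np.triu_indices(X_before.shape[0], k=1)
    d1 = pairwise_distances(X_before)[iu]
    d2 = pairwise_distances(X_after)[iu]
    return float(np.corrcoef(d1, d2)[0, 1])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))
# A rotation plus uniform scaling preserves distance *structure*,
# so the stability score stays near 1 ...
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
stable = geometric_stability(X, 2.0 * X @ Q)
# ... while replacing the embedding with independent noise destroys
# the structure, driving the score toward 0.
drifted = geometric_stability(X, rng.normal(size=(50, 16)))
```

A score like this needs no labels, which matches the paper's framing of unsupervised stability as a post-deployment drift monitor; the supervised Shesha variants would additionally condition this comparison on task structure.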
Abstract

Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. Supervised Shesha variants that measure task-aligned geometric stability predict linear steerability with near-perfect accuracy (ρ = 0.89–0.97) across 35–69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial ρ = 0.62–0.76). A critical dissociation emerges: unsupervised stability fails entirely for steering on real-world tasks (ρ ≈ 0.10), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly 2× greater geometric change than CKA during post-training alignment (up to 5.23× in Llama) while providing earlier warning in 73% of models and maintaining a 6× lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
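The CKA baseline the abstract compares against is a standard representation-similarity measure. As a reference point for why distance-based stability can register larger change, here is a minimal linear-CKA sketch (the paper may use a kernelized or debiased variant; this specific implementation is an assumption):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations of the same inputs
    (rows aligned). Near 1 = matching similarity structure, near 0 = unrelated."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / norm)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
cka_same = linear_cka(X, 2.0 * X @ Q)          # rotation + scaling: CKA stays ~1
cka_diff = linear_cka(X, rng.normal(size=(40, 8)))  # unrelated reps: CKA drops
```

Because CKA is invariant to rotations and isotropic scaling of either representation, it can stay flat under transformations that a monitoring signal might still want to flag, one plausible reason a pairwise-distance stability score can show larger sensitivity to post-training geometric change.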