Thinking in Different Spaces: Domain-Specific Latent Geometry Survives Cross-Architecture Translation

arXiv cs.LG / 2026/3/24


Key Points

  • The paper tests whether independently trained language models develop geometrically compatible latent representations and whether those can be used to correct behavior at inference time without updating weights.
  • It learns a linear projection that maps teacher activations into a student model’s latent coordinate system, then intervenes by substituting the student residual stream with the translated teacher state during generation.
  • Across 20 heterogeneous teacher–student pairings (spanning mixture-of-experts, dense, code-specialized, and synthetically trained models), the Ridge projection fits with R^2 = 0.50 on verbal reasoning and R^2 = 0.40 on mathematical reasoning, while the controls collapse: R^2 = -0.22 under permutation and R^2 = 0.01 under L_1 regularization. Behavioral correction rates average 25.2% on TruthfulQA and 25.5% on GSM8K.
  • The study finds near-zero correlation between projection fit quality and behavioral correction rate (r = -0.07). Intervention sensitivity is architecture-specific and can invert across domains: the most steerable verbal student becomes the least steerable mathematical student.
  • A double-dissociation experiment across all 20 pairings shows that projection matrices collapse catastrophically when transferred between reasoning domains (mean R^2 = -3.83 in both directions), which the authors take as evidence that domain-specific subspace geometry is a universal property of LMs.
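
The projection-and-control setup in the points above can be illustrated with a minimal sketch. It is not the paper's code: the activations are synthetic Gaussian stand-ins, the dimensions, noise level, and `alpha` are arbitrary assumptions, and the helper names (`fit_ridge`, `r2_score`, `translate_teacher_state`) are hypothetical. It fits a closed-form ridge map from teacher to student activations, scores it on held-out pairs, and repeats the fit with shuffled pairings as a permutation control.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for paired activations (sizes are assumptions).
n, d_teacher, d_student = 400, 32, 16
X = rng.normal(size=(n, d_teacher))               # "teacher" activations
W_true = rng.normal(size=(d_teacher, d_student))  # hidden linear relation
Y = X @ W_true + 2.0 * rng.normal(size=(n, d_student))  # "student" activations

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge without intercept: W = (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def r2_score(Y_true, Y_pred):
    """Pooled R^2 = 1 - SS_res / SS_tot across all output dimensions."""
    ss_res = np.sum((Y_true - Y_pred) ** 2)
    ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

n_train = 300
X_tr, Y_tr = X[:n_train], Y[:n_train]
X_te, Y_te = X[n_train:], Y[n_train:]

# Real pairing: teacher row i is matched with student row i.
W = fit_ridge(X_tr, Y_tr)
r2_real = r2_score(Y_te, X_te @ W)

# Permutation control: break the pairing by shuffling the student rows.
perm = rng.permutation(n_train)
W_perm = fit_ridge(X_tr, Y_tr[perm])
r2_perm = r2_score(Y_te, X_te @ W_perm)

def translate_teacher_state(teacher_vec, W):
    """Map one teacher activation into student coordinates.
    In a real model this vector would overwrite the student's residual
    stream at the chosen layer; here it is only computed."""
    return teacher_vec @ W

patched_state = translate_teacher_state(X_te[0], W)
print(f"R^2 real pairing: {r2_real:.2f}, permuted pairing: {r2_perm:.2f}")
```

On this toy data the permuted fit has no real signal to recover, so its held-out R^2 should sit near or below zero while the correctly paired fit stays high. That gap is the same qualitative contrast the permutation control is meant to expose.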

Abstract

We investigate whether independently trained language models converge to geometrically compatible latent representations, and whether this compatibility can be exploited to correct model behavior at inference time without any weight updates. We learn a linear projection matrix that maps activation vectors from a large teacher model into the coordinate system of a smaller student model, then intervene on the student's residual stream during generation by substituting its internal state with the translated teacher representation. Across a fully crossed experimental matrix of 20 heterogeneous teacher-student pairings spanning mixture-of-experts, dense, code-specialized, and synthetically trained architectures, the Ridge projection consistently achieves R^2 = 0.50 on verbal reasoning and R^2 = 0.40 on mathematical reasoning, collapsing to R^2 = -0.22 under permutation control and R^2 = 0.01 under L_1 regularization. Behavioral correction rates range from 14.0% to 50.0% on TruthfulQA (mean 25.2%) and from 8.5% to 43.3% on GSM8K arithmetic reasoning (mean 25.5%), demonstrating that the method generalizes across fundamentally different reasoning domains. We report a near-zero correlation between geometric alignment quality and behavioral correction rate (r = -0.07), revealing a dissociation between representation space fidelity and output space impact. Intervention strength is architecture-specific: student models exhibit characteristic sensitivity profiles that invert across domains, with the most steerable verbal student becoming the least steerable mathematical student. Finally, a double dissociation experiment conducted across all 20 model pairings confirms without exception that projection matrices collapse catastrophically when transferred across reasoning domains (mean R^2 = -3.83 in both transfer directions), establishing domain-specific subspace geometry as a universal property of LMs.
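
The negative R^2 values reported for cross-domain transfer can look surprising, so the following sketch shows how they arise. It is an illustration under stated assumptions, not the paper's procedure: the two "domains" are synthetic datasets whose teacher-to-student relation follows independent random linear maps, and all names and sizes are hypothetical. Because R^2 = 1 - SS_res / SS_tot, a projection that predicts worse than the per-dimension mean scores below zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_teacher, d_student = 300, 24, 12
n_train = 200

def make_domain():
    """Paired activations linked by a domain-specific linear map (synthetic)."""
    W_dom = rng.normal(size=(d_teacher, d_student))
    X = rng.normal(size=(n, d_teacher))
    Y = X @ W_dom + rng.normal(size=(n, d_student))
    return X, Y

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge without intercept."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def r2_score(Y_true, Y_pred):
    """Pooled R^2; negative when predictions are worse than the mean."""
    ss_res = np.sum((Y_true - Y_pred) ** 2)
    ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

# Two hypothetical domains, standing in for "verbal" and "math".
X_a, Y_a = make_domain()
X_b, Y_b = make_domain()

W_a = fit_ridge(X_a[:n_train], Y_a[:n_train])
W_b = fit_ridge(X_b[:n_train], Y_b[:n_train])

# In-domain: evaluate each projection on held-out data from its own domain.
r2_a_on_a = r2_score(Y_a[n_train:], X_a[n_train:] @ W_a)
r2_b_on_b = r2_score(Y_b[n_train:], X_b[n_train:] @ W_b)

# Cross-domain: apply each projection to the other domain's held-out data.
r2_a_on_b = r2_score(Y_b[n_train:], X_b[n_train:] @ W_a)
r2_b_on_a = r2_score(Y_a[n_train:], X_a[n_train:] @ W_b)

print(f"in-domain:    A->A {r2_a_on_a:.2f}, B->B {r2_b_on_b:.2f}")
print(f"cross-domain: A->B {r2_a_on_b:.2f}, B->A {r2_b_on_a:.2f}")
```

In this toy setting both in-domain fits are strong and both transfer directions fall below zero, which is the pattern a double dissociation looks for. How negative the transferred score gets depends on the relative scale of the two domains' activations, so the sketch is not expected to reproduce the paper's -3.83.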