Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

arXiv cs.LG / 5/4/2026


Key Points

  • The paper proposes “Borrowed Geometry”: frozen Gemma 4 31B transformer weights, pretrained only on text tokens, are reused unmodified across modalities through a thin trainable interface (see the sketch after this list).
  • On OGBench robotic manipulation (scene-play-singletask-task1-v0), the frozen-weight approach achieves a +4.33pt gain over published GCIQL at n=3 (std 0.74), reported as a new published SOTA win on a task the substrate was never trained for.
  • On D4RL Walker2d-medium-v2, it matches Decision Transformer performance (76.2±0.8 at n=3) while using only 0.43× as many trainable parameters as DT, with the frozen substrate compressed to a 5-layer slice.
  • In associative recall, the frozen slice plus a 113K-parameter linear interface attains a per-bit error of 0.0505 (n=2), outperforming a scratch-trained transformer of matched capacity by 8.7× under the same protocol.
  • Control experiments and a dual-measurement protocol (text-activation probing plus task ablation on a non-language target) argue that the effect is not due to architecture alone, and identify specific attention heads (e.g., head L26.28) as critical under both measurement schemes.
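
The thin-interface mechanism in the first bullet is easy to state in code. Below is a minimal sketch assuming a PyTorch-style setup: a stand-in frozen backbone (in the paper, a slice of text-pretrained Gemma 4 31B) wrapped by a small trainable input projection and readout. All module names, dimensions, and the learning rate are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class BorrowedGeometryProbe(nn.Module):
    """Frozen pretrained substrate + thin trainable interface (illustrative sketch)."""

    def __init__(self, backbone: nn.Module, in_dim: int, model_dim: int, out_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)                  # freeze the pretrained weights
        self.encode = nn.Linear(in_dim, model_dim)   # trainable: new modality -> backbone space
        self.decode = nn.Linear(model_dim, out_dim)  # trainable: backbone space -> task output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encode(x)    # project the non-text input into the frozen model's space
        h = self.backbone(h)  # frozen weights; gradients still flow back to the interface
        return self.decode(h)

# Stand-in backbone for illustration only; the paper reuses a frozen Gemma slice instead.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
model = BorrowedGeometryProbe(backbone, in_dim=16, model_dim=64, out_dim=8)

# Only the interface parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Freezing via requires_grad (rather than wrapping the backbone in torch.no_grad) is the point of the design: gradients still flow through the frozen weights back to the input projection, so the interface can learn to map the new modality into the geometry the text-pretrained substrate already provides.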

Abstract

Frozen Gemma 4 31B weights, pretrained exclusively on text tokens and left unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: +4.33pt over published GCIQL at n=3 with std 0.74, a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision Transformer parity (76.2 ± 0.8, n=3) at 0.43× DT's trainable parameter count, with the frozen substrate compressing to a 5-layer slice (+1.66pt over the 6-layer baseline at n=3). (3) Associative recall as the cleanest pretraining-load-bearing case: the frozen slice plus a 113K-parameter linear interface reaches an L30 best-checkpoint per-bit error of 0.0505 (n=2); a 6.36M-parameter transformer trained from scratch at matched capacity (1/√d_k scaling, two seeds, LR sweep) cannot solve the task at all under the same protocol (best L30 = 0.4395), an 8.7× advantage.

Architecture-alone falsifications: a frozen random transformer with correct 1/√d_k scaling stays at random-chance loss for 50k steps, and a random-init Gemma slice fails OGBench cube-double-play-task1 entirely (0.89% across n=3, where the pretrained slice reaches 60%).

A dual-measurement protocol, text-activation probing on 95 English sentences plus task ablation on a non-language target, identifies individual heads that are independently flagged by both measurements: head L26.28 scores 3.7× the slice mean for English token-copying and is the #2 most-critical head for binary copy ablation (ΔL30 = +0.221); three further heads (L27.28, L27.2, L27.3) qualify under the same criteria.

The mechanism is single-model and the cross-modality results are single-task within their respective benchmarks; cross-model replication is structurally constrained because Gemma 4 31B is the only model on the small-scale Pareto frontier as of April 2026.
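
Two quantities carry the associative-recall claims above: the per-bit error (0.0505 frozen vs. 0.4395 from scratch) and the per-head ablation delta (ΔL30 = +0.221 for head L26.28). Below is a minimal sketch of how both could be computed, assuming per-head attention outputs are concatenated along the last dimension before an output projection; the module path model.blocks[layer].attn.out_proj and the eval_fn callable are hypothetical stand-ins, not the paper's code.

```python
import torch

def per_bit_error(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of incorrectly predicted bits: logits shaped (..., 2), targets in {0, 1}."""
    preds = logits.argmax(dim=-1)
    return (preds != targets).float().mean().item()

def head_ablation_delta(model, eval_fn, layer: int, head: int, d_head: int) -> float:
    """Zero one attention head's output slice and return the change in the task metric.

    Assumes the per-head outputs are concatenated along the last dimension of the
    input to model.blocks[layer].attn.out_proj; that layout and the module path
    are assumptions to be adapted to the actual architecture.
    """
    base = eval_fn(model)  # metric (e.g. per-bit error) with the intact model

    def zero_head(_module, inputs):
        (x,) = inputs
        x = x.clone()
        x[..., head * d_head:(head + 1) * d_head] = 0.0  # knock out this head's contribution
        return (x,)

    handle = model.blocks[layer].attn.out_proj.register_forward_pre_hook(zero_head)
    try:
        ablated = eval_fn(model)  # metric with the head removed
    finally:
        handle.remove()
    return ablated - base  # positive delta means the head was load-bearing for the task
```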