Abstract
Frozen Gemma 4 31B weights, pretrained exclusively on text tokens and left unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: +4.33 pt over published GCIQL ($n=3$, std 0.74), a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision-Transformer parity ($76.2 \pm 0.8$, $n=3$) at $0.43\times$ DT's trainable-parameter count, with the frozen substrate compressing to a 5-layer slice (+1.66 pt over the 6-layer baseline at $n=3$). (3) Associative recall as the cleanest case of pretraining bearing the load: the frozen slice plus a 113K-parameter linear interface reaches an L30 best-checkpoint per-bit error of 0.0505 ($n=2$), while a 6.36M-parameter transformer trained from scratch at matched capacity ($1/\sqrt{d_k}$ scaling, two seeds, LR sweep) cannot solve the task at all under the same protocol (best L30 = 0.4395), an $8.7\times$ advantage. Architecture-alone falsifications: a frozen random transformer with correct $1/\sqrt{d_k}$ scaling stays at random-chance loss for 50k steps, and a randomly initialized Gemma slice fails OGBench cube-double-play-task1 entirely (0.89% across $n=3$ where the pretrained slice reaches 60%). A dual-measurement protocol, combining text-activation probing on 95 English sentences with task ablation on a non-language target, identifies individual heads that are independently flagged by both measurements: head L26.28 scores $3.7\times$ the slice mean for English token-copying and is the #2 most-critical head under binary copy ablation ($\Delta$L30 = +0.221); three further heads (L27.28, L27.2, L27.3) qualify under the same protocol. The mechanism is demonstrated on a single model, and the cross-modality results are single-task within their respective benchmarks; cross-model replication is structurally constrained because Gemma 4 31B is the only model on the small-scale Pareto frontier as of April 2026.
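For concreteness, the sketch below illustrates the kind of frozen-substrate-plus-thin-interface setup the results above refer to. It is a minimal illustration, not the paper's implementation: the class name FrozenSliceWithInterface, the widths D_IN, D_MODEL, D_OUT, and the assumption that the slice accepts continuous embeddings of shape (batch, seq, D_MODEL) are all placeholders introduced here; the only part taken from the abstract is that the pretrained weights stay frozen while a small linear input/output interface trains.

```python
# Minimal sketch (not the paper's code) of the frozen-substrate + thin-interface
# setup: a pretrained slice is frozen, and only two small linear maps train.
# Assumptions: `substrate` is any nn.Module mapping (batch, seq, D_MODEL) ->
# (batch, seq, D_MODEL); D_IN / D_MODEL / D_OUT are illustrative placeholders.
import torch
import torch.nn as nn

D_IN, D_MODEL, D_OUT = 16, 512, 1  # placeholder widths, not the paper's values


class FrozenSliceWithInterface(nn.Module):
    def __init__(self, substrate: nn.Module):
        super().__init__()
        self.substrate = substrate
        for p in self.substrate.parameters():
            p.requires_grad_(False)          # freeze every pretrained weight
        self.embed = nn.Linear(D_IN, D_MODEL)     # trainable input interface
        self.readout = nn.Linear(D_MODEL, D_OUT)  # trainable output interface

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, D_IN) non-text observations or bit tokens
        h = self.embed(x)          # project into the slice's embedding space
        h = self.substrate(h)      # frozen, text-pretrained layers
        return self.readout(h)     # per-position prediction


def trainable_params(model: nn.Module) -> int:
    """Count only the parameters the optimizer will actually update."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

With placeholder widths like these, trainable_params counts only the two interface maps (a few thousand weights); the 113K and 6.36M figures in the abstract are comparisons of exactly this quantity, not of total parameter count.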