Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

arXiv cs.LG / 5/4/2026


Key Points

  • The paper proposes “Borrowed Geometry”: frozen Gemma 4 31B transformer weights, pretrained only on text tokens, are reused unmodified across modalities through a thin trainable interface (see the sketch after this list).
  • On OGBench robotic manipulation (scene-play-singletask-task1-v0), the frozen-weight approach achieves a +4.33pt gain over published GCIQL at n=3 (std 0.74), reported as a new published SOTA win on a task the substrate was never trained for.
  • On D4RL Walker2d-medium-v2, it matches Decision Transformer performance (76.2±0.8 at n=3) while using only 0.43× as many trainable parameters as DT, with the frozen substrate compressed to a 5-layer slice.
  • In associative recall, the frozen slice plus a 113K-parameter linear interface attains a per-bit error of 0.0505 (n=2), outperforming a scratch-trained transformer of matched capacity by 8.7× under the same protocol.
  • Control experiments and a dual-measurement protocol (text-activation probing plus task ablation on a non-language target) argue that the effect is not due to architecture alone, and identify specific attention heads (e.g., head L26.28) as critical under both measurement schemes.
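
The thin-interface mechanism in the first bullet is easy to state in code. Below is a minimal sketch assuming a PyTorch-style setup: a stand-in frozen backbone (in the paper, a slice of text-pretrained Gemma 4 31B) wrapped by a small trainable input projection and readout. All module names, dimensions, and the learning rate are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class BorrowedGeometryProbe(nn.Module):
    """Frozen pretrained substrate + thin trainable interface (illustrative sketch)."""

    def __init__(self, backbone: nn.Module, in_dim: int, model_dim: int, out_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)                  # freeze the pretrained weights
        self.encode = nn.Linear(in_dim, model_dim)   # trainable: new modality -> backbone space
        self.decode = nn.Linear(model_dim, out_dim)  # trainable: backbone space -> task output

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encode(x)    # project the non-text input into the frozen model's space
        h = self.backbone(h)  # frozen weights; gradients still flow back to the interface
        return self.decode(h)

# Stand-in backbone for illustration only; the paper reuses a frozen Gemma slice instead.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
model = BorrowedGeometryProbe(backbone, in_dim=16, model_dim=64, out_dim=8)

# Only the interface parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Freezing via requires_grad (rather than wrapping the backbone in torch.no_grad) is the point of the design: gradients still flow through the frozen weights back to the input projection, so the interface can learn to map the new modality into the geometry the text-pretrained substrate already provides.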

Abstract

Frozen Gemma 4 31B weights, pretrained exclusively on text tokens and left unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: +4.33pt over published GCIQL at n=3 with std 0.74, a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision Transformer parity (76.2 ± 0.8, n=3) at 0.43× DT's trainable parameter count, with the frozen substrate compressing to a 5-layer slice (+1.66pt over the 6-layer baseline at n=3). (3) Associative recall as the cleanest pretraining-load-bearing case: the frozen slice plus a 113K-parameter linear interface reaches an L30 best-checkpoint per-bit error of 0.0505 (n=2); a 6.36M-parameter transformer trained from scratch at matched capacity (1/√d_k scaling, two seeds, LR sweep) cannot solve the task at all under the same protocol (best L30 = 0.4395), an 8.7× advantage.

Architecture-alone falsifications: a frozen random transformer with correct 1/√d_k scaling stays at random-chance loss for 50k steps, and a random-init Gemma slice fails OGBench cube-double-play-task1 entirely (0.89% across n=3, where the pretrained slice reaches 60%).

A dual-measurement protocol, text-activation probing on 95 English sentences plus task ablation on a non-language target, identifies individual heads that are independently flagged by both measurements: head L26.28 scores 3.7× the slice mean for English token-copying and is the #2 most-critical head for binary copy ablation (ΔL30 = +0.221); three further heads (L27.28, L27.2, L27.3) qualify under the same criteria.

The mechanism is single-model and the cross-modality results are single-task within their respective benchmarks; cross-model replication is structurally constrained because Gemma 4 31B is the only model on the small-scale Pareto frontier as of April 2026.
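
Two quantities carry the associative-recall claims above: the per-bit error (0.0505 frozen vs. 0.4395 from scratch) and the per-head ablation delta (ΔL30 = +0.221 for head L26.28). Below is a minimal sketch of how both could be computed, assuming per-head attention outputs are concatenated along the last dimension before an output projection; the module path model.blocks[layer].attn.out_proj and the eval_fn callable are hypothetical stand-ins, not the paper's code.

```python
import torch

def per_bit_error(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Fraction of incorrectly predicted bits: logits shaped (..., 2), targets in {0, 1}."""
    preds = logits.argmax(dim=-1)
    return (preds != targets).float().mean().item()

def head_ablation_delta(model, eval_fn, layer: int, head: int, d_head: int) -> float:
    """Zero one attention head's output slice and return the change in the task metric.

    Assumes the per-head outputs are concatenated along the last dimension of the
    input to model.blocks[layer].attn.out_proj; that layout and the module path
    are assumptions to be adapted to the actual architecture.
    """
    base = eval_fn(model)  # metric (e.g. per-bit error) with the intact model

    def zero_head(_module, inputs):
        (x,) = inputs
        x = x.clone()
        x[..., head * d_head:(head + 1) * d_head] = 0.0  # knock out this head's contribution
        return (x,)

    handle = model.blocks[layer].attn.out_proj.register_forward_pre_hook(zero_head)
    try:
        ablated = eval_fn(model)  # metric with the head removed
    finally:
        handle.remove()
    return ablated - base  # positive delta means the head was load-bearing for the task
```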