Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT

arXiv cs.CL · 31 March 2026


Key Points

  • The paper proposes compressing Transformer language models by decomposing weight matrices with Matrix Product Operator (MPO) factorization, using the bond dimension \(\chi\) to control approximation quality.
  • Using PicoGPT (a ~1M-parameter GPT-2-style character model), the authors replace every nn.Linear layer with an MPOLinear module parameterized as an MPO chain and train it with standard PyTorch autograd (no custom backward pass).
  • They compare initialization strategies (TT-SVD from pretrained dense weights vs. random) and evaluate bond dimensions \(\chi \in \{4, 8, 16, 32\}\) on Tiny Shakespeare, deriving balanced factorization schemes for the five distinct weight shapes in the model.
  • Results show up to ~13x compression per transformer block at \(\chi=4\), and at \(\chi=16\) the model retains 97.7% of baseline token accuracy while using far fewer parameters (191,872 vs 1,020,224).
  • The \(\chi=8\) configuration achieves the best accuracy-per-parameter tradeoff, exceeding the dense baseline on that metric by ~2.7x, which supports MPO parameterization as a practical alternative to low-rank methods and unstructured pruning.
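The compression figures above follow directly from the MPO core shapes: a chain whose core \(k\) has shape \((r_{k-1}, m_k, n_k, r_k)\), with boundary bonds \(r_0 = r_N = 1\) and internal bonds capped at \(\chi\). A minimal Python sketch of the parameter count (the factor shapes below are illustrative, not PicoGPT's actual dimensions):

```python
def mpo_param_count(out_factors, in_factors, chi):
    """Parameters in an MPO chain whose core k has shape
    (r_{k-1}, m_k, n_k, r_k), with boundary bonds r_0 = r_N = 1
    and every internal bond equal to chi."""
    n_sites = len(out_factors)
    total = 0
    for k in range(n_sites):
        r_left = 1 if k == 0 else chi
        r_right = 1 if k == n_sites - 1 else chi
        total += r_left * out_factors[k] * in_factors[k] * r_right
    return total

# Hypothetical 64 -> 256 linear layer, factored as 64 = 8*8, 256 = 16*16.
dense = 64 * 256                                   # 16,384 parameters
mpo = mpo_param_count([16, 16], [8, 8], chi=4)     # 1,024 parameters
print(dense, mpo, dense / mpo)                     # 16x compression here
```

For a two-site scheme the count grows only linearly in \(\chi\) (it is \(\chi \cdot (m_1 n_1 + m_2 n_2)\)), which is why small bond dimensions yield the large per-block compression ratios reported above.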

Abstract

Transformer-based language models achieve strong performance across NLP tasks, but their quadratic parameter scaling with hidden dimension makes deployment on resource-constrained hardware expensive. We study Matrix Product Operator (MPO) decomposition as a principled compression method for transformers. MPO factorises weight matrices into chains of low-rank cores, with approximation quality controlled by the bond dimension \(\chi\). We replace every nn.Linear layer in PicoGPT, a GPT-2-style character-level language model with about 1M parameters, with an MPOLinear module parameterised as an MPO chain. Cores are initialised either by TT-SVD from pretrained dense weights or from random initialisation, and trained using standard PyTorch autograd without a custom backward pass. We derive balanced factorisation schemes for the five distinct weight shapes in PicoGPT and evaluate bond dimensions \(\chi \in \{4, 8, 16, 32\}\) on Tiny Shakespeare. MPO compression achieves up to 13x compression per transformer block at \(\chi = 4\). At \(\chi = 16\), the model uses 191,872 parameters instead of 1,020,224 while retaining 97.7% of baseline token accuracy (51.6% vs 52.8%). Reconstruction error follows the expected trend and is lower for three-site than two-site factorisations at the same bond dimension. The \(\chi = 8\) model gives the best accuracy per parameter, exceeding the dense baseline by 2.7x on this metric. These results show that MPO parameterisation is a practical and theoretically grounded alternative to low-rank methods and unstructured pruning for transformer compression.
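The TT-SVD initialisation described in the abstract can be sketched for the two-site case: reshape the dense weight so that paired row/column factors sit adjacent, then truncate an SVD at bond dimension \(\chi\). A NumPy sketch under assumed factor shapes (not the paper's actual code; `tt_svd_two_site` is a hypothetical helper name):

```python
import numpy as np

def tt_svd_two_site(W, m, p, n, q, chi):
    """Factor a dense (m*p) x (n*q) weight into two MPO cores
    A: (m, n, chi) and B: (chi, p, q) via a truncated SVD."""
    # Reshape to (m, p, n, q), then group (m, n) as rows and (p, q) as
    # columns so the SVD cut corresponds to the MPO bond.
    T = W.reshape(m, p, n, q).transpose(0, 2, 1, 3).reshape(m * n, p * q)
    U, S, Vt = np.linalg.svd(T, full_matrices=False)
    r = min(chi, len(S))                         # truncate to bond dimension
    A = (U[:, :r] * S[:r]).reshape(m, n, r)      # singular values absorbed left
    B = Vt[:r].reshape(r, p, q)
    return A, B

def reconstruct(A, B, m, p, n, q):
    """Contract the two cores back into a dense (m*p) x (n*q) matrix."""
    T = np.einsum('mnr,rpq->mnpq', A, B)
    return T.transpose(0, 2, 1, 3).reshape(m * p, n * q)

# Illustrative 64x64 weight with all factors equal to 8.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
for chi in (4, 16, 64):
    A, B = tt_svd_two_site(W, 8, 8, 8, 8, chi)
    err = np.linalg.norm(reconstruct(A, B, 8, 8, 8, 8) - W)
    print(chi, err)
```

Because the truncated SVD is the optimal Frobenius-norm approximation at each bond, the reconstruction error shrinks monotonically as \(\chi\) grows, matching the trend the abstract reports.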