Compressing Transformer Language Models via Matrix Product Operator Decomposition: A Case Study on PicoGPT
arXiv cs.CL / March 31, 2026
Key Points
- The paper proposes compressing Transformer language models by decomposing weight matrices with Matrix Product Operator (MPO) factorization, using the bond dimension \(\chi\) to control approximation quality.
- Using PicoGPT (a ~1M-parameter GPT-2-style character model), the authors replace every `nn.Linear` layer with an `MPOLinear` module parameterized as an MPO chain and train it with standard PyTorch autograd (no custom backward pass).
- They compare initialization strategies (TT-SVD from pretrained dense weights vs random) and evaluate multiple \(\chi\) values (4, 8, 16, 32) on Tiny Shakespeare across different factorization schemes tied to distinct weight shapes in the model.
- Results show up to ~13x compression per transformer block at \(\chi=4\), and at \(\chi=16\) the model retains 97.7% of baseline token accuracy while using far fewer parameters (191,872 vs 1,020,224).
- The \(\chi=8\) configuration achieves the best accuracy-per-parameter tradeoff, roughly 2.7x better than the dense baseline on that metric, supporting MPO parameterization as a practical alternative to low-rank factorization and unstructured pruning.
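The compression in the points above comes from replacing one dense weight matrix with a chain of small cores whose shared bond dimension \(\chi\) caps the approximation rank. As a minimal sketch (not the paper's exact `MPOLinear` implementation), here is a two-core MPO factorization via truncated SVD in NumPy; the function names, the two-core split, and the index grouping are illustrative assumptions:

```python
import numpy as np

def mpo_decompose(W, in_dims, out_dims, chi):
    """Two-core MPO (TT-SVD style) factorization of a dense matrix W.

    in_dims = (i1, i2) with i1 * i2 == W.shape[0]
    out_dims = (j1, j2) with j1 * j2 == W.shape[1]
    chi caps the bond dimension connecting the two cores.
    """
    i1, i2 = in_dims
    j1, j2 = out_dims
    assert W.shape == (i1 * i2, j1 * j2)
    # View W as a 4-index tensor and group (i1, j1) against (i2, j2).
    T = W.reshape(i1, i2, j1, j2).transpose(0, 2, 1, 3)
    M = T.reshape(i1 * j1, i2 * j2)
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    r = min(chi, len(S))  # truncate to the bond dimension
    # Split the singular values evenly between the two cores.
    G1 = (U[:, :r] * np.sqrt(S[:r])).reshape(i1, j1, r)
    G2 = (np.sqrt(S[:r])[:, None] * Vt[:r]).reshape(r, i2, j2)
    return G1, G2

def mpo_contract(G1, G2):
    """Rebuild a dense matrix from the two MPO cores."""
    i1, j1, r = G1.shape
    _, i2, j2 = G2.shape
    # Sum over the bond index, restoring the [i1, i2, j1, j2] layout.
    T = np.einsum('ija,akl->ikjl', G1, G2)
    return T.reshape(i1 * i2, j1 * j2)
```

For a 16x16 matrix split as (4, 4) x (4, 4), the dense layer holds 256 parameters, while \(\chi=4\) cores hold 4·4·4 + 4·4·4 = 128, illustrating how shrinking \(\chi\) trades reconstruction fidelity for parameter count; with an untruncated bond the reconstruction is exact, which is the basis of the TT-SVD initialization from pretrained dense weights.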
