TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling

arXiv cs.CL / 3/30/2026


Key Points

  • The paper introduces TernaryLM, a 132M-parameter transformer trained from scratch using native ternary quantization {-1, 0, +1}, targeting large memory savings for resource-constrained deployment.
  • It avoids post-training quantization by using quantization-aware training from initialization with straight-through estimators and adaptive per-layer scaling factors to preserve language modeling quality.
  • Experiments on TinyStories report stable performance (validation perplexity 58.42 ± 0.17 across seeds), while downstream transfer on MRPC reaches 82.47% F1 and outperforms DistilBERT despite far less pretraining data.
  • The model achieves about a 2.4× memory reduction versus an FP32 baseline (498 MB vs 1,197 MB) with latency parity, indicating practical efficiency rather than just academic compression.
  • Layer-wise analysis finds middle layers (L5–L9) reach higher effective ternary sparsity (60–62%) than boundary layers (45–55%), suggesting non-uniform precision allocation as a design principle; code and trained models are released on GitHub.
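The core recipe described above (ternarize weights with an adaptive per-layer scale, train with straight-through gradients, measure how many weights collapse to zero) can be sketched in a few lines. The threshold rule `delta = 0.7 * mean(|w|)` and the scale formula below follow the classic Ternary Weight Networks scheme; they are illustrative assumptions, not necessarily the paper's exact formulas.

```python
import numpy as np

def ternarize(w, threshold_factor=0.7):
    """Ternarize a weight matrix with an adaptive per-layer scale.

    delta = threshold_factor * mean(|w|) decides which weights survive;
    alpha = mean(|w|) over the surviving weights is the per-layer scale.
    (TWN-style recipe -- the paper's exact scheme may differ.)
    """
    delta = threshold_factor * np.mean(np.abs(w))
    q = np.where(w > delta, 1, np.where(w < -delta, -1, 0)).astype(np.int8)
    mask = q != 0
    alpha = float(np.abs(w[mask]).mean()) if mask.any() else 0.0
    sparsity = 1.0 - float(mask.mean())  # fraction quantized to exactly 0
    return q, alpha, sparsity

# During quantization-aware training, the forward pass would use
# alpha * q in place of w, while the straight-through estimator copies
# the gradient w.r.t. (alpha * q) back onto the latent float weights w.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, alpha, sparsity = ternarize(w)
print(f"scale={alpha:.3f}, sparsity={sparsity:.2%}")
```

On a Gaussian-initialized matrix this yields roughly 40% zeros, which is the "effective ternary sparsity" the layer-wise analysis tracks per layer.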

Abstract

Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M-parameter transformer trained natively with ternary quantization {-1, 0, +1} (log₂(3) ≈ 1.58-bit effective precision), achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories with a cross-seed standard deviation of ±0.17 PPL, confirming stable optimization; (2) strong downstream transfer with 82.47% F1 on MRPC, surpassing DistilBERT despite using 55× less pretraining data; (3) 2.4× memory reduction (498 MB vs 1,197 MB for an FP32 model of identical architecture) with latency parity; and (4) an implicit regularization effect whereby the ternary constraint yields a train/val ratio of 1.05× versus 3.51× for the FP32 baseline, demonstrating that discrete weights prevent overfitting on small corpora. We provide layer-wise sparsity analysis revealing that middle transformer layers (L5-L9) achieve 60-62% quantization sparsity versus 45-55% for boundary layers, establishing an actionable design principle for non-uniform precision allocation. Our implementation and trained models are publicly available at https://github.com/1nisharg/TernaryLM-Memory-Efficient-Language-Modeling.
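The 1.58-bit figure in the abstract is the information content of a three-valued weight, log₂(3) ≈ 1.585 bits. One concrete way to approach that bound in storage is byte-packing: since 3⁵ = 243 ≤ 256, five ternary weights fit in one byte (1.6 bits per weight). The round-trip sketch below is a generic packing scheme, not necessarily the released code's storage format; the gap between this theoretical ~20× compression over FP32 weights and the reported 2.4× end-to-end saving plausibly reflects components kept at higher precision (embeddings, per-layer scales, runtime buffers).

```python
def pack_trits(weights):
    """Pack ternary weights {-1, 0, +1} into bytes, 5 trits per byte
    (base-3 encoding; 3**5 = 243 fits in one byte)."""
    packed = []
    for i in range(0, len(weights), 5):
        value = 0
        for w in reversed(weights[i:i + 5]):
            value = value * 3 + (w + 1)  # map {-1, 0, +1} -> {0, 1, 2}
        packed.append(value)
    return bytes(packed)

def unpack_trits(packed, n):
    """Inverse of pack_trits: recover the first n ternary weights."""
    weights = []
    for byte in packed:
        for _ in range(5):
            weights.append(byte % 3 - 1)
            byte //= 3
    return weights[:n]

w = [-1, 0, 1, 1, -1, 0, 0, 1]
packed = pack_trits(w)
print(len(packed), unpack_trits(packed, len(w)))  # 2 bytes for 8 weights
```

At 1.6 bits per weight, the 132M ternary parameters alone would occupy roughly 26 MB, versus about 528 MB in FP32.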