TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling
arXiv cs.CL · March 30, 2026
Key Points
- The paper introduces TernaryLM, a 132M-parameter transformer trained from scratch using native ternary quantization {-1, 0, +1}, targeting large memory savings for resource-constrained deployment.
- It avoids post-training quantization by using quantization-aware training from initialization with straight-through estimators and adaptive per-layer scaling factors to preserve language modeling quality.
- Experiments on TinyStories report stable performance (validation perplexity 58.42 ± 0.17 across seeds), while downstream transfer on MRPC reaches 82.47% F1, outperforming DistilBERT despite using far less pretraining data.
- The model achieves about a 2.4× memory reduction versus an FP32 baseline (498 MB vs 1,197 MB) with latency parity, indicating practical efficiency rather than just academic compression.
- Layer-wise analysis finds middle layers (L5–L9) reach higher effective ternary sparsity (60–62%) than boundary layers (45–55%), suggesting non-uniform precision allocation as a design principle; code and trained models are released on GitHub.
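The quantization scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the thresholding rule (zeroing weights below a fraction of the layer's mean absolute weight), the `threshold_ratio=0.7` value, and the scale estimate are all assumptions. During quantization-aware training, the forward pass would use `alpha * t` in place of the full-precision weights, while the straight-through estimator copies gradients past the non-differentiable quantizer unchanged.

```python
def ternary_quantize(w, threshold_ratio=0.7):
    """Map a layer's weights to {-1, 0, +1} plus an adaptive per-layer scale.

    Illustrative sketch only: the mean-|w| threshold rule and
    threshold_ratio=0.7 are assumptions, not the paper's exact recipe.
    """
    # per-layer threshold: a fraction of the mean absolute weight
    delta = threshold_ratio * sum(abs(x) for x in w) / len(w)
    t = [1.0 if x > delta else (-1.0 if x < -delta else 0.0) for x in w]
    # adaptive scale: mean magnitude of the weights that survived quantization
    kept = [abs(x) for x, q in zip(w, t) if q != 0.0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return t, alpha

def effective_sparsity(t):
    """Fraction of ternary weights quantized to exactly zero --
    the layer-wise statistic reported as 45-62% in the summary."""
    return sum(1 for q in t if q == 0.0) / len(t)

# toy example on a small synthetic weight vector
weights = [0.9, -0.05, 0.02, -0.8, 0.4, -0.03, 0.6, 0.01]
t, alpha = ternary_quantize(weights)
print(t, alpha, effective_sparsity(t))
```

Computing `effective_sparsity` per layer is what surfaces the non-uniform pattern the authors report, with middle layers zeroing more weights than boundary layers. As a sanity check on the memory claim, the reported figures are internally consistent: 1,197 MB / 498 MB ≈ 2.40×.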