ITQ3_S: High-Fidelity 3-bit LLM Inference via Interleaved Ternary Quantization with Rotation-Domain Smoothing

arXiv cs.LG · March 31, 2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper introduces ITQ3_S, a new 3-bit weight quantization format for LLMs that combines interleaved ternary coding with TurboQuant-style rotation-domain smoothing using the Fast Walsh-Hadamard Transform (FWHT).
  • It argues conventional 3-bit quantization fails due to heavy-tailed weights and inter-channel outliers, and that pre-rotating with FWHT spreads outlier energy to produce a more near-Gaussian distribution suitable for uniform ternary quantization.
  • The authors provide an exact, mathematically rigorous dequantization method that inverts FWHT using a 256-point inverse transform fused into CUDA shared-memory loading, aiming for zero-error round-trip fidelity (within a bound determined by the ternary grid).
  • Experiments on an NVIDIA RTX 5090 report perplexity competitive with FP16 while achieving over 1.5× throughput versus 4-bit alternatives, attributed to optimized DP4A and Tensor Core scheduling in the interleaved layout.
  • Overall, ITQ3_S is positioned as a practical, high-fidelity quantization approach for deploying LLMs on consumer-grade hardware with strong quality–speed tradeoffs.
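The rotation-domain smoothing idea in the key points above can be illustrated with a minimal NumPy sketch (an illustrative reimplementation, not the authors' code): an orthonormal FWHT spreads a single large outlier's energy evenly across all 256 coordinates, sharply shrinking the peak-to-RMS ratio that a uniform quantization grid must cover.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Fast Walsh-Hadamard Transform (length must be a power of 2).

    With the 1/sqrt(n) scaling the transform is its own inverse:
    fwht(fwht(x)) == x.
    """
    x = x.astype(np.float64).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

# A 256-dim weight vector: a small Gaussian bulk plus one huge outlier channel.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=256)
w[7] = 4.0  # outlier channel

r = fwht(w)  # rotated weights

# After rotation the outlier's energy is shared by every coordinate, so the
# peak-to-RMS ratio (the dynamic range a uniform grid has to cover) collapses.
print(np.max(np.abs(w)) / np.std(w))  # large before rotation
print(np.max(np.abs(r)) / np.std(r))  # much smaller after rotation
```

Because the scaled transform is orthonormal, it preserves the 2-norm of the vector, which is what makes the later claim of an exact (rotation-free) error bound possible.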

Abstract

We present **ITQ3_S** (Interleaved Ternary Quantization – Specialized), a novel 3-bit weight quantization format for large language models (LLMs) that integrates **TurboQuant (TQ)**, a rotation-domain adaptive quantization strategy based on the Fast Walsh-Hadamard Transform (FWHT). Conventional 3-bit quantization methods suffer from catastrophic precision loss caused by heavy-tailed weight distributions and inter-channel outliers. ITQ3_S addresses this fundamental limitation by pre-rotating the weight space via FWHT prior to quantization, effectively spreading outlier energy across the entire vector and inducing a near-Gaussian distribution amenable to uniform ternary coding. Critically, we derive a mathematically rigorous dequantization procedure that inverts the FWHT exactly using a 256-point Inverse Walsh-Hadamard Transform fused into the CUDA shared-memory loading stage, ensuring zero-error round-trip fidelity between offline quantization and online inference. We prove that for any weight vector w ∈ ℝ^256 processed by our pipeline, the reconstruction satisfies ‖ŵ − w‖₂ ≤ ε_q, where ε_q is determined solely by the ternary quantization grid and is strictly smaller than any uniform 3-bit baseline under equal bit-budget constraints. Empirically, on the NVIDIA RTX 5090 (Blackwell architecture), ITQ3_S achieves perplexity competitive with FP16 baselines while delivering throughput exceeding 1.5× that of 4-bit alternatives, owing to optimized DP4A and Tensor Core scheduling in the interleaved memory layout. Our results establish ITQ3_S as a practical, mathematically grounded solution for high-fidelity LLM deployment on consumer-grade hardware.
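The abstract's round-trip guarantee (quantize in the rotated domain, dequantize, invert the rotation) can be sketched as follows. This toy uses a generic symmetric 7-level grid as a stand-in for the paper's interleaved ternary code, whose exact packing is not specified in the abstract; the point it demonstrates is that, because the orthonormal WHT preserves the 2-norm, the end-to-end error ‖ŵ − w‖₂ equals the rotated-domain rounding error and is bounded purely by the grid step.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Sylvester-Hadamard matrix (n must be a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(r: np.ndarray, levels: int = 3):
    """Uniform symmetric grid {-levels, ..., +levels} * scale.

    Hypothetical stand-in for the paper's 3-bit interleaved ternary code:
    7 of the 8 available 3-bit codes are used, plus one fp scale per vector.
    """
    scale = np.max(np.abs(r)) / levels
    codes = np.clip(np.round(r / scale), -levels, levels).astype(np.int8)
    return codes, scale

n = 256
H = hadamard(n)  # symmetric and orthonormal, so H @ H == I (self-inverse)

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.02, size=n)
w[3] = 2.0                            # outlier channel

r = H @ w                             # offline: rotate
codes, scale = quantize(r)            # offline: 3-bit codes + one scale
w_hat = H @ (codes * scale)           # online: dequantize, then inverse WHT

# The rotation itself contributes no error; what remains is the rotated-domain
# rounding error, at most scale/2 per coordinate, i.e. sqrt(n)*scale/2 in total.
err = np.linalg.norm(w_hat - w)
assert err <= np.sqrt(n) * scale / 2 + 1e-9
```

In the actual kernel, the paper fuses this inverse transform into CUDA shared-memory loading rather than materializing a dense 256×256 matrix; the matrix form here is only for clarity.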