AI Navigate

TTQ: Activation-Aware Test-Time Quantization to Accelerate LLM Inference On The Fly

arXiv cs.LG / 3/23/2026

📰 News · Tools & Practical Usage · Models & Research

Key Points

  • The paper introduces TTQ, a test-time quantization framework that compresses large foundation models on the fly during inference without requiring retraining.
  • It employs online calibration to achieve activation-aware quantization that adapts to every prompt and downstream task, reducing domain-shift issues.
  • TTQ achieves inference speedups by performing activation-aware quantization at runtime while maintaining or improving performance compared with state-of-the-art baselines.
  • The authors conduct experiments showing TTQ outperforms existing activation- and calibration-based quantization methods on large models.

Abstract

To tackle the huge computational demand of large foundation models, activation-aware compression techniques that require no retraining have been introduced. However, since these methods rely heavily on calibration data, domain-shift issues may arise on unseen downstream tasks. We propose a test-time quantization (TTQ) framework that compresses large models on the fly at inference time to resolve this issue. With efficient online calibration, instant activation-aware quantization can adapt to every prompt regardless of the downstream task while still achieving inference speedup. Several experiments demonstrate that TTQ improves quantization performance over state-of-the-art baselines.
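The paper itself is not quoted beyond the abstract, but the core idea it describes — calibrating per-channel activation statistics from the incoming prompt and using them to guide weight quantization on the fly — can be sketched roughly as below. This is a minimal, hypothetical illustration in the spirit of activation-aware methods such as AWQ, not the paper's actual TTQ algorithm; the function names, the square-root scaling, and the 4-bit setting are all assumptions for illustration.

```python
import numpy as np

def online_calibration(x_batch):
    """Hypothetical online calibration step: collect per-input-channel
    activation magnitudes from the current prompt's activations."""
    return np.abs(x_batch).mean(axis=0) + 1e-8  # avoid division by zero

def activation_aware_quantize(W, act_scale, bits=4):
    """Sketch of activation-aware weight quantization (AWQ-style, assumed):
    channels that see large activations are up-scaled before rounding so
    their weights lose less precision, then the scale is divided back out."""
    s = np.sqrt(act_scale)                     # per-input-channel scale (heuristic)
    Ws = W * s                                 # protect salient channels
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    step = np.abs(Ws).max(axis=1, keepdims=True) / qmax   # per-row step size
    q = np.clip(np.round(Ws / step), -qmax - 1, qmax)     # integer codes
    return (q * step) / s                      # dequantized approximation of W

# Usage: quantize one layer's weights using the current prompt's activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                   # toy weight matrix
x = rng.normal(size=(4, 16))                   # toy prompt activations
W_hat = activation_aware_quantize(W, online_calibration(x), bits=4)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

Because calibration uses only the live prompt's activations, there is no offline calibration set and hence no domain gap between calibration data and the downstream task — which is the motivation the abstract gives for doing quantization at test time.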