FAAR: Format-Aware Adaptive Rounding for NVFP4

arXiv cs.AI / 3/25/2026


Key Points

  • The paper addresses the challenge of deploying LLMs on edge devices using ultra-low-bit NVFP4 quantization, where standard rounding strategies ignore the format’s non-uniform numeric grid and lead to larger quantization errors.
  • It introduces Format-Aware Adaptive Rounding (FAAR), a learnable rounding method that incorporates NVFP4 grid non-uniformity and uses loss-gradient–guided rounding decisions to approximate optimal quantization.
  • To further reduce the performance gap, the authors propose a 2-stage Format Alignment (2FA) fine-tuning approach that aligns LLM parameters layer-by-layer to the NVFP4 numerical space.
  • The method shows strong empirical gains with low training overhead (about 4 GPU hours on Llama3-1B) and reports perplexity reductions versus Round-to-Nearest on WikiText-2 (e.g., 14.28→12.60 for Llama3-1B and 23.06→21.27 for Qwen3-1.7B).
  • Across multiple zero-shot downstream tasks, FAAR is reported to outperform state-of-the-art quantization approaches consistently.
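To make the grid non-uniformity concrete, here is a minimal Python sketch. It assumes the NVFP4 element type is FP4 E2M1 (code points ±{0, 0.5, 1, 1.5, 2, 3, 4, 6}, before block scaling); the `adaptive_round` helper is a hypothetical caricature of gradient-guided rounding for illustration, not the paper's learned FAAR method.

```python
# NVFP4's FP4 E2M1 element grid is non-uniform: steps widen from 0.5
# near zero to 2.0 between the largest magnitudes (4 and 6).
FP4_GRID = [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
            0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_nearest(x):
    """Round-to-Nearest (RTN) onto the non-uniform FP4 grid."""
    return min(FP4_GRID, key=lambda g: abs(g - x))

def adaptive_round(x, loss_grad):
    """Toy gradient-guided rounding (hypothetical, not the paper's method):
    choose the floor or ceil neighbor whose first-order loss change,
    loss_grad * (q - x), is smaller."""
    lo = max(g for g in FP4_GRID if g <= x)
    hi = min(g for g in FP4_GRID if g >= x)
    return lo if loss_grad * (lo - x) <= loss_grad * (hi - x) else hi

# Between 4 and 6 the grid step is 2.0, so RTN error can reach 1.0,
# four times the worst case in the [0, 1] region (step 0.5).
print(round_to_nearest(4.9))      # -> 4.0 (nearest code point)
print(adaptive_round(4.9, -1.0))  # -> 6.0 (gradient favors rounding up)
```

The sketch shows why format-aware rounding matters: on a non-uniform grid, the nearest code point is not always the one that minimizes downstream loss, which is the gap FAAR's learned rounding decisions target.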

Abstract

Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low precision formats such as NVFP4 offer a promising solution for reducing memory footprint and accelerating computation. However, existing quantization methods typically rely on conventional rounding strategies and fail to account for the non-uniformity of the NVFP4 numerical grid, resulting in suboptimal rounding decisions and amplified quantization errors. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored for the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process. By adaptively adjusting rounding decisions guided by loss gradients, our method effectively approximates the theoretically optimal quantization. To complement FAAR, we introduce a 2-stage Format Alignment (2FA) fine-tuning scheme that aligns LLM parameters layer-by-layer to the NVFP4 numerical space, further narrowing the performance gap. Remarkably, this learnable optimization incurs a minimal training overhead of only 4 GPU hours on Llama3-1B. Extensive experiments demonstrate the effectiveness of our approach. Compared with Round-to-Nearest (RTN), our method reduces perplexity on WikiText-2 from 14.28 to 12.60 on Llama3-1B and from 23.06 to 21.27 on Qwen3-1.7B. Additionally, our method consistently outperforms state-of-the-art approaches across various zero-shot downstream tasks.