Abstract
Deploying large language models (LLMs) on edge devices requires extremely low-bit quantization. Ultra-low-precision formats such as NVFP4 offer a promising way to reduce memory footprint and accelerate computation. However, existing quantization methods typically rely on conventional rounding strategies and fail to account for the non-uniformity of the NVFP4 numerical grid, resulting in suboptimal rounding decisions and amplified quantization error. To address this, we propose Format-Aware Adaptive Rounding (FAAR), a learnable rounding strategy tailored to the NVFP4 format. Unlike conventional quantization paradigms, FAAR explicitly incorporates the non-uniform NVFP4 grid into the optimization process: by adaptively adjusting rounding decisions under the guidance of loss gradients, it closely approximates the theoretically optimal quantization. To complement FAAR, we introduce a two-stage Format Alignment (2FA) fine-tuning scheme that aligns LLM parameters layer by layer to the NVFP4 numerical space, further narrowing the performance gap. Remarkably, this learnable optimization incurs a training overhead of only 4 GPU hours on Llama3-1B. Extensive experiments demonstrate the effectiveness of our approach: compared with round-to-nearest (RTN), it reduces WikiText-2 perplexity from 14.28 to 12.60 on Llama3-1B and from 23.06 to 21.27 on Qwen3-1.7B, and it consistently outperforms state-of-the-art methods across a range of zero-shot downstream tasks.
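To make the core idea concrete, the following is a minimal PyTorch sketch of learnable rounding on the non-uniform NVFP4 (E2M1) grid, illustrating the abstract's claim that rounding decisions can be learned from loss gradients. It is not the paper's FAAR implementation: the `AdaptiveRounder` class, the rectified-sigmoid relaxation (borrowed from AdaRound), and the per-tensor scale are assumptions made for illustration; FAAR's actual parameterization, objective, and the 2FA fine-tuning stages are detailed in the paper body.

```python
import torch

# Signed NVFP4 (E2M1) code points. The grid is non-uniform: the step is 0.5
# near zero but grows to 2.0 at the top of the range, which is why plain
# round-to-nearest in the scaled domain can be a suboptimal decision.
NVFP4_GRID = torch.tensor([-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
                            0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

class AdaptiveRounder(torch.nn.Module):
    """Learnable up/down rounding onto a non-uniform grid (illustrative sketch).

    For each scaled weight we locate its two neighbouring grid points and
    learn a logit `v` that chooses between them via a rectified sigmoid.
    """
    def __init__(self, w, scale, grid=NVFP4_GRID):
        super().__init__()
        self.scale = scale
        x = (w / scale).clamp(grid[0].item(), grid[-1].item())
        # Index of the largest grid point <= x (the lower neighbour).
        idx = (torch.searchsorted(grid, x, right=True) - 1).clamp(0, len(grid) - 2)
        self.lo, self.hi = grid[idx], grid[idx + 1]
        # Initialise `v` so the soft value starts at the original weight and
        # the hard decision starts at round-to-nearest.
        frac = (x - self.lo) / (self.hi - self.lo)
        self.v = torch.nn.Parameter(torch.logit((frac + 0.1) / 1.2))

    def soft_round(self):
        # Differentiable rounding: h in [0, 1] interpolates lo -> hi.
        h = torch.clamp(1.2 * torch.sigmoid(self.v) - 0.1, 0.0, 1.0)
        return (self.lo + h * (self.hi - self.lo)) * self.scale

    def hard_round(self):
        # After training, commit each weight to its learned neighbour.
        return torch.where(self.v >= 0, self.hi, self.lo) * self.scale


# Usage sketch: minimise layer reconstruction error on calibration data.
w = torch.randn(64, 64)
scale = w.abs().max() / 6.0   # per-tensor scale for brevity; NVFP4 uses per-block scales
rounder = AdaptiveRounder(w, scale)
opt = torch.optim.Adam([rounder.v], lr=1e-2)
x = torch.randn(256, 64)      # stand-in for calibration activations
for _ in range(200):
    loss = ((x @ w.T - x @ rounder.soft_round().T) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
w_q = rounder.hard_round()    # weights now lie exactly on the NVFP4 grid
```

Because the grid spacing varies across the range, the learned decision can legitimately differ from round-to-nearest: moving a weight to the farther neighbour may reduce the layer's output error even though it increases that weight's own rounding error, which is the intuition behind format-aware rounding.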