Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
arXiv stat.ML / 2026/3/24
Key Points
- The paper shows that scaling Weight-Decomposed Low-Rank Adaptation (DoRA) is bottlenecked by the row-wise weight-norm computation, which current implementations perform by materializing the dense product BA, causing large transient memory spikes at high input dimensions and ranks.
- It introduces a factored norm formulation that expands the required squared row norms into base, cross, and Gram terms, using only O(d_out·r + r^2) intermediates and never forming the dense product.
- It also presents fused Triton kernels that collapse multiple DoRA composition steps into a single pass, cutting memory traffic by about 4x and improving numerical stability in common near-unity rescaling regimes.
- Experiments across six 8–32B vision-language models on multiple NVIDIA GPU generations show 1.5–2.0x faster inference and 1.5–1.9x faster gradient computation versus Hugging Face PEFT, with up to ~7 GB lower peak VRAM and near-identical outputs/training behavior.
- Microbenchmarks further validate 1.5–2.7x speedups for compose-kernel operations across GPUs, with high final-logit cosine similarity and small loss deltas over long training runs.
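The factored norm idea can be illustrated with a small NumPy sketch. For a frozen weight W0 (d_out × d_in) and adapters B (d_out × r), A (r × d_in), the squared row norm of W0 + BA expands into a base term (precomputable from W0), a cross term needing only the r × d_out matrix A·W0ᵀ, and a Gram term needing only the r × r matrix A·Aᵀ. The function names here are illustrative, not from the paper, and the real implementation uses fused Triton kernels rather than NumPy:

```python
import numpy as np

def dora_row_norms_naive(W0, B, A):
    # Baseline: materializes the dense d_out x d_in product BA.
    W = W0 + B @ A
    return np.linalg.norm(W, axis=1)

def dora_row_norms_factored(W0, B, A):
    # ||W0[i,:] + (BA)[i,:]||^2 = base_i + cross_i + gram_i,
    # computed without ever forming the dense product BA.
    base = np.sum(W0 * W0, axis=1)        # (d_out,) precomputable once
    C = A @ W0.T                          # (r, d_out) cross intermediate
    cross = 2.0 * np.sum(B * C.T, axis=1) # 2 * B[i,:] . C[:,i]
    G = A @ A.T                           # (r, r) Gram intermediate
    gram = np.sum((B @ G) * B, axis=1)    # B[i,:] G B[i,:]^T per row
    return np.sqrt(base + cross + gram)
```

At large d_in the naive path allocates a full d_out × d_in temporary, while the factored path keeps only the O(d_out·r + r^2) intermediates the paper describes.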
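The composition the fused kernels collapse can likewise be sketched in NumPy: given precomputed row norms, the DoRA forward pass applies the low-rank update through two skinny matmuls and a row-wise rescale, instead of adding a dense BA to the weight. This is a hedged reference sketch under assumed shapes (m and norms of length d_out), not the paper's kernel code:

```python
import numpy as np

def dora_forward(x, W0, B, A, m, norms):
    # y = x @ (m/||W||_row * (W0 + B A)).T, without forming B @ A.
    base = x @ W0.T           # frozen-weight matmul
    lora = (x @ A.T) @ B.T    # low-rank path: two skinny matmuls
    scale = m / norms         # row-wise rescale, near 1.0 in practice
    return (base + lora) * scale
```

A fused kernel performs these steps in a single pass over the activations, which is where the reported ~4x reduction in memory traffic comes from; the near-unity `scale` regime is also where the paper reports improved numerical stability.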
