Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels
arXiv stat.ML / 2026/3/24
Key Points
- The paper shows that scaling Weight-Decomposed Low-Rank Adaptation (DoRA) is bottlenecked by the row-wise weight-norm computation, which current implementations perform by materializing the dense product BA, causing large transient memory spikes at high input dimensions and ranks.
- It introduces a factored norm formulation that expands the required squared row norms into base, cross, and Gram terms, using only O(d_out·r + r^2) intermediates and never forming the dense product.
- It also presents fused Triton kernels that collapse multiple DoRA composition steps into a single pass, cutting memory traffic by about 4x and improving numerical stability in common near-unity rescaling regimes.
- Experiments across six 8–32B vision-language models on multiple NVIDIA GPU generations show 1.5–2.0x faster inference and 1.5–1.9x faster gradient computation versus Hugging Face PEFT, with up to ~7 GB lower peak VRAM and near-identical outputs/training behavior.
- Microbenchmarks further validate 1.5–2.7x speedups for compose-kernel operations across GPUs, with high final-logit cosine similarity and small loss deltas over long training runs.
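The factored norm idea can be illustrated with a small NumPy sketch. For a frozen weight W0 (d_out × d_in) and adapters B (d_out × r), A (r × d_in), the squared row norm of W0 + BA expands into a base term (precomputable from W0), a cross term needing only the r × d_out matrix A·W0ᵀ, and a Gram term needing only the r × r matrix A·Aᵀ. The function names here are illustrative, not from the paper, and the real implementation uses fused Triton kernels rather than NumPy:

```python
import numpy as np

def dora_row_norms_naive(W0, B, A):
    # Baseline: materializes the dense d_out x d_in product BA.
    W = W0 + B @ A
    return np.linalg.norm(W, axis=1)

def dora_row_norms_factored(W0, B, A):
    # ||W0[i,:] + (BA)[i,:]||^2 = base_i + cross_i + gram_i,
    # computed without ever forming the dense product BA.
    base = np.sum(W0 * W0, axis=1)        # (d_out,) precomputable once
    C = A @ W0.T                          # (r, d_out) cross intermediate
    cross = 2.0 * np.sum(B * C.T, axis=1) # 2 * B[i,:] . C[:,i]
    G = A @ A.T                           # (r, r) Gram intermediate
    gram = np.sum((B @ G) * B, axis=1)    # B[i,:] G B[i,:]^T per row
    return np.sqrt(base + cross + gram)
```

At large d_in the naive path allocates a full d_out × d_in temporary, while the factored path keeps only the O(d_out·r + r^2) intermediates the paper describes.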
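The composition the fused kernels collapse can likewise be sketched in NumPy: given precomputed row norms, the DoRA forward pass applies the low-rank update through two skinny matmuls and a row-wise rescale, instead of adding a dense BA to the weight. This is a hedged reference sketch under assumed shapes (m and norms of length d_out), not the paper's kernel code:

```python
import numpy as np

def dora_forward(x, W0, B, A, m, norms):
    # y = x @ (m/||W||_row * (W0 + B A)).T, without forming B @ A.
    base = x @ W0.T           # frozen-weight matmul
    lora = (x @ A.T) @ B.T    # low-rank path: two skinny matmuls
    scale = m / norms         # row-wise rescale, near 1.0 in practice
    return (base + lora) * scale
```

A fused kernel performs these steps in a single pass over the activations, which is where the reported ~4x reduction in memory traffic comes from; the near-unity `scale` regime is also where the paper reports improved numerical stability.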
