AI Navigate

Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness

arXiv cs.LG / 3/16/2026


Key Points

  • We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum for Byzantine-robust distributed optimization under (L_0, L_1)-smoothness.
  • The method combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle both adversarial workers and state-dependent gradient Lipschitz behavior.
  • The paper proves a convergence rate of O(K^{-1/4}) up to a Byzantine bias floor proportional to the robustness coefficient and the gradient heterogeneity.
  • Empirical results on heterogeneous MNIST classification, synthetic (L_0, L_1)-smooth optimization, and character-level language modeling with a small GPT model demonstrate robustness against diverse Byzantine attacks; an ablation study confirms stability across a wide range of momentum and learning-rate choices.
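The key points above describe the overall structure of one server round: workers maintain momentum estimates, the server mixes them via Nearest Neighbor Mixing, aggregates robustly, and takes a normalized step. A minimal sketch of that structure is below; the robust aggregator (coordinate-wise median) and the NNM details are illustrative stand-ins inferred from this summary, not the paper's exact choices.

```python
import numpy as np

def nnm(momenta, k):
    """Nearest Neighbor Mixing (illustrative): replace each worker's vector
    with the average of its k nearest vectors (itself included)."""
    n = len(momenta)
    mixed = []
    for i in range(n):
        dists = [np.linalg.norm(momenta[i] - momenta[j]) for j in range(n)]
        nearest = np.argsort(dists)[:k]
        mixed.append(np.mean([momenta[j] for j in nearest], axis=0))
    return mixed

def byz_nsgdm_step(x, momenta, grads, beta, lr, k):
    """One hypothetical round: momentum update, NNM, robust aggregation,
    then a normalized step (step length equals lr regardless of gradient scale)."""
    # Each worker updates its momentum with its stochastic gradient;
    # a Byzantine worker may instead report an arbitrary vector.
    momenta = [beta * m + (1 - beta) * g for m, g in zip(momenta, grads)]
    # Server mixes momenta with nearest neighbors, then aggregates robustly
    # (coordinate-wise median used here purely as an example aggregator).
    mixed = nnm(momenta, k)
    agg = np.median(np.stack(mixed), axis=0)
    # Normalization keeps the step length bounded even when gradients blow up,
    # which is the point of normalized SGD under (L_0, L_1)-smoothness.
    x = x - lr * agg / (np.linalg.norm(agg) + 1e-12)
    return x, momenta
```

As a sanity check, running this on a toy quadratic with one worker reporting a huge adversarial vector still drives the iterate toward the minimizer, since the median discards the outlier after mixing.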

Abstract

We consider distributed optimization under Byzantine attacks in the presence of (L_0,L_1)-smoothness, a generalization of standard L-smoothness that captures functions with state-dependent gradient Lipschitz constants. We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum that achieves robustness against Byzantine workers while maintaining convergence guarantees. Our algorithm combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle both the challenges posed by (L_0,L_1)-smoothness and Byzantine adversaries. We prove that Byz-NSGDM achieves a convergence rate of O(K^{-1/4}) up to a Byzantine bias floor proportional to the robustness coefficient and gradient heterogeneity. Experimental validation on heterogeneous MNIST classification, synthetic (L_0,L_1)-smooth optimization, and character-level language modeling with a small GPT model demonstrates the effectiveness of our approach against various Byzantine attack strategies. An ablation study further shows that Byz-NSGDM is robust across a wide range of momentum and learning rate choices.
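To make the smoothness condition in the abstract concrete: a common form of the (L_0, L_1)-smoothness definition (an assumption here, since the paper's exact statement is not quoted) bounds the Hessian by the gradient norm, ||∇²f(x)|| ≤ L_0 + L_1 ||∇f(x)||. The scalar function f(x) = e^x satisfies this with L_0 = 0, L_1 = 1, even though no global L-smoothness constant exists for it; a quick numeric check:

```python
import numpy as np

# Numeric check of (L_0, L_1)-smoothness for f(x) = exp(x), assuming the
# common definition |f''(x)| <= L_0 + L_1 * |f'(x)|. Here f' = f'' = exp,
# so the bound holds exactly with L_0 = 0, L_1 = 1, while f'' is unbounded,
# ruling out any global L-smoothness constant.
xs = np.linspace(-5.0, 10.0, 1000)
grad = np.exp(xs)   # f'(x)
hess = np.exp(xs)   # f''(x)
L0, L1 = 0.0, 1.0
assert np.all(np.abs(hess) <= L0 + L1 * np.abs(grad) + 1e-9)
```

Functions of this kind motivate normalized methods: a fixed-step gradient method can diverge where the local curvature scales with the gradient, while a normalized step remains bounded.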