AI Navigate

Byzantine-Robust Optimization under $(L_0, L_1)$-Smoothness

arXiv cs.LG / 3/16/2026


Key Points

  • We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum for Byzantine-robust distributed optimization under (L_0, L_1)-smoothness.
  • The method combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle both adversarial workers and state-dependent gradient Lipschitz behavior.
  • The paper proves a convergence rate of O(K^{-1/4}) up to a Byzantine bias floor proportional to the robustness coefficient and the gradient heterogeneity.
  • Empirical results on heterogeneous MNIST classification, synthetic (L_0, L_1)-smooth optimization, and character-level language modeling with a small GPT model demonstrate robustness against diverse Byzantine attacks; an ablation study confirms stability across a wide range of momentum and learning-rate choices.
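The key points above describe the overall structure of one server round: workers maintain momentum estimates, the server mixes them via Nearest Neighbor Mixing, aggregates robustly, and takes a normalized step. A minimal sketch of that structure is below; the robust aggregator (coordinate-wise median) and the NNM details are illustrative stand-ins inferred from this summary, not the paper's exact choices.

```python
import numpy as np

def nnm(momenta, k):
    """Nearest Neighbor Mixing (illustrative): replace each worker's vector
    with the average of its k nearest vectors (itself included)."""
    n = len(momenta)
    mixed = []
    for i in range(n):
        dists = [np.linalg.norm(momenta[i] - momenta[j]) for j in range(n)]
        nearest = np.argsort(dists)[:k]
        mixed.append(np.mean([momenta[j] for j in nearest], axis=0))
    return mixed

def byz_nsgdm_step(x, momenta, grads, beta, lr, k):
    """One hypothetical round: momentum update, NNM, robust aggregation,
    then a normalized step (step length equals lr regardless of gradient scale)."""
    # Each worker updates its momentum with its stochastic gradient;
    # a Byzantine worker may instead report an arbitrary vector.
    momenta = [beta * m + (1 - beta) * g for m, g in zip(momenta, grads)]
    # Server mixes momenta with nearest neighbors, then aggregates robustly
    # (coordinate-wise median used here purely as an example aggregator).
    mixed = nnm(momenta, k)
    agg = np.median(np.stack(mixed), axis=0)
    # Normalization keeps the step length bounded even when gradients blow up,
    # which is the point of normalized SGD under (L_0, L_1)-smoothness.
    x = x - lr * agg / (np.linalg.norm(agg) + 1e-12)
    return x, momenta
```

As a sanity check, running this on a toy quadratic with one worker reporting a huge adversarial vector still drives the iterate toward the minimizer, since the median discards the outlier after mixing.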

Abstract

We consider distributed optimization under Byzantine attacks in the presence of (L_0,L_1)-smoothness, a generalization of standard L-smoothness that captures functions with state-dependent gradient Lipschitz constants. We propose Byz-NSGDM, a normalized stochastic gradient descent method with momentum that achieves robustness against Byzantine workers while maintaining convergence guarantees. Our algorithm combines momentum normalization with Byzantine-robust aggregation enhanced by Nearest Neighbor Mixing (NNM) to handle both the challenges posed by (L_0,L_1)-smoothness and Byzantine adversaries. We prove that Byz-NSGDM achieves a convergence rate of O(K^{-1/4}) up to a Byzantine bias floor proportional to the robustness coefficient and gradient heterogeneity. Experimental validation on heterogeneous MNIST classification, synthetic (L_0,L_1)-smooth optimization, and character-level language modeling with a small GPT model demonstrates the effectiveness of our approach against various Byzantine attack strategies. An ablation study further shows that Byz-NSGDM is robust across a wide range of momentum and learning rate choices.
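To make the smoothness condition in the abstract concrete: a common form of the (L_0, L_1)-smoothness definition (an assumption here, since the paper's exact statement is not quoted) bounds the Hessian by the gradient norm, ||∇²f(x)|| ≤ L_0 + L_1 ||∇f(x)||. The scalar function f(x) = e^x satisfies this with L_0 = 0, L_1 = 1, even though no global L-smoothness constant exists for it; a quick numeric check:

```python
import numpy as np

# Numeric check of (L_0, L_1)-smoothness for f(x) = exp(x), assuming the
# common definition |f''(x)| <= L_0 + L_1 * |f'(x)|. Here f' = f'' = exp,
# so the bound holds exactly with L_0 = 0, L_1 = 1, while f'' is unbounded,
# ruling out any global L-smoothness constant.
xs = np.linspace(-5.0, 10.0, 1000)
grad = np.exp(xs)   # f'(x)
hess = np.exp(xs)   # f''(x)
L0, L1 = 0.0, 1.0
assert np.all(np.abs(hess) <= L0 + L1 * np.abs(grad) + 1e-9)
```

Functions of this kind motivate normalized methods: a fixed-step gradient method can diverge where the local curvature scales with the gradient, while a normalized step remains bounded.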