Robust and Fast Training via Per-Sample Clipping
arXiv stat.ML / 5/5/2026
💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces PS-Clip-SGD, an SGD variant built on a robust gradient estimator that clips each per-sample gradient before averaging, improving training stability under heavy-tailed gradient noise (a minimal sketch of the idea follows this list).
- The authors provide theoretical results: optimal in-expectation convergence rates for non-convex optimization, and high-probability convergence guarantees with only a polylogarithmic dependence on the failure probability.
- Experiments indicate that PS-Clip-SGD trains AlexNet on CIFAR-100 more effectively than both SGD with momentum and standard (global) gradient clipping, even after accounting for the extra compute of per-sample clipping.
- The study also finds that, under gradient accumulation, clipping the gradient at each accumulation step (i.e., at the mini-batch level) can improve performance at essentially no additional computational cost, challenging the common practice of clipping only the fully accumulated gradient (see the second sketch below).
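To make the per-sample clipping idea concrete, here is a minimal PyTorch sketch: each sample's gradient is clipped to a fixed norm before averaging, and the average drives a plain SGD update. This illustrates the general technique rather than the authors' PS-Clip-SGD implementation; the helper name `ps_clip_sgd_step`, the learning rate, and the clip threshold `clip_norm` are assumptions.

```python
# Minimal sketch of per-sample gradient clipping with a plain SGD update.
# Illustrative only -- not the paper's code; names and defaults are assumed.
import torch

def ps_clip_sgd_step(model, loss_fn, xb, yb, lr=0.1, clip_norm=1.0):
    """One step: clip each sample's gradient to `clip_norm`, average, descend."""
    params = [p for p in model.parameters() if p.requires_grad]
    clipped_sum = [torch.zeros_like(p) for p in params]  # sum of clipped grads

    for x, y in zip(xb, yb):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        # Rescale this sample's gradient so its global norm is at most clip_norm.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for acc, g in zip(clipped_sum, grads):
            acc += g * scale

    # Gradient step with the mean of the clipped per-sample gradients.
    with torch.no_grad():
        for p, acc in zip(params, clipped_sum):
            p.add_(acc, alpha=-lr / len(xb))
```

The explicit loop over samples is only for clarity; in practice per-sample gradients are usually obtained with vectorized tooling (e.g. torch.func.grad combined with vmap), which is where the modest per-sample overhead mentioned above comes from.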
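The second sketch illustrates the gradient-accumulation finding: clip every micro-batch gradient as it is accumulated, rather than clipping only the fully accumulated gradient. Again this is a hedged illustration, not the paper's exact recipe; `accumulate_with_clipping`, its arguments, and the averaging convention are assumptions.

```python
# Sketch: clip each micro-batch gradient during accumulation, then step once.
# Illustrative only; `micro_batches` is assumed to be a list of (inputs, targets).
import torch

def accumulate_with_clipping(model, loss_fn, optimizer, micro_batches, clip_norm=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    accum = [torch.zeros_like(p) for p in params]  # running clipped-grad sum

    for xb, yb in micro_batches:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        # Clip this accumulation step's gradient before adding it to the sum,
        # instead of clipping once after all steps are accumulated.
        torch.nn.utils.clip_grad_norm_(params, clip_norm)
        for acc, p in zip(accum, params):
            acc += p.grad

    # Write back the averaged gradient and take a single optimizer step.
    for p, acc in zip(params, accum):
        p.grad = acc / len(micro_batches)
    optimizer.step()
```

Because each micro-batch gradient is already materialized during accumulation, the extra clip is just a norm computation and a rescale, which is why it comes at essentially no additional cost.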
Related Articles

Singapore's Fraud Frontier: Why AI Scam Detection Demands Regulatory Precision
Dev.to

First experience with Building Apps with Google AI Studio: Incredibly simple and intuitive.
Dev.to

Meta will use AI to analyze height and bone structure to identify if users are underage
TechCrunch

How AI is Changing the Way We Code in 2026: The Shift from Syntax to Strategy
Dev.to

13 CLAUDE.md Rules That Make AI Write Modern PHP (Not PHP 5 Resurrected)
Dev.to