Shard the Gradient, Scale the Model: Serverless Federated Aggregation via Gradient Partitioning

arXiv cs.AI / 4/27/2026


Key Points

  • The paper identifies a key scalability bottleneck for federated learning on serverless platforms: existing designs require each aggregator to hold the full model gradient in memory, which fails once gradients exceed per-function memory limits like AWS Lambda’s ~10 GB.
  • It proposes GradsSharding, which partitions the gradient tensor into M shards and averages each shard independently in separate serverless functions that still receive contributions from all clients.
  • The authors claim FedAvg’s element-wise nature makes the sharded approach produce bit-identical aggregation results to tree-based methods, so model accuracy is invariant by construction.
  • Experiments on HPC clusters and real AWS Lambda deployments (gradient/model sizes from 43 MB to 5 GB) show a cost crossover around 500 MB, roughly a 2.7x cost reduction at VGG-16 scale, and the ability to aggregate models beyond serverless memory ceilings where prior architectures cannot run at all.
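The core correctness argument is simple to demonstrate: because FedAvg averages gradients element-wise, splitting the gradient into shards and averaging each shard independently commutes with averaging the whole tensor. A minimal sketch (variable names like `M` and `shards` are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
num_clients, dim, M = 4, 12, 3            # M = number of shards

# Each client submits a gradient of the same shape.
grads = [rng.standard_normal(dim) for _ in range(num_clients)]

# Baseline: full FedAvg, as a single (or tree-based) aggregator computes it.
full_avg = np.mean(grads, axis=0)

# Sharded: each "function" m averages shard m across all clients;
# the global model update is the concatenation of the shard averages.
shards = [np.mean([np.array_split(g, M)[m] for g in grads], axis=0)
          for m in range(M)]
sharded_avg = np.concatenate(shards)

# Per element, both paths perform the same additions in the same order
# and the same division, so the results match bit-for-bit.
assert np.array_equal(full_avg, sharded_avg)
```

This is why the paper can claim accuracy is invariant by construction: no approximation is introduced, only a repartitioning of identical element-wise work.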

Abstract

Federated learning (FL) aggregation on serverless platforms faces a hard scalability ceiling: existing architectures (lambda-FL, LIFL) partition clients across aggregators, but every aggregator must hold the complete model gradient in memory. When gradients exceed the per-function memory limit (e.g., 10 GB on AWS Lambda), aggregation becomes infeasible regardless of tree depth or branching factor. We propose GradsSharding, which instead partitions the gradient tensor into M shards, each averaged independently by a serverless function that receives contributions from all clients. Because FedAvg averaging is element-wise, this produces bit-identical results to tree-based approaches, so model accuracy is invariant by construction. Per-function memory is bounded at O(|θ|/M), independent of client count, enabling aggregation of arbitrarily large models. We evaluate GradsSharding against lambda-FL and LIFL through HPC experiments and real AWS Lambda deployments across model sizes from 43 MB to 5 GB. Results show a cost crossover at approximately 500 MB gradient size, 2.7x cost reduction at VGG-16 scale, and that GradsSharding is the only architecture that remains deployable beyond the serverless memory ceiling.
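The O(|θ|/M) memory bound follows from how a shard aggregator can consume client contributions: it streams them one at a time into a running sum of shard size, so peak memory never depends on the number of clients. A hedged sketch of one such aggregator (the function name and streaming interface are assumptions for illustration; the paper's actual implementation may differ):

```python
import numpy as np

def aggregate_shard(client_shard_stream, shard_len, num_clients):
    """Average one gradient shard across all clients.

    Holds only a single shard-sized accumulator, so peak memory is
    O(|theta|/M) regardless of how many clients contribute.
    """
    acc = np.zeros(shard_len)
    for shard in client_shard_stream:   # one |theta|/M-sized array at a time
        acc += shard
    return acc / num_clients

# Usage: 100 clients, but the aggregator never holds more than ~2 shards.
rng = np.random.default_rng(1)
num_clients, shard_len = 100, 8
contributions = [rng.standard_normal(shard_len) for _ in range(num_clients)]

avg_shard = aggregate_shard(iter(contributions), shard_len, num_clients)
assert np.allclose(avg_shard, np.mean(contributions, axis=0))
```

In a real deployment each of the M functions would run this loop over its own slice of the client uploads, and the coordinator would concatenate the M results into the global update.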