Cloudless-Training: A Framework to Improve Efficiency of Geo-Distributed ML Training
arXiv cs.AI / 4/28/2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper addresses inefficiencies in geo-distributed ML training, which stem mainly from the lack of elastic scheduling across multi-regional cloud resources and from WAN communication overhead under bandwidth limits and fluctuations.
- It proposes “Cloudless-Training,” a parameter-server (PS) based framework that uses a two-layer architecture, separating a control plane from the physical training plane, to enable elastic scheduling and WAN-aware communication in a serverless manner.
- Cloudless-Training introduces an elastic scheduling strategy that adapts training workflows to heterogeneous cloud resources and to where pre-existing training datasets are stored (a toy placement sketch follows this list).
- It also contributes two cross-cloud synchronization methods, ASGD-GA (asynchronous SGD with gradient accumulation) and inter-PS model averaging (MA), which improve coordination across model partitions while preserving model correctness (see the synchronization sketch after this list).
- Implemented with OpenFaaS and evaluated on Tencent Cloud, the approach shows substantial gains, including a 9.2%–24.0% reduction in training cost and up to 1.7× faster synchronization/training versus baselines.
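To make the scheduling idea concrete, here is a minimal Python sketch of what locality-aware elastic placement could look like, assuming a scheduler that assigns workers to regions in proportion to the data each region already holds, capped by free GPU capacity. Everything here (the `Region` type, `place_workers`, the region names) is a hypothetical illustration, not the paper's actual scheduler.

```python
# Hypothetical sketch of locality-aware elastic placement (not the paper's
# scheduler): put workers near pre-existing dataset shards, within capacity.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    free_gpus: int           # currently available accelerators
    dataset_fraction: float  # fraction of the training data stored here

def place_workers(regions: list[Region], total_workers: int) -> dict[str, int]:
    """Assign workers to regions proportionally to local data,
    capped by each region's free capacity (illustrative heuristic only)."""
    placement: dict[str, int] = {}
    remaining = total_workers
    # Prefer regions holding more data, minimizing cross-region (WAN) reads.
    for r in sorted(regions, key=lambda r: r.dataset_fraction, reverse=True):
        want = round(total_workers * r.dataset_fraction)
        take = min(want, r.free_gpus, remaining)
        if take > 0:
            placement[r.name] = take
            remaining -= take
    # Spill any leftover workers into whatever capacity remains.
    for r in regions:
        if remaining == 0:
            break
        spare = r.free_gpus - placement.get(r.name, 0)
        take = min(spare, remaining)
        if take > 0:
            placement[r.name] = placement.get(r.name, 0) + take
            remaining -= take
    return placement

if __name__ == "__main__":
    regions = [Region("ap-guangzhou", 8, 0.6),
               Region("na-toronto", 4, 0.3),
               Region("eu-frankfurt", 4, 0.1)]
    # -> {'ap-guangzhou': 7, 'na-toronto': 4, 'eu-frankfurt': 1}
    print(place_workers(regions, 12))
```

A real scheduler would also weigh inter-region WAN bandwidth and resource-price heterogeneity, which this toy heuristic ignores.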
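The two synchronization methods can likewise be sketched. Below, a per-cloud `ParameterServer` applies arriving gradients without a global barrier (asynchronous SGD), workers accumulate `k` micro-batch gradients locally before each WAN push (the gradient-accumulation half of ASGD-GA), and `inter_ps_model_average` periodically averages the regional replicas (inter-PS MA). All class and function names are assumptions for illustration; this is a toy sketch of the mechanisms the key points describe, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code) of the two cross-cloud
# synchronization methods named above; all names are hypothetical.
import numpy as np

class ParameterServer:
    """One PS per cloud region, holding a model replica."""
    def __init__(self, params: np.ndarray, lr: float = 0.1):
        self.params = params.copy()
        self.lr = lr

    def async_push(self, grad: np.ndarray) -> np.ndarray:
        # ASGD: apply each arriving gradient immediately, with no barrier
        # across workers, so a slow WAN link stalls only its own worker.
        self.params -= self.lr * grad
        return self.params  # the worker pulls the updated model back

def accumulate_gradients(micro_grads: list[np.ndarray], k: int = 8) -> np.ndarray:
    """ASGD-GA: a worker sums k micro-batch gradients locally and sends
    one averaged gradient per WAN push, amortizing cross-cloud transfer."""
    return sum(micro_grads[:k]) / min(k, len(micro_grads))

def inter_ps_model_average(replicas: list[np.ndarray]) -> np.ndarray:
    """Inter-PS MA: periodically average the per-cloud model replicas so
    regional models stay consistent despite asynchronous local updates."""
    return np.mean(replicas, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ps_a = ParameterServer(np.zeros(4))
    ps_b = ParameterServer(np.zeros(4))
    # Each region trains asynchronously with local gradient accumulation...
    ps_a.async_push(accumulate_gradients([rng.normal(size=4) for _ in range(8)]))
    ps_b.async_push(accumulate_gradients([rng.normal(size=4) for _ in range(8)]))
    # ...then an MA round pulls the regional replicas back together.
    merged = inter_ps_model_average([ps_a.params, ps_b.params])
    ps_a.params, ps_b.params = merged.copy(), merged.copy()
    print(merged)
```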