Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training
arXiv cs.LG · March 20, 2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- Tula is an online service that automatically optimizes training time, cost, and convergence quality for large-batch distributed training of convolutional models, combining parallel-systems modeling with statistical performance prediction.
- It predicts training time and cost with 7.5-14% error across multiple models, enabling it to identify the optimal batch size for a given set of resources and data.
- It achieves up to 20x speedup and roughly 9% average improvement in test accuracy over standard large-batch training on various vision tasks, addressing the large-batch generalization gap.
- Rather than simply increasing batch size, the method accounts for the knee point in the time/cost-versus-batch-size Pareto curve caused by communication overhead and memory limits.
- By optimizing batch size automatically, Tula reduces training costs and speeds up experimentation, informing infrastructure and scheduling decisions for distributed ML workloads (see the sketch after these key points).
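To make the batch-size trade-off concrete, here is a minimal sketch of how a tool like Tula might pick a global batch size from predicted training time and cost. The analytical step-time model, the convergence model, all constants, and the function names below are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
# Hypothetical sketch of batch-size selection from predicted time and cost.
# All models, constants, and names here are illustrative assumptions,
# not Tula's actual implementation.

def step_time_seconds(batch_size, workers, compute_rate=4.0e3,
                      comm_latency=0.05, bytes_per_param=4,
                      params=25e6, bandwidth=1.25e9):
    """Analytical per-step model: per-worker compute plus gradient all-reduce."""
    per_worker_batch = batch_size / workers
    compute = per_worker_batch / compute_rate                    # forward/backward time
    comm = comm_latency + (bytes_per_param * params) / bandwidth  # communication time
    return compute + comm

def steps_to_converge(batch_size, base_steps=100_000, base_batch=256):
    """Statistical assumption: steps to converge shrink sub-linearly with
    batch size (diminishing returns of large batches on convergence)."""
    return base_steps * (base_batch / batch_size) ** 0.5

def training_time_and_cost(batch_size, workers, price_per_worker_hour=2.5):
    """Predicted end-to-end training time (s) and cluster cost ($)."""
    t = steps_to_converge(batch_size) * step_time_seconds(batch_size, workers)
    cost = workers * price_per_worker_hour * t / 3600.0
    return t, cost

def pick_batch_size(workers, max_batch_per_worker=512,
                    candidates=(256, 512, 1024, 2048, 4096, 8192)):
    """Scan candidate global batch sizes, drop those exceeding the per-worker
    memory limit, and return the one minimizing predicted training time."""
    feasible = [b for b in candidates if b / workers <= max_batch_per_worker]
    return min(feasible, key=lambda b: training_time_and_cost(b, workers)[0])

if __name__ == "__main__":
    workers = 16
    best = pick_batch_size(workers)
    t, c = training_time_and_cost(best, workers)
    print(f"best global batch size: {best}, "
          f"predicted time: {t / 3600:.1f} h, predicted cost: ${c:.0f}")
```

Under these toy assumptions, total time is the sum of a term that grows with batch size (per-step compute) and one that shrinks with it (fewer steps amortizing the fixed communication cost), so the scan finds an interior optimum rather than always picking the largest feasible batch, mirroring the knee point described above.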
Related Articles

I built an autonomous AI Courtroom using Llama 3.1 8B and CrewAI running 100% locally on my 5070 Ti. The agents debate each other through contextual collaboration.
Reddit r/LocalLLaMA
The Honest Guide to AI Writing Tools in 2026 (What Actually Works)
Dev.to
AI Cybersecurity
Dev.to
Next-Generation LLM Inference Technology: From Flash-MoE to Gemini Flash-Lite, and Local GPU Utilization
Dev.to