Anyone have an S3-compatible store that actually saturates H100s without the AWS egress tax? [R]

Reddit r/MachineLearning / 4/9/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A team training on a Lambda Labs H100 cluster is struggling with high AWS S3 egress costs for a >40TB dataset, leading them to consider alternatives.
  • They tried S3-compatible Cloudflare R2, but its inconsistent TTFB causes their data loaders to stall, leaving GPUs idle for about 20% of each epoch.
  • The post raises the question of whether there is any “zero-egress” (or low-egress) storage option that can sustain the latency/throughput needed for high-speed streaming training.
  • The implied solution direction is potentially building a custom NVMe caching layer to hide storage latency and keep the GPUs saturated.
  • The discussion frames the issue as an infrastructure/throughput bottleneck rather than a model-training problem, emphasizing end-to-end data pipeline performance.

We’re training on a cluster at Lambda Labs, but our main dataset (over 40TB) is sitting in AWS S3. The egress fees are high, so we tried serving it from Cloudflare R2 instead. The problem is R2’s TTFB is all over the place, so our data loader is constantly waiting on I/O and the GPUs sit idle for about 20% of each epoch.

Is there a zero-egress alternative that actually has the throughput/latency for high-speed streaming? Or are we stuck building a custom NVMe cache layer?
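The NVMe cache layer the post mentions is conceptually simple: serve shards from local disk when present, fall back to the remote store otherwise, and prefetch upcoming shards in the background so remote TTFB variance is hidden from the data loader. A minimal sketch, assuming a `fetch_fn` placeholder for the remote GET (any S3/R2 client would slot in; the class name and parameters here are illustrative, not from the post):

```python
import os
from concurrent.futures import ThreadPoolExecutor

class NVMeReadThroughCache:
    """Read-through cache: serve shards from local NVMe if cached,
    otherwise fetch from object storage and persist for reuse."""

    def __init__(self, fetch_fn, cache_dir, prefetch_workers=8):
        # fetch_fn(key) -> bytes stands in for the remote read
        # (e.g. an S3/R2 GET); it is a hypothetical callable here.
        self.fetch_fn = fetch_fn
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)
        self.pool = ThreadPoolExecutor(max_workers=prefetch_workers)

    def _path(self, key):
        return os.path.join(self.cache_dir, key.replace("/", "_"))

    def get(self, key):
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        data = self.fetch_fn(key)
        # Write via a temp file and atomic rename so a concurrent
        # reader never sees a partially written shard.
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)
        return data

    def prefetch(self, keys):
        # Warm the cache ahead of the data loader so variable TTFB
        # on the remote store is hidden behind local NVMe reads.
        return [self.pool.submit(self.get, k) for k in keys]
```

With shards prefetched a few steps ahead of consumption, the loader's critical path is a local read, and the remote store only has to keep up with aggregate throughput rather than per-request latency.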

submitted by /u/regentwells