We’re training on a cluster in Lambda Labs, but our main dataset ( over 40TB) is sitting in AWS S3. The egress fees are high, so we tried to do it off Cloudflare R2. The problem is R2’s TTFB is all over the place, and our data loader is constantly waiting on I/O. Then the GPUs are unused for 20% of the epoch.
Is there a zero-egress alternative that actually has the throughput/latency for high-speed streaming? Or are we stuck building a custom NVMe cache layer?
[link] [comments]



