[R] How are you managing long-running preprocessing jobs at scale? Curious what's actually working

Reddit r/MachineLearning / 2026/3/25

💬 Opinion / Developer Stack & Infrastructure / Tools & Practical Usage


Key points

The post describes a common ML workflow pain point: large preprocessing jobs (50–100GB datasets) running on a single machine take hours and are painful to recover from when they fail midway.
The author evaluated orchestration tools such as Prefect and Temporal but found that they seem to require substantial DevOps effort to set up and maintain.
They ask the community whether teams still run preprocessing on single machines or distribute jobs across multiple workers, and what tooling is used if distributing.
The discussion focuses on whether internal job-management systems are worth the overhead and what the biggest operational failure points are in current setups.
Overall, the author wants to know whether the team is approaching the problem the wrong way or whether these operational challenges are shared broadly across ML teams.

We're a small ML team working on a project, and we keep running into the same wall: large preprocessing jobs (think 50–100GB datasets) running on a single machine take hours, and when something fails halfway through, it's painful.

We've looked at Prefect, Temporal, and a few others — but they all feel like they require a full-time DevOps person to set up and maintain properly. And most of our team is focused on the models, not the infrastructure.

Curious how other teams are handling this:

- Are you distributing these jobs across multiple workers, or still running on single machines?

- If you are distributing — what are you using and is it actually worth the setup overhead?

- Has anyone built something internal to handle this, and was it worth it?

- What's the biggest failure point in your current setup?

Trying to figure out if we're solving this the wrong way or if this is just a painful problem everyone deals with. Would love to hear what's actually working for people.
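For readers who want something concrete to compare against, here is a minimal sketch of the "fails halfway" problem handled on a single machine, with no orchestrator involved. It assumes the dataset can be split into independent chunks and that reprocessing a chunk is harmless (idempotent). The names `run_resumable`, `process_chunk`, and `manifest_path` are hypothetical and chosen only for illustration; this is not what the poster's team runs.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_resumable(
    chunk_ids: Iterable[str],
    process_chunk: Callable[[str], None],  # hypothetical per-chunk worker
    manifest_path: Path,                   # hypothetical progress file
) -> list[str]:
    """Process chunks in order, skipping any already recorded in the manifest.

    Returns the chunk ids processed during this call.
    """
    done: set[str] = set()
    if manifest_path.exists():
        done = set(json.loads(manifest_path.read_text()))

    processed_now: list[str] = []
    for chunk_id in chunk_ids:
        if chunk_id in done:
            continue  # finished in an earlier run
        process_chunk(chunk_id)  # may raise; earlier progress stays on disk
        done.add(chunk_id)
        processed_now.append(chunk_id)
        _write_atomically(manifest_path, json.dumps(sorted(done)))
    return processed_now


def _write_atomically(path: Path, text: str) -> None:
    """Write to a temp file in the same directory, then rename over the target,
    so a crash never leaves a half-written manifest."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        tmp.write(text)
    os.replace(tmp_name, path)
```

If the job dies on one chunk, rerunning the same call picks up at that chunk instead of starting over. A crash between finishing a chunk and writing the manifest means that one chunk runs again, which is why the sketch assumes idempotent chunks. It does not address distributing work across machines, which is the other half of the question above.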

submitted by /u/krishnatamakuwala
