[R] How are you managing long-running preprocessing jobs at scale? Curious what's actually working

Reddit r/MachineLearning / 2026/3/25


Key points

  • The post describes a common ML workflow pain point: large (50–100GB) preprocessing jobs running on a single machine can take hours and are especially painful to recover from when they fail mid-way.
  • The author notes they evaluated orchestration tools like Prefect and Temporal but found they seem to require substantial DevOps effort to set up and maintain.
  • They ask the community whether teams still run preprocessing on single machines or distribute jobs across multiple workers, and what tooling is used if distributing.
  • The discussion focuses on whether internal job-management systems are worth the overhead and what the biggest operational failure points are in current setups.
  • Overall, the goal is to determine whether the team is approaching the problem the wrong way or whether these operational challenges are broadly shared across ML teams.

We're a small ML team working on a project and we keep running into the same wall: large preprocessing jobs (think 50–100GB datasets) running on a single machine take hours, and when something fails halfway through, it's painful to recover.
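To make the recovery problem concrete, here is a minimal, hypothetical sketch (not the setup described in this post) of chunk-level checkpointing on a single machine. It assumes the dataset can be split into independent chunks. `process_chunk`, `run_resumable`, the `fail_on` parameter, and the file naming are all invented for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def process_chunk(chunk_id: int) -> dict:
    """Hypothetical stand-in for one unit of real preprocessing work."""
    return {"chunk": chunk_id, "rows": chunk_id * 10}


def run_resumable(chunk_ids, out_dir: Path, fail_on=None) -> list:
    """Process chunks, skipping any whose output file already exists.

    `fail_on` simulates a crash on one chunk (illustration only).
    Returns the chunk ids processed during this call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for cid in chunk_ids:
        target = out_dir / f"chunk_{cid:05d}.json"
        if target.exists():
            continue  # already finished in an earlier run
        if cid == fail_on:
            raise RuntimeError(f"simulated failure on chunk {cid}")
        result = process_chunk(cid)
        # Write to a temp file and rename it into place, so a crash
        # mid-write never leaves a partial file that looks complete.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(result))
        os.replace(tmp, target)
        processed.append(cid)
    return processed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        out = Path(d)
        try:
            run_resumable(range(6), out, fail_on=3)  # dies partway through
        except RuntimeError as exc:
            print("first run:", exc)
        # The rerun picks up where the first run stopped.
        print("second run processed:", run_resumable(range(6), out))
```

With this pattern a failure costs one chunk of work rather than the whole job. It does not address distributing work across machines.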

We've looked at Prefect, Temporal, and a few others — but they all feel like they require a full-time DevOps person to set up and maintain properly. And most of our team is focused on the models, not the infrastructure.

Curious how other teams are handling this:

- Are you distributing these jobs across multiple workers, or still running on single machines?

- If you are distributing — what are you using and is it actually worth the setup overhead?

- Has anyone built something internal to handle this, and was it worth it?

- What's the biggest failure point in your current setup?

Trying to figure out if we're solving this the wrong way or if this is just a painful problem everyone deals with. Would love to hear what's actually working for people.

submitted by /u/krishnatamakuwala