[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.

Reddit r/MachineLearning / 4/2/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The PhAIL (phail.ai) project introduces an open benchmark that evaluates VLA robot models on real DROID hardware for bin-to-bin order picking using production metrics like Units Per Hour (UPH) and Mean Time Between Failures (MTBF).
  • In blind tests across four fine-tuned models, the best autonomous model achieves only about 5% of human throughput, while MTBF shows failures are frequent enough that autonomy requires near-continuous babysitting.
  • A human teleoperation baseline on the same robot is far higher (UPH 330 vs. 18–65 for models), suggesting the dominant gap is policy quality rather than the robot’s physical capability.
  • The benchmark is designed for transparency and improvement: every run includes public synced video and telemetry, along with open datasets/training scripts and a submission pathway for new checkpoints.
  • The project plans to expand evaluations (e.g., adding NVIDIA DreamZero) and invites the community to propose additional real-world manipulation tasks beyond pick-and-place.

I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not demos, not simulation, not success rates on 10 tries. Actual production metrics on real hardware.

I couldn't find honest numbers anywhere, so I built a benchmark.

Setup: DROID platform, bin-to-bin order picking – one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running). We measure Units Per Hour (UPH) and Mean Time Between Failures (MTBF) – the metrics operations people actually use.
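For readers unfamiliar with these operations metrics, here is a minimal sketch of how UPH and MTBF fall out of per-run telemetry. The `Run` schema and the numbers are illustrative, not PhAIL's actual data format:

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One evaluation episode (hypothetical schema, not PhAIL's telemetry format)."""
    units_picked: int    # successful bin-to-bin transfers
    duration_min: float  # wall-clock episode length
    failures: int        # human interventions required during the run

def uph(runs: list[Run]) -> float:
    """Units Per Hour: total units over total wall-clock time."""
    total_units = sum(r.units_picked for r in runs)
    total_hours = sum(r.duration_min for r in runs) / 60.0
    return total_units / total_hours

def mtbf_min(runs: list[Run]) -> float:
    """Mean Time Between Failures, in minutes: total runtime over total failures."""
    total_min = sum(r.duration_min for r in runs)
    total_failures = sum(r.failures for r in runs)
    return total_min / total_failures

runs = [Run(units_picked=13, duration_min=12.0, failures=3),
        Run(units_picked=9, duration_min=8.0, failures=2)]
print(round(uph(runs), 1))       # 66.0 UPH
print(round(mtbf_min(runs), 1))  # 4.0 min
```

Note that both metrics pool across runs before dividing, rather than averaging per-run ratios, so short and long episodes are weighted by their actual duration.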

Results (full data with video and telemetry for every run at phail.ai):

| Model | UPH | MTBF |
|---|---|---|
| OpenPI (pi0.5) | 65 | 4.0 min |
| GR00T | 60 | 3.5 min |
| ACT | 44 | 2.8 min |
| SmolVLA | 18 | 1.2 min |
| Teleop / Finetuning (human controlling same robot) | 330 | — |
| Human hands | 1,331 | — |

The difference between OpenPI and GR00T is not statistically significant at current episode counts – we're collecting more runs.

The teleop baseline is the fairer comparison: same hardware, human in the loop. That's a 5x gap, and it's almost entirely policy quality – the robot can physically move much faster than any model commands it to. The human-hands number is what warehouse operators compare against when deciding whether to deploy.

The MTBF numbers are arguably more telling than UPH. At 4 minutes between failures, "autonomous operation" means a full-time babysitter. Reliability needs to cross a threshold before autonomy has economic value.
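To make the "full-time babysitter" point concrete, here is a back-of-the-envelope sketch. The operator capacity number is an assumption for illustration, not a measured figure from the benchmark:

```python
def interventions_per_hour(mtbf_min: float) -> float:
    """How often a human must step in, given MTBF in minutes."""
    return 60.0 / mtbf_min

def robots_per_operator(mtbf_min: float, handled_per_hour: float = 30.0) -> int:
    """Robots one operator can supervise, assuming they can clear
    `handled_per_hour` interventions per hour (an illustrative assumption)."""
    return max(1, int(handled_per_hour // interventions_per_hour(mtbf_min)))

print(interventions_per_hour(4.0))  # 15.0 — best model: a failure every 4 min
print(robots_per_operator(4.0))     # 2
print(robots_per_operator(1.2))     # 1 — worst case: one babysitter per robot
```

Under these assumptions, even the best model ties up one operator for every couple of robots; the autonomy only starts paying for itself once MTBF grows enough that one person can cover a whole fleet.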

Every run is public with synced video and telemetry. Fine-tuning dataset, training scripts, and submission pathway are all open. If you think your model or fine-tuning recipe can do better, submit a checkpoint.

What models are we missing? We're adding NVIDIA DreamZero next. If you have a checkpoint that works on DROID hardware, submit it – or tell us what you'd want to see evaluated. What tasks beyond pick-and-place would be the real test for general-purpose manipulation?

submitted by /u/svertix