[P] PhAIL (phail.ai) – an open benchmark for robot AI on real hardware. Best model: 5% of human throughput, needs help every 4 minutes.

Reddit r/MachineLearning / 4/2/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • The PhAIL (phail.ai) project introduces an open benchmark that evaluates VLA robot models on real DROID hardware for bin-to-bin order picking using production metrics like Units Per Hour (UPH) and Mean Time Between Failures (MTBF).
  • In blind tests across four fine-tuned models, the best autonomous model achieves only about 5% of human throughput, while MTBF shows failures are frequent enough that autonomy requires near-continuous babysitting.
  • A human teleoperation baseline on the same robot is far higher (UPH 330 vs. 18–65 for models), suggesting the dominant gap is policy quality rather than the robot’s physical capability.
  • The benchmark is designed for transparency and improvement: every run includes public synced video and telemetry, along with open datasets/training scripts and a submission pathway for new checkpoints.
  • The project plans to expand evaluations (e.g., adding NVIDIA DreamZero) and invites the community to propose additional real-world manipulation tasks beyond pick-and-place.

I spent the last year trying to answer a simple question: how good are VLA models on real commercial tasks? Not demos, not simulation, not success rates on 10 tries. Actual production metrics on real hardware.

I couldn't find honest numbers anywhere, so I built a benchmark.

Setup: DROID platform, bin-to-bin order picking – one of the most common warehouse and industrial operations. Four models fine-tuned on the same real-robot dataset, evaluated blind (the operator doesn't know which model is running). We measure Units Per Hour (UPH) and Mean Time Between Failures (MTBF) – the metrics operations people actually use.
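For readers unfamiliar with these operations metrics, here is a minimal sketch of how UPH and MTBF fall out of per-run telemetry. The `Run` schema and the numbers are illustrative, not PhAIL's actual data format:

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One evaluation episode (hypothetical schema, not PhAIL's telemetry format)."""
    units_picked: int    # successful bin-to-bin transfers
    duration_min: float  # wall-clock episode length
    failures: int        # human interventions required during the run

def uph(runs: list[Run]) -> float:
    """Units Per Hour: total units over total wall-clock time."""
    total_units = sum(r.units_picked for r in runs)
    total_hours = sum(r.duration_min for r in runs) / 60.0
    return total_units / total_hours

def mtbf_min(runs: list[Run]) -> float:
    """Mean Time Between Failures, in minutes: total runtime over total failures."""
    total_min = sum(r.duration_min for r in runs)
    total_failures = sum(r.failures for r in runs)
    return total_min / total_failures

runs = [Run(units_picked=13, duration_min=12.0, failures=3),
        Run(units_picked=9, duration_min=8.0, failures=2)]
print(round(uph(runs), 1))       # 66.0 UPH
print(round(mtbf_min(runs), 1))  # 4.0 min
```

Note that both metrics pool across runs before dividing, rather than averaging per-run ratios, so short and long episodes are weighted by their actual duration.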

Results (full data with video and telemetry for every run at phail.ai):

| Model | UPH | MTBF |
|---|---|---|
| OpenPI (pi0.5) | 65 | 4.0 min |
| GR00T | 60 | 3.5 min |
| ACT | 44 | 2.8 min |
| SmolVLA | 18 | 1.2 min |
| Teleop / Finetuning (human controlling same robot) | 330 | — |
| Human hands | 1,331 | — |

The difference between OpenPI and GR00T is not statistically significant at current episode counts – we're collecting more runs.

The teleop baseline is the fairer comparison: same hardware, human in the loop. That's a 5x gap, and it's almost entirely policy quality – the robot can physically move much faster than any model commands it to. The human-hands number is what warehouse operators compare against when deciding whether to deploy.

The MTBF numbers are arguably more telling than UPH. At 4 minutes between failures, "autonomous operation" means a full-time babysitter. Reliability needs to cross a threshold before autonomy has economic value.
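To make the "full-time babysitter" point concrete, here is a back-of-the-envelope sketch. The operator capacity number is an assumption for illustration, not a measured figure from the benchmark:

```python
def interventions_per_hour(mtbf_min: float) -> float:
    """How often a human must step in, given MTBF in minutes."""
    return 60.0 / mtbf_min

def robots_per_operator(mtbf_min: float, handled_per_hour: float = 30.0) -> int:
    """Robots one operator can supervise, assuming they can clear
    `handled_per_hour` interventions per hour (an illustrative assumption)."""
    return max(1, int(handled_per_hour // interventions_per_hour(mtbf_min)))

print(interventions_per_hour(4.0))  # 15.0 — best model: a failure every 4 min
print(robots_per_operator(4.0))     # 2
print(robots_per_operator(1.2))     # 1 — worst case: one babysitter per robot
```

Under these assumptions, even the best model ties up one operator for every couple of robots; the autonomy only starts paying for itself once MTBF grows enough that one person can cover a whole fleet.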

Every run is public with synced video and telemetry. Fine-tuning dataset, training scripts, and submission pathway are all open. If you think your model or fine-tuning recipe can do better, submit a checkpoint.

What models are we missing? We're adding NVIDIA DreamZero next. If you have a checkpoint that works on DROID hardware, submit it – or tell us what you'd want to see evaluated. What tasks beyond pick-and-place would be the real test for general-purpose manipulation?

submitted by /u/svertix