vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models
arXiv cs.AI / 3/17/2026
Key Points
- vla-eval is an open-source evaluation harness that decouples model inference from benchmark execution through a WebSocket msgpack protocol, with Docker-based environment isolation.
- Models integrate once by implementing a single `predict()` method, and benchmarks integrate via a four-method interface, enabling a complete cross-evaluation matrix.
- The framework supports 13 simulation benchmarks and six model servers, and requires only two commands to run: `vla eval serve` and `vla eval run`.
- It delivers a 47x throughput improvement via episode sharding and batch inference, enabling 2000 LIBERO episodes to be evaluated in about 18 minutes.
- The authors perform a reproducibility audit across three benchmarks, uncovering undocumented requirements, ambiguous termination semantics, and hidden normalization statistics, and release a VLA leaderboard aggregating 657 published results across 17 benchmarks.
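To make the integration model concrete, here is a minimal sketch of the two integration surfaces described above. Only `predict()` is named in the source; the benchmark-side method names (`reset`, `step`, `is_done`, `score`) and the class names are illustrative assumptions, not vla-eval's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class ModelServer(ABC):
    """A model integrates once by implementing predict()."""

    @abstractmethod
    def predict(self, observation: Dict[str, Any]) -> Dict[str, Any]:
        """Map an observation (e.g. images + instruction) to an action."""


class Benchmark(ABC):
    """A benchmark integrates via a small fixed interface (four methods
    in vla-eval; these names are hypothetical)."""

    @abstractmethod
    def reset(self, episode_id: int) -> Dict[str, Any]: ...

    @abstractmethod
    def step(self, action: Dict[str, Any]) -> Dict[str, Any]: ...

    @abstractmethod
    def is_done(self) -> bool: ...

    @abstractmethod
    def score(self) -> float: ...


def run_episode(model: ModelServer, benchmark: Benchmark, episode_id: int) -> float:
    """Any model crossed with any benchmark yields one cell of the matrix."""
    obs = benchmark.reset(episode_id)
    while not benchmark.is_done():
        obs = benchmark.step(model.predict(obs))
    return benchmark.score()
```

Because every model speaks `predict()` and every benchmark speaks the fixed interface, adding one new model or benchmark extends the whole cross-evaluation matrix rather than a single pairing.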
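The decoupling itself rests on a serialized message protocol between the model server and the benchmark process. The sketch below illustrates the framing idea only; vla-eval uses msgpack over WebSocket, whereas this stand-in uses `json` plus a length prefix purely to stay stdlib-only, and the message fields shown are assumptions.

```python
import json
import struct


def pack_frame(message: dict) -> bytes:
    """Serialize a message and prefix it with a 4-byte big-endian length,
    so the receiver knows where one frame ends and the next begins."""
    payload = json.dumps(message).encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def unpack_frame(frame: bytes) -> dict:
    """Inverse of pack_frame: strip the length prefix and deserialize."""
    (length,) = struct.unpack(">I", frame[:4])
    return json.loads(frame[4 : 4 + length].decode("utf-8"))
```

In the real harness the benchmark side would send an observation frame and block until the model server replies with an action frame, keeping the two processes (and their Docker environments) fully isolated.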