LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks
arXiv cs.RO / 4/21/2026
Key Points
- The paper introduces LongBench, a real-world benchmark with 1,000+ robotic manipulation episodes to study why long-horizon policies degrade during extended execution.
- LongBench covers two evaluation regimes—Context-Independent (fully observable) and Context-Dependent (ambiguity-driven)—to separate different sources of temporal difficulty.
- The benchmark organizes tasks into capability- and ambiguity-specific subsets, enabling mechanism-aware analysis of robustness, temporal consistency, and context-dependent reasoning.
- Experiments with six state-of-the-art policies show that long-horizon performance is influenced by multiple factors rather than a single dominant cause.
- In fully observable settings, execution robustness correlates most strongly with performance; context-dependent difficulty varies by task and is not consistently mitigated by memory-based methods.