SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
arXiv cs.CV / 4/24/2026
Key Points
- SpatiO introduces a heterogeneous multi-agent framework for spatial reasoning in vision-language tasks, designed to cope with the fact that depth, geometry, and 2D appearance cues vary in reliability from one context to another.
- The paper proposes Test-Time Orchestration (TTO), an inference-time optimization that dynamically evaluates and reweights different specialized agents without updating model parameters.
- By coordinating multiple “vision-language specialists” with complementary inductive biases, SpatiO aims to overcome limitations of single-pipeline methods that implicitly fix a spatial prior.
- Experiments across several spatial reasoning benchmarks (3DSRBench, STVQA-7k, CV-Bench, Omni3D-Bench) show consistent performance gains over both closed-source and open-source baselines.
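
To make the reweighting idea in the key points concrete, here is a minimal sketch of inference-time agent reweighting with no parameter updates. It is not the paper's algorithm: the summary does not say how TTO scores agents, so this sketch assumes agreement with the weighted consensus as a reliability proxy and uses a standard multiplicative-weights update. The names `weighted_vote`, `reweight`, `orchestrate`, and the parameter `eta` are hypothetical, and each agent is assumed to be a plain callable that maps a query to an answer string.

```python
import math
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer with the largest total agent weight."""
    totals = defaultdict(float)
    for agent, answer in answers.items():
        totals[answer] += weights[agent]
    return max(totals, key=totals.get)


def reweight(answers, weights, consensus, eta=0.5):
    """Multiplicative-weights step: shrink agents that disagree with the consensus.

    Assumption: agreement with the consensus stands in for reliability.
    """
    updated = {
        agent: w * (1.0 if answers[agent] == consensus else math.exp(-eta))
        for agent, w in weights.items()
    }
    total = sum(updated.values())
    return {agent: w / total for agent, w in updated.items()}


def orchestrate(queries, agents):
    """Answer each query by weighted vote, adapting agent weights as it goes.

    Only the orchestration weights change; the agents themselves are untouched.
    """
    weights = {name: 1.0 / len(agents) for name in agents}
    outputs = []
    for query in queries:
        answers = {name: fn(query) for name, fn in agents.items()}
        consensus = weighted_vote(answers, weights)
        outputs.append(consensus)
        weights = reweight(answers, weights, consensus)
    return outputs, weights
```

In this sketch an agent that keeps disagreeing with the others loses influence over successive queries, which is one simple way to avoid committing to a single fixed spatial prior.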