SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

arXiv cs.CV / 4/24/2026

📰 NewsModels & Research

共有:

Key Points

SpatiO introduces a heterogeneous multi-agent framework for spatial reasoning in vision-language tasks, designed to better handle varying reliability of depth, geometry, and 2D appearance cues across contexts.
The paper proposes Test-Time Orchestration (TTO), an inference-time optimization that dynamically evaluates and reweights different specialized agents without updating model parameters.
By coordinating multiple “vision-language specialists” with complementary inductive biases, SpatiO aims to overcome limitations of single-pipeline methods that implicitly fix a spatial prior.
Experiments across several spatial reasoning benchmarks (3DSRBench, STVQA-7k, CV-Bench, Omni3D-Bench) show consistent performance gains over both closed-source and open-source baselines.

Abstract

Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires \emph{spatial adaptability}: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce \textbf{\textsc{SpatiO}}, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose \textbf{Test-Time Orchestration (TTO)}, an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that \textsc{SpatiO} consistently improves spatial reasoning performance over both closed-source and open-source baselines.

GPT-5.5 is here. So is DeepSeek V4. And honestly, I am tired of version numbers.

Dev.to

I Built an AI Image Workflow with GPT Image 2.0 (+ Fixing Its Biggest Flaw)

Dev.to

Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF

Reddit r/LocalLLaMA

Building a Visual Infrastructure Layer: How We’re Solving the "Visual Trust Gap" for E-com

Dev.to

DeepSeek-V4 Runs on Huawei Ascend Chips at 85% Utilization — Here's What That Means for AI Infrastructure and Pricing

Dev.to

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning

Key Points

Abstract

Related Articles

GPT-5.5 is here. So is DeepSeek V4. And honestly, I am tired of version numbers.

I Built an AI Image Workflow with GPT Image 2.0 (+ Fixing Its Biggest Flaw)

Max-and-Omnis/Nemotron-3-Super-64B-A12B-Math-REAP-GGUF

Building a Visual Infrastructure Layer: How We’re Solving the "Visual Trust Gap" for E-com

DeepSeek-V4 Runs on Huawei Ascend Chips at 85% Utilization — Here's What That Means for AI Infrastructure and Pricing

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer