SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

arXiv cs.CV / 4/23/2026


Key Points

  • The paper introduces SceneOrchestra, a trainable framework for agentic 3D scene synthesis that improves upon LLM-orchestrated tool workflows that rely on step-by-step execute–review–reflect loops.
  • It identifies two main shortcomings in prior methods: heuristic-driven next-tool/parameter decisions that can waste calls and reduce quality, and added latency from rendering and reviewing intermediate outputs after every step.
  • SceneOrchestra optimizes the entire tool-call execution flow: the orchestrator generates a complete tool-call trajectory in one shot, and a discriminator, used only during training, learns to evaluate full trajectories and pick the best of several candidates.
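  • Training proceeds in two phases. First, the orchestrator learns context-aware tool selection and complete trajectory generation while the discriminator learns to assess trajectory quality. Second, interleaved training lets the discriminator adapt to the orchestrator's evolving trajectory distribution and distill its judgment back into the orchestrator. At inference, only the orchestrator runs.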
  • Experiments report state-of-the-art 3D scene quality together with lower runtime than previous methods.
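
The latency argument in the key points can be made concrete with a small sketch. The Python below is illustrative only and is not the paper's implementation: the `Scene` and `ToolCall` types, the two toy tools, and the `render` counter are all hypothetical stand-ins. It contrasts a step-by-step loop, which renders after every tool call so the next call can be chosen, with one-shot execution of a complete trajectory, which never renders intermediate results.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ToolCall:
    """One tool invocation: a tool name plus keyword parameters (hypothetical)."""
    tool: str
    params: Dict[str, str] = field(default_factory=dict)


@dataclass
class Scene:
    """Toy scene state; render_count tracks how often we paid for a render."""
    objects: List[str] = field(default_factory=list)
    render_count: int = 0


# Hypothetical toy tools standing in for real generation/editing tools.
def add_object(scene: Scene, name: str) -> None:
    scene.objects.append(name)


def remove_object(scene: Scene, name: str) -> None:
    if name in scene.objects:
        scene.objects.remove(name)


TOOLS: Dict[str, Callable[..., None]] = {
    "add_object": add_object,
    "remove_object": remove_object,
}


def render(scene: Scene) -> str:
    """Stand-in for an expensive render-and-review step."""
    scene.render_count += 1
    return ", ".join(scene.objects)


def run_step_by_step(
    scene: Scene,
    choose_next: Callable[[str], Optional[ToolCall]],
    max_steps: int = 10,
) -> Scene:
    """Execute-review-reflect style: render, pick one call, execute, repeat."""
    observation = render(scene)
    for _ in range(max_steps):
        call = choose_next(observation)
        if call is None:  # the planner decided the scene is finished
            break
        TOOLS[call.tool](scene, **call.params)
        observation = render(scene)
    return scene


def run_full_trajectory(scene: Scene, trajectory: List[ToolCall]) -> Scene:
    """One-shot style: the whole trajectory is known up front, no renders."""
    for call in trajectory:
        TOOLS[call.tool](scene, **call.params)
    return scene


plan = [
    ToolCall("add_object", {"name": "bed"}),
    ToolCall("add_object", {"name": "lamp"}),
    ToolCall("add_object", {"name": "desk"}),
]

# Step-by-step: a trivial planner that replays the same plan one call at a time.
remaining = iter(plan)
looped = run_step_by_step(Scene(), lambda observation: next(remaining, None))

# One-shot: the same plan executed as a complete trajectory.
one_shot = run_full_trajectory(Scene(), plan)

print(looped.objects == one_shot.objects)          # same final scene
print(looped.render_count, one_shot.render_count)  # renders paid by each mode
```

In this toy setup both modes reach the same scene, but the looped version pays one render per step plus an initial one. The sketch says nothing about quality, which in the paper depends on how well the trained orchestrator plans the trajectory.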

Abstract

Recent agentic frameworks for 3D scene synthesis have advanced realism and diversity by integrating heterogeneous generation and editing tools. These tools are organized into workflows orchestrated by an off-the-shelf LLM. Current approaches typically adopt an execute-review-reflect loop: at each step, the orchestrator executes a tool, renders intermediate results for review, and then decides on the tool and its parameters for the next step. However, this design has two key limitations. First, next-step tool selection and parameter configuration are driven by heuristic rules, which can lead to suboptimal execution flows, unnecessary tool invocations, degraded output quality, and increased runtime. Second, rendering and reviewing intermediate results after each step introduces additional latency. To address these issues, we propose SceneOrchestra, a trainable orchestration framework that optimizes the tool-call execution flow and eliminates the step-by-step review loop, improving both efficiency and output quality. SceneOrchestra consists of an orchestrator and a discriminator, which we fine-tune with a two-phase training strategy. In the first phase, the orchestrator learns context-aware tool selection and complete tool-call trajectory generation, while the discriminator is trained to assess the quality of full trajectories, enabling it to select the best trajectory from multiple candidates. In the second phase, we perform interleaved training, where the discriminator adapts to the orchestrator's evolving trajectory distribution and distills its discriminative capability back into the orchestrator. At inference, we only use the orchestrator to generate and execute full tool-call trajectories from instructions, without requiring the discriminator. Extensive experiments show that our method achieves state-of-the-art scene quality while reducing runtime compared to previous work.
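
The abstract describes the discriminator as scoring full trajectories so that the best one can be selected from multiple candidates. A minimal sketch of that selection step follows, under loud assumptions: `toy_score` is a hand-written heuristic standing in for the learned discriminator, the `(tool, object)` trajectory encoding is invented for illustration, and how the selected trajectory feeds back into orchestrator training is not specified in the abstract and is not modeled here.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical encoding: a trajectory is a list of (tool name, object name).
Trajectory = List[Tuple[str, str]]


def toy_score(trajectory: Trajectory) -> float:
    """Stand-in for a learned discriminator (assumption, not the paper's model).

    Rewards objects that survive to the final scene and applies a small
    penalty per tool call, so wasted calls lower the score.
    """
    placed = set()
    for tool, obj in trajectory:
        if tool == "add":
            placed.add(obj)
        elif tool == "remove":
            placed.discard(obj)
    return len(placed) - 0.1 * len(trajectory)


def select_best(
    candidates: Sequence[Trajectory],
    score: Callable[[Trajectory], float],
) -> Trajectory:
    """Return the highest-scoring full trajectory among the candidates."""
    if not candidates:
        raise ValueError("need at least one candidate trajectory")
    return max(candidates, key=score)


candidates: List[Trajectory] = [
    # Two useful calls, nothing wasted.
    [("add", "bed"), ("add", "lamp")],
    # Same final scene, but with an add/remove pair that achieves nothing.
    [("add", "bed"), ("add", "chair"), ("remove", "chair"), ("add", "lamp")],
    # Short but incomplete.
    [("add", "bed")],
]

best = select_best(candidates, toy_score)
print(best)
```

Scoring whole trajectories, rather than reviewing each intermediate render, is what lets selection happen without a per-step review loop. According to the abstract, this selection is part of training only; at inference the orchestrator generates and executes a single trajectory without the discriminator.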