One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries
arXiv cs.CL · March 13, 2026
Key Points
- The paper proposes a central Supervisor architecture for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities.
- It employs RouteLLM for learned routing of text queries and small-language-model (SLM)-assisted modality decomposition for non-text paths, dynamically assigning subtasks to the appropriate tools.
- Evaluation on 2,847 queries across 15 task categories shows a 72% reduction in time-to-accurate-answer, an 85% reduction in conversational rework, and a 67% reduction in cost compared with a matched hierarchical baseline.
- The results indicate that centralized orchestration can substantially improve multimodal AI deployment economics while preserving accuracy parity.
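The orchestration pattern described above can be sketched as a single Supervisor entry point that routes text queries through a learned router and decomposes non-text queries into per-modality tool subtasks. This is an illustrative reconstruction, not the paper's implementation: the learned router is stubbed with a keyword heuristic standing in for RouteLLM-style routing, and all tool names, model names, and the decomposition table are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    modality: str  # "text", "image", "audio", "video", or "document"

def route_text(query: Query) -> str:
    """Stand-in for a learned router (RouteLLM-style): send hard
    queries to a strong model, the rest to a cheap one.
    The keyword heuristic here is purely illustrative."""
    hard = any(w in query.text.lower() for w in ("prove", "derive", "multi-step"))
    return "strong-llm" if hard else "cheap-llm"

def decompose(query: Query) -> list[str]:
    """Stand-in for SLM-assisted modality decomposition: map a
    non-text query to an ordered list of tool subtasks.
    The tool inventory below is a hypothetical example."""
    tools = {
        "image": ["caption", "ocr"],
        "audio": ["transcribe"],
        "video": ["keyframe-extract", "caption"],
        "document": ["layout-parse", "ocr"],
    }
    return tools.get(query.modality, [])

def supervise(query: Query) -> dict:
    """Central Supervisor: one entry point, per-modality dispatch."""
    if query.modality == "text":
        return {"route": route_text(query), "subtasks": []}
    return {"route": "multimodal-pipeline", "subtasks": decompose(query)}
```

For example, `supervise(Query("summarize this memo", "text"))` would dispatch to the cheap text model, while a video query would be routed to the multimodal pipeline with keyframe extraction and captioning subtasks. Keeping this dispatch decision in one place is what lets a centralized design trade cost against capability per query, which is the mechanism behind the cost reductions the paper reports.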
