SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

arXiv cs.CL · March 25, 2026


Key Points

  • SpecEyes targets the high latency (“agentic depth”) in agentic multimodal LLMs caused by cascaded perception–reasoning–tool-calling loops.
  • The method uses a lightweight, tool-free MLLM as a speculative planner to predict an execution trajectory, allowing early termination of expensive tool chains when they are unlikely to be needed.
  • It introduces a cognitive gating mechanism based on answer separability to decide when to trust self-verification, avoiding reliance on oracle labels.
  • SpecEyes adds a heterogeneous parallel funnel that runs the small model’s speculative steps concurrently while the large model remains serial, improving end-to-end throughput.
  • Experiments on V* Bench, HR-Bench, and POPE report 1.1–3.35x speedups with accuracy preserved or improved (up to +6.7%), particularly benefiting concurrent serving workloads.
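The gating idea in the points above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`separability`, `answer`) and the margin-based definition of answer separability (gap between the two most likely candidate answers) are assumptions made for clarity, and the threshold is arbitrary.

```python
# Hypothetical sketch of SpecEyes-style speculative planning with a
# separability-based cognitive gate. All names and the exact
# separability definition are illustrative assumptions.

def separability(probs):
    """Margin between the two most likely candidate answers.

    A large margin means the draft model's answer is well separated
    from the alternatives, so self-verification can be trusted
    without oracle labels.
    """
    top = sorted(probs, reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)

def answer(question, draft_model, agentic_model, threshold=0.5):
    # 1) Cheap, tool-free draft pass predicts an answer distribution.
    probs = draft_model(question)
    # 2) Cognitive gate: accept the speculative answer only when the
    #    candidates are clearly separable; otherwise fall back to the
    #    full perception-reasoning-tool-calling loop.
    if separability(probs) >= threshold:
        return max(range(len(probs)), key=probs.__getitem__)  # early exit
    return agentic_model(question)  # expensive serial tool chain
```

With a confident draft distribution the expensive agentic chain is skipped entirely; with a flat one, the request falls through unchanged, which is how accuracy can be preserved while latency drops.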

Abstract

Agentic multimodal large language models (MLLMs) (e.g., OpenAI o3 and Gemini Agentic Vision) achieve remarkable reasoning capabilities through iterative visual tool invocation. However, the cascaded perception, reasoning, and tool-calling loops introduce significant sequential overhead. This overhead, termed agentic depth, incurs prohibitive latency and severely limits system-level concurrency. To address this, we propose SpecEyes, an agentic-level speculative acceleration framework that breaks this sequential bottleneck. Our key insight is that a lightweight, tool-free MLLM can serve as a speculative planner to predict the execution trajectory, enabling early termination of expensive tool chains without sacrificing accuracy. To regulate this speculative planning, we introduce a cognitive gating mechanism based on answer separability, which quantifies the model's confidence for self-verification without requiring oracle labels. Furthermore, we design a heterogeneous parallel funnel that exploits the stateless concurrency of the small model to mask the stateful serial execution of the large model, maximizing system throughput. Extensive experiments on V* Bench, HR-Bench, and POPE demonstrate that SpecEyes achieves 1.1-3.35x speedup over the agentic baseline while preserving or even improving accuracy (up to +6.7%), thereby boosting serving throughput under concurrent workloads.
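The "heterogeneous parallel funnel" in the abstract can be sketched at the serving level: stateless draft passes for many requests run concurrently, and only requests that fail the gate fall through to the stateful, serially executed large model. This is a simplified assumption of the design, not the paper's system; the function names (`serve`, `gate`) are illustrative.

```python
# Hypothetical sketch of a heterogeneous parallel funnel: concurrent
# stateless drafts feeding a serial agentic fallback. Names and
# structure are illustrative assumptions, not the SpecEyes codebase.
from concurrent.futures import ThreadPoolExecutor

def serve(requests, draft_model, agentic_model, gate, workers=8):
    # Stateless draft passes are independent, so they parallelize
    # freely across requests (the wide mouth of the funnel).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        drafts = list(pool.map(draft_model, requests))

    # Only gated-out requests reach the stateful large model, which
    # must execute its tool-calling loop serially (the funnel's neck).
    results = []
    for request, draft in zip(requests, drafts):
        results.append(draft if gate(draft) else agentic_model(request))
    return results
```

The throughput gain under concurrent workloads follows from the funnel shape: the cheap parallel stage absorbs most requests, so the serial stage's queue shrinks.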
