AI Navigate

Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

arXiv cs.AI / 3/16/2026


Key Points

  • The paper analyzes multimodal LLM inference and shows that partitioning at the vision encoder–language model boundary minimizes cross-device transfer among all partition points that preserve stage-based execution.
  • It introduces HeteroServe, a phase-aware runtime for modality-level partitioning and cross-tier scheduling, achieving up to 54% throughput gains on 4x A100 and 37% tokens-per-dollar improvement under a fixed budget.
  • The approach reduces inter-device data transfer from O(L * s_ctx) to O(N_v * d), enabling cost-effective PCIe-based deployment instead of high-bandwidth interconnects like NVLink.
  • Evaluation on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0 confirms the analysis: the transfer-reduction advantage grows with model depth, and observed cost savings (40.6%) exceed the cost model's prediction (31.4%).
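
To make the O(L * s_ctx) versus O(N_v * d) comparison concrete, here is a back-of-envelope sketch in Python. The parameters are illustrative assumptions roughly matching a 7B-class MLLM (32 layers, hidden size 4096, fp16, 576 vision tokens), not figures taken from the paper:

```python
# Back-of-envelope transfer sizes for the two partition strategies.
# All parameters are illustrative assumptions for a ~7B MLLM, not paper values.
L = 32        # transformer depth (language-model layers)
d = 4096      # hidden size
s_ctx = 2048  # context length (tokens) at the partition point
N_v = 576     # vision tokens (e.g. a 24x24 patch grid)
BYTES = 2     # fp16

# Stage-level disaggregation must move the KV cache: K and V, per layer, per token.
kv_cache_bytes = 2 * L * s_ctx * d * BYTES  # O(L * s_ctx)

# Modality-level disaggregation moves only the vision embeddings.
embedding_bytes = N_v * d * BYTES           # O(N_v * d)

print(f"KV cache:   {kv_cache_bytes / 2**30:.2f} GiB")   # → 1.00 GiB
print(f"Embeddings: {embedding_bytes / 2**20:.2f} MiB")  # → 4.50 MiB
print(f"Reduction:  {kv_cache_bytes / embedding_bytes:.0f}x")
```

The roughly 200x gap is what moves the transfer from NVLink territory into comfortable PCIe range, and it widens linearly as L grows.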

Abstract

Multimodal large language model (MLLM) inference splits into two phases with opposing hardware demands: vision encoding is compute-bound, while language generation is memory-bandwidth-bound. We show that under standard transformer KV caching, the modality boundary (between vision encoder and language model) minimizes cross-device transfer among all partition points that preserve standard stage-based execution. Partitioning here reduces transfer complexity from O(L * s_ctx) bytes (GB-scale KV caches under stage-level disaggregation) to O(N_v * d) bytes (MB-scale embeddings), an O(L) reduction where L is the transformer depth. The result holds across attention mechanisms (MHA/GQA), dynamic vision resolutions, and model scales, and the advantage grows as models deepen. A direct implication is that existing stage-level disaggregation systems are constrained to high-bandwidth interconnects (e.g., NVLink), whereas modality-level disaggregation enables cross-tier heterogeneous serving over commodity PCIe. A closed-form cost model shows that heterogeneous deployment is cost-optimal under phase-separable workloads (predicts 31.4% savings; observed 40.6%). We build HeteroServe, a phase-aware runtime with modality-level partitioning and cross-tier scheduling, and evaluate it on LLaVA-1.5-7B and Qwen2.5-VL against vLLM v0.3.0. On identical 4xA100 hardware, engine optimizations raise throughput by up to 54%. Under a fixed budget, a heterogeneous cluster ($38k) improves Tokens/$ by 37% over a homogeneous baseline ($64k) without degrading latency.
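
The fixed-budget claim can be sanity-checked with simple arithmetic. The cluster prices and the 37% gain come from the abstract; the break-even derivation below is my own illustration:

```python
# Tokens/$ comparison under the abstract's cluster prices.
hetero_cost = 38_000  # heterogeneous cluster ($)
homo_cost = 64_000    # homogeneous baseline ($)
improvement = 1.37    # reported Tokens/$ gain

# Tokens/$ = throughput / cost, so the relative throughput the heterogeneous
# cluster needs (vs. the homogeneous one) to deliver a 37% Tokens/$ gain is:
required_rel_throughput = improvement * hetero_cost / homo_cost
print(f"{required_rel_throughput:.2f}")  # → 0.81
```

In other words, at 59% of the hardware cost, the heterogeneous cluster only needs about 81% of the homogeneous cluster's raw throughput to come out 37% ahead on Tokens/$, which is why phase-aware scheduling across cheaper tiers can win without matching top-tier throughput.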