Rocks, Pebbles and Sand: Modality-aware Scheduling for Multimodal Large Language Model Inference
arXiv cs.AI / 3/30/2026
💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that multimodal LLM inference workloads (text, images, videos) have drastically different resource needs, leading to latency spikes and head-of-line blocking when served by text-optimized systems.
- It introduces a simple “modality as workload size” abstraction—videos as rocks, images as pebbles, and text as sand—to guide scheduling decisions.
- RPS-Serve is proposed as a modality-aware scheduler that classifies requests, dynamically prioritizes them, and uses aging to prevent starvation of heavier workloads.
- Experiments on state-of-the-art MLLMs show RPS-Serve reduces average time-to-first-token (TTFT) by 54% overall and by 78.5% for latency-critical requests versus current serving systems.
- The study frames the result as achieving more “LLM-like” interactive responsiveness for multimodal LLMs while improving overall resource utilization.
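The core scheduling idea in the key points above can be sketched as a small simulation. This is a minimal, hypothetical illustration, not the paper's actual RPS-Serve implementation: it assumes base priorities by modality (text/"sand" first, video/"rocks" last) and a linear aging credit so long-waiting heavy requests eventually rise and are not starved. The class name, priority values, and `AGING_RATE` constant are all invented for illustration.

```python
# Hypothetical base priorities: lower value = served sooner.
# Text ("sand") fills gaps first; video ("rocks") starts heaviest.
BASE_PRIORITY = {"text": 0, "image": 1, "video": 2}
AGING_RATE = 0.5  # assumed priority credit per scheduling tick waited


class ModalityAwareScheduler:
    """Toy sketch of modality-aware scheduling with aging (not RPS-Serve itself)."""

    def __init__(self):
        self._pending = []  # list of (base_priority, arrival_tick, request_id)
        self._tick = 0

    def submit(self, request_id, modality):
        self._pending.append((BASE_PRIORITY[modality], self._tick, request_id))

    def next_request(self):
        if not self._pending:
            return None
        self._tick += 1
        # Effective priority = base priority minus aging credit; lowest runs next.
        # A video that has waited long enough overtakes newly arrived text.
        best = min(
            self._pending,
            key=lambda r: r[0] - AGING_RATE * (self._tick - r[1]),
        )
        self._pending.remove(best)
        return best[2]
```

Under a steady stream of text requests, the early video request is eventually dispatched once its aging credit outweighs its heavier base priority, which is the starvation-prevention behavior the paper's scheduler is described as providing:

```python
sched = ModalityAwareScheduler()
sched.submit("v1", "video")  # heavy "rock" arrives first
order = []
for i in range(6):
    sched.submit(f"t{i}", "text")  # light "sand" keeps arriving
    order.append(sched.next_request())
# "v1" is served after a few ticks instead of waiting forever
```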