Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

arXiv cs.AI / 4/30/2026


Key Points

  • The paper addresses the difficulty of running Vision-Language Models (VLMs) on resource-constrained edge devices and the latency costs of sending raw images to the cloud over limited bandwidth.
  • It proposes a progressive semantic communication framework that compresses visual tokens into adaptive, progressively refinable representations using a Meta AutoEncoder (see the sketch after this list).
  • The approach is designed to work as a “plug-and-play” layer with off-the-shelf VLMs without requiring additional fine-tuning.
  • By transmitting information at different semantic levels, the system enables a tunable trade-off between communication cost and semantic fidelity under changing network conditions.
  • Experiments on an end-to-end edge-cloud setup (an NXP i.MX95 edge device and a GPU server) show substantially lower latency at a 1 Mbps uplink than full-edge or full-cloud baselines while preserving high semantic consistency under heavy compression; the authors plan to release code.
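
The summary does not specify how the Meta AutoEncoder is built, so below is a minimal PyTorch sketch of one way progressively refinable codes can work: a matryoshka-style bottleneck in which any prefix of the latent vector is itself a valid, coarser code. The `ProgressiveAutoEncoder` class, its dimensions, and the level schedule are illustrative assumptions, not the paper's actual architecture:

```python
# Sketch of prefix-truncatable latent compression for visual tokens.
# All names and sizes here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveAutoEncoder(nn.Module):
    def __init__(self, token_dim: int = 1024, latent_dim: int = 256,
                 levels: tuple = (32, 64, 128, 256)):
        super().__init__()
        self.latent_dim = latent_dim
        self.levels = levels  # usable prefix lengths, coarse -> fine
        self.encoder = nn.Sequential(
            nn.Linear(token_dim, 512), nn.GELU(),
            nn.Linear(512, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.GELU(),
            nn.Linear(512, token_dim),
        )

    def encode(self, tokens: torch.Tensor, level: int) -> torch.Tensor:
        # Keep only the first `levels[level]` latent channels; the rest
        # are refinements that can be streamed later if bandwidth allows.
        z = self.encoder(tokens)
        return z[..., : self.levels[level]]

    def decode(self, z_prefix: torch.Tensor) -> torch.Tensor:
        # Zero-pad the untransmitted refinement channels, then decode
        # back to token space for the cloud-side VLM.
        pad = self.latent_dim - z_prefix.shape[-1]
        z = F.pad(z_prefix, (0, pad))
        return self.decoder(z)


ae = ProgressiveAutoEncoder()
tokens = torch.randn(1, 196, 1024)     # e.g., ViT patch tokens
z_coarse = ae.encode(tokens, level=0)  # shape (1, 196, 32)
tokens_hat = ae.decode(z_coarse)       # shape (1, 196, 1024)
```

On this sketch, the edge would transmit the coarsest prefix first and stream the remaining latent channels only as the network budget allows; the cloud zero-pads whatever prefix has arrived before decoding back to token space, which is what makes the communication-cost/fidelity trade-off tunable.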

Abstract

Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading inference to the cloud is often impractical in bandwidth-limited environments, where transmitting raw visual data introduces considerable latency overhead. While recent edge-cloud collaborative architectures attempt to partition VLM workloads across devices, they typically rely on transmitting fixed-size representations, lacking adaptability to dynamic network conditions and failing to fully exploit semantic redundancy. In this paper, we propose a progressive semantic communication framework for edge-cloud VLM inference, using a Meta AutoEncoder that compresses visual tokens into adaptive, progressively refinable representations, enabling plug-and-play deployment with off-the-shelf VLMs without additional fine-tuning. This design allows flexible transmission at different information levels, providing a controllable trade-off between communication cost and semantic fidelity. We implement a full end-to-end edge-cloud system comprising an embedded NXP i.MX95 platform and a GPU server, communicating over bandwidth-constrained networks. Experimental results show that, at 1 Mbps uplink, the proposed progressive scheme significantly reduces network latency compared to full-edge and full-cloud solutions, while maintaining high semantic consistency even under high compression. The implementation code will be released upon publication at https://github.com/open-ep/ProSemComVLM.
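
To see why shrinking the uplink payload dominates end-to-end latency at 1 Mbps, here is a back-of-envelope serialization-time calculation. The payload sizes (a 400 KB camera frame, fp16 latent channels for 196 tokens) are hypothetical illustrations, not measurements from the paper:

```python
# Back-of-envelope uplink serialization times at the paper's 1 Mbps
# setting. All payload sizes below are assumed for illustration only.
UPLINK_BPS = 1_000_000  # 1 Mbps uplink, as in the paper's experiments


def uplink_seconds(payload_bytes: int) -> float:
    """Time to push a payload onto a link of UPLINK_BPS bits/second."""
    return payload_bytes * 8 / UPLINK_BPS


raw_frame = 400_000       # hypothetical compressed camera frame (bytes)
coarse    = 196 * 32 * 2  # 196 tokens x 32 fp16 channels (assumed)
full_code = 196 * 256 * 2 # all 256 fp16 channels (assumed)

print(f"raw frame : {uplink_seconds(raw_frame):.2f} s")  # 3.20 s
print(f"coarse    : {uplink_seconds(coarse):.3f} s")     # 0.100 s
print(f"full code : {uplink_seconds(full_code):.3f} s")  # 0.803 s
```

Under these assumed numbers, a coarse code crosses the link in roughly a tenth of a second versus several seconds for the raw frame, which is the gap a progressive scheme exploits: semantic fidelity can then be topped up incrementally whenever the link permits.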