A Comparative Study of CNN Optimization Methods for Edge AI: Exploring the Role of Early Exits

arXiv cs.AI / 4/17/2026


Key Points

  • The study compares two main strategies for running CNNs on edge devices—static compression (pruning and quantization) and dynamic computation (early-exit mechanisms)—under realistic, identical conditions.
  • Unlike prior work that often evaluates these approaches in isolation, the authors run ONNX-based inference pipelines on real edge hardware to produce deployment-oriented evidence.
  • The results indicate that pruning and quantization consistently reduce memory footprint, but they cannot adapt computation to each input’s difficulty the way early exits can.
  • Early-exit mechanisms provide input-adaptive latency and compute savings, enabling performance improvements that static methods alone cannot achieve.
  • Combining static compression with early exits can jointly lower inference latency and memory usage while incurring minimal accuracy loss, broadening feasible edge deployment outcomes.
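The input-adaptive behavior described above can be sketched in a few lines. The following is a minimal, library-free illustration of the early-exit idea (not the paper's actual implementation): intermediate classifier heads are tried in order, and inference stops at the first head whose softmax confidence clears a threshold. The function names and the 0.9 threshold are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_infer(x, exit_heads, confidence_threshold=0.9):
    """Try each exit head in order; stop at the first confident one.

    `exit_heads` is a list of callables mapping an input to class
    logits -- stand-ins for a backbone's intermediate classifiers.
    Returns (predicted_class, index_of_exit_taken).
    """
    probs = None
    for i, head in enumerate(exit_heads):
        probs = softmax(head(x))
        if max(probs) >= confidence_threshold:
            # Confident enough: skip the remaining (deeper) computation.
            return probs.index(max(probs)), i
    # No early exit fired; fall back to the final head's prediction.
    return probs.index(max(probs)), len(exit_heads) - 1
```

Easy inputs exit at a shallow head and pay only part of the network's cost, while hard inputs fall through to deeper heads; this is the per-input latency adaptivity that static compression cannot provide.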

Abstract

Deploying deep neural networks on edge devices requires balancing accuracy, latency, and resource constraints under realistic execution conditions. To fit models within these constraints, two broad strategies have emerged: static compression techniques such as pruning and quantization, which permanently reduce model size, and dynamic approaches such as early-exit mechanisms, which adapt computational cost at runtime. While both families are widely studied in isolation, they are rarely compared under identical conditions on physical hardware. This paper presents a unified deployment-oriented comparison of static compression and dynamic early-exit mechanisms, evaluated on real edge devices using ONNX-based inference pipelines. Our results show that static and dynamic techniques offer fundamentally different trade-offs for edge deployment. While pruning and quantization deliver consistent memory footprint reduction, early-exit mechanisms enable input-adaptive computation savings that static methods cannot match. Their combination proves highly effective, simultaneously reducing inference latency and memory usage with minimal accuracy loss, expanding what is achievable at the edge.
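For contrast with the dynamic side, the static techniques the abstract names can also be sketched compactly. Below is a minimal, library-free illustration of magnitude pruning and symmetric int8 quantization on a flat weight list; the paper's actual pipelines run on real edge hardware via ONNX, and these helper names, the 50% sparsity, and the quantization scheme are illustrative assumptions, not the authors' method.

```python
def prune_weights(weights, sparsity=0.5):
    """Magnitude pruning: zero out the smallest-magnitude fraction of weights."""
    k = int(len(weights) * sparsity)  # how many weights to zero
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    drop = set(smallest)
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def quantize_int8(weights):
    """Symmetric linear quantization: map floats to int8 with one scale factor."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in quantized]
```

Both transformations are applied once, offline, and shrink the stored model regardless of which input arrives at runtime; that is why the paper finds their memory savings consistent but their compute cost fixed, in contrast to early exits.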