CPUBone: Efficient Vision Backbone Design for Devices with Low Parallelization Capabilities

arXiv cs.AI / 3/30/2026


Key Points

  • The paper argues that most vision backbone efficiency research targets highly parallel hardware, but CPU-based inference needs a different design approach that emphasizes high MACs per second (MACpS) to sustain low latency.
  • It evaluates two modifications to standard convolutions—grouped convolutions and smaller kernel sizes—that substantially reduce total MACs while aiming to preserve hardware efficiency.
  • Across experiments on multiple CPU devices, the authors show these convolution changes maintain high hardware efficiency despite lowering computational cost.
  • They introduce CPUBone, a new CPU-optimized vision backbone family that achieves strong speed–accuracy trade-offs across a range of CPU hardware.
  • CPUBone’s efficiency is reported to carry over to downstream tasks such as object detection and semantic segmentation, and the models/code are released on GitHub.

Abstract

Recent research on vision backbone architectures has predominantly focused on optimizing efficiency for hardware platforms with high parallel processing capabilities. This category increasingly includes embedded systems such as mobile phones and embedded AI accelerator modules. In contrast, CPUs cannot parallelize operations to the same degree, so models benefit from a specific design philosophy that balances the number of operations (MACs) against hardware-efficient execution, i.e., sustaining a high rate of MACs per second (MACpS). In pursuit of this, we investigate two modifications to standard convolutions aimed at reducing computational cost: grouping convolutions and reducing kernel sizes. While both adaptations substantially decrease the total number of MACs required for inference, sustaining low latency necessitates preserving hardware efficiency. Our experiments across diverse CPU devices confirm that these adaptations successfully retain high hardware efficiency on CPUs. Based on these insights, we introduce CPUBone, a new family of vision backbone models optimized for CPU-based inference. CPUBone achieves state-of-the-art Speed-Accuracy Trade-offs (SATs) across a wide range of CPU devices and effectively transfers its efficiency to downstream tasks such as object detection and semantic segmentation. Models and code are available at https://github.com/altair199797/CPUBone.
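To make the MAC-reduction argument concrete, the cost of a convolution layer can be counted directly. The sketch below (illustrative only; the function name and example shapes are assumptions, not taken from the paper) shows why grouping by a factor g divides the MAC count by g, and why shrinking the kernel from 3x3 to 1x1 divides it by 9:

```python
# Hedged sketch: counting MACs for a 2D convolution layer.
# Names and shapes here are illustrative assumptions, not from the paper.

def conv_macs(h, w, c_in, c_out, k, groups=1):
    """MACs for a k x k convolution producing an h x w output map.

    With `groups` > 1, each output channel only connects to
    c_in // groups input channels, cutting MACs by the group factor.
    """
    assert c_in % groups == 0
    return h * w * c_out * (c_in // groups) * k * k

# Example: a 56x56 feature map, 64 -> 64 channels.
standard = conv_macs(56, 56, 64, 64, k=3)            # baseline 3x3 conv
grouped  = conv_macs(56, 56, 64, 64, k=3, groups=4)  # 4x fewer MACs
small_k  = conv_macs(56, 56, 64, 64, k=1)            # 9x fewer MACs (1x1 vs 3x3)

assert standard == 4 * grouped == 9 * small_k
```

The paper's point is that on CPUs these cheaper layers must also keep MACpS high: a layer with a quarter of the MACs only yields a ~4x latency win if the hardware executes it at a comparable MAC rate.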