DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference
arXiv cs.CV / 5/4/2026
Key Points
- The paper proposes “Split CNN Inference,” which partitions convolutional neural network workloads between a DPU and a GPU to reduce edge-device latency for video and image streaming.
- The approach runs the early CNN layers on the Versal VCK190’s DPU near the data source, then asynchronously pipelines the remaining layers on an NVIDIA RTX 2080 so the two devices overlap work across frames (sketched below).
- It introduces a GNN-based partition-index prediction method that automatically chooses where to split the layers across devices instead of requiring manual partitioning (see the second sketch below).
- Experiments on LeNet-5, ResNet variants, VGG16, and MobileNetV2 show up to 2.48× lower latency than DPU-only execution and up to 3.37× lower than GPU-only execution, with the trained GNN predicting the best split index with 96.27% accuracy.
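
The paper’s actual pipeline is not reproduced here, so the following is a minimal PyTorch sketch of the idea in the second bullet: cut an `nn.Sequential` at a `split_index`, run the head in one thread standing in for the DPU, and let a second thread finish the tail on the GPU while new frames keep arriving. The toy CNN, the queue sizes, and the `split_index` value are all illustrative assumptions, not the authors’ configuration.

```python
# Hypothetical sketch of split CNN inference with an asynchronous two-stage
# pipeline. The DPU is stood in for by a CPU-executed prefix of the network.
import queue
import threading

import torch
import torch.nn as nn

# A small CNN expressed as nn.Sequential so it can be cut at any layer boundary.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)

split_index = 4                       # layers [0, split_index) run on the "DPU"
head = model[:split_index]            # early layers, kept near the data source
tail = model[split_index:]            # remaining layers, offloaded to the GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
tail = tail.to(device)

frames = queue.Queue(maxsize=8)       # raw frames entering the pipeline
features = queue.Queue(maxsize=8)     # intermediate activations from the head

def dpu_stage():
    """Run the head on each frame; a real deployment would call the DPU runtime."""
    while True:
        frame = frames.get()
        if frame is None:             # sentinel: propagate shutdown downstream
            features.put(None)
            return
        with torch.no_grad():
            features.put(head(frame))

def gpu_stage(results):
    """Consume activations and finish inference on the GPU while the head keeps working."""
    while True:
        act = features.get()
        if act is None:
            return
        with torch.no_grad():
            results.append(tail(act.to(device)).argmax(dim=1))

results = []
workers = [threading.Thread(target=dpu_stage),
           threading.Thread(target=gpu_stage, args=(results,))]
for w in workers:
    w.start()
for _ in range(16):                   # feed a short synthetic "video stream"
    frames.put(torch.randn(1, 3, 64, 64))
frames.put(None)                      # signal end of stream
for w in workers:
    w.join()
print(f"processed {len(results)} frames")
```

The bounded queues give the pipeline back-pressure: whichever stage is slower for a given split throttles the other, which is exactly the imbalance a good choice of `split_index` is meant to minimize.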
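
Likewise hedged, here is one plausible shape for the GNN split predictor from the third bullet: treat the CNN’s layers as nodes of a chain graph, attach per-layer cost features, and score each node as a candidate split point. The paper does not spell out its architecture or features; the two-round neighbor-averaging network and the (FLOPs, parameters, activation size) feature triple below are assumptions.

```python
# Hypothetical sketch of a GNN that scores each layer as a candidate split
# point. Architecture and features are illustrative stand-ins.
import torch
import torch.nn as nn

class SplitPredictor(nn.Module):
    """Tiny graph network: two rounds of neighbor averaging, then per-node scoring."""
    def __init__(self, in_feats: int, hidden: int = 32):
        super().__init__()
        self.lin1 = nn.Linear(in_feats, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.score = nn.Linear(hidden, 1)   # one logit per layer/node

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: row-normalized adjacency (with self-loops) over the layer graph;
        # x: per-layer features such as FLOPs, parameter count, activation size.
        h = torch.relu(self.lin1(adj @ x))
        h = torch.relu(self.lin2(adj @ h))
        return self.score(h).squeeze(-1)    # logits over candidate split indices

# Toy layer graph: a 6-layer chain with synthetic per-layer features.
n = 6
x = torch.rand(n, 3)
adj = torch.eye(n)
for i in range(n - 1):                      # chain edges in both directions
    adj[i, i + 1] = adj[i + 1, i] = 1.0
adj = adj / adj.sum(dim=1, keepdim=True)    # row-normalize for mean aggregation

net = SplitPredictor(in_feats=3)
logits = net(x, adj)
print("predicted split index:", logits.argmax().item())
```

Training such a predictor would amount to cross-entropy between these per-layer logits and the profiled-optimal split index for each network in the training set, which matches how the reported 96.27% split-prediction accuracy would be measured.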