On-Device Vision Training, Deployment, and Inference on a Thumb-Sized Microcontroller

arXiv cs.LG · April 28, 2026


Key Points

  • The paper proposes an end-to-end on-device vision ML pipeline that covers data acquisition, training of a small two-layer CNN with Adam optimization, and real-time inference, all running directly on a microcontroller-class device costing $15–40.
  • It reports efficient deployment and inference on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), including 64×64 three-class image classification with ~9 minutes per training run and ~6.3 FPS inference.
  • The authors emphasize practical microcontroller engineering by providing mechanisms such as correct batch-level gradient accumulation, precomputed resize lookup tables, PSRAM-aware memory management, and a single-constant network reconfiguration interface.
  • For deployment convenience without SD cards, the system supports baked-in weight export using a dual-format scheme and an automated three-tier weight priority resolution at boot (SD binary > baked-in header > He initialization).
  • All source code and reference datasets are released under the MIT License, aiming to make the full ML lifecycle transparent and reproducible in a compact C++ implementation (~1,750 lines, Arduino IDE build < 1 minute).
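The batch-level gradient accumulation mentioned above can be illustrated with a small sketch: per-sample gradients are summed over the batch, averaged once, and a single Adam update is applied. This is a hypothetical illustration of the general technique, not the paper's actual code; the struct and function names (`AdamParam`, `accumulate`, `adam_step`) are invented here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One trainable parameter with its Adam state (illustrative names).
struct AdamParam {
    float w = 0.0f;   // weight
    float g = 0.0f;   // gradient accumulated over the batch
    float m = 0.0f;   // first-moment estimate
    float v = 0.0f;   // second-moment estimate
};

// Sum this sample's gradient into the batch accumulator.
void accumulate(std::vector<AdamParam>& p, const std::vector<float>& grad) {
    for (size_t i = 0; i < p.size(); ++i) p[i].g += grad[i];
}

// One Adam step using the batch-averaged gradient; t is the step count (1-based).
void adam_step(std::vector<AdamParam>& p, int batch, int t,
               float lr = 1e-3f, float b1 = 0.9f, float b2 = 0.999f,
               float eps = 1e-8f) {
    for (auto& q : p) {
        float g = q.g / batch;                     // batch-level average, not per-sample
        q.m = b1 * q.m + (1.0f - b1) * g;
        q.v = b2 * q.v + (1.0f - b2) * g * g;
        float mhat = q.m / (1.0f - std::pow(b1, t));   // bias correction
        float vhat = q.v / (1.0f - std::pow(b2, t));
        q.w -= lr * mhat / (std::sqrt(vhat) + eps);
        q.g = 0.0f;                                // reset accumulator for next batch
    }
}
```

The point of doing the division once per step, rather than updating after every sample, is that the Adam moments then track the true batch gradient, which is the correctness issue the paper calls out.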
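The precomputed resize lookup tables can likewise be sketched: for a fixed camera resolution and a fixed 64×64 network input, the source-pixel index for every destination coordinate is computed once, so per-frame downscaling reduces to table lookups. This is a minimal nearest-neighbor sketch under assumed dimensions, not the paper's implementation.

```cpp
#include <cstdint>
#include <vector>

// Precompute, once, which source row/column feeds each destination pixel.
struct ResizeLUT {
    std::vector<uint16_t> x_map, y_map;
    ResizeLUT(int srcW, int srcH, int dstW, int dstH)
        : x_map(dstW), y_map(dstH) {
        for (int x = 0; x < dstW; ++x)
            x_map[x] = static_cast<uint16_t>((x * srcW) / dstW);
        for (int y = 0; y < dstH; ++y)
            y_map[y] = static_cast<uint16_t>((y * srcH) / dstH);
    }
};

// Per-frame resize: pure table lookups, no division or multiplication
// per pixel beyond the row-offset computation.
void resize(const uint8_t* src, int srcW, uint8_t* dst,
            int dstW, int dstH, const ResizeLUT& lut) {
    for (int y = 0; y < dstH; ++y) {
        const uint8_t* row = src + lut.y_map[y] * srcW;
        for (int x = 0; x < dstW; ++x)
            dst[y * dstW + x] = row[lut.x_map[x]];
    }
}
```

On a microcontroller without an FPU-friendly image library, moving the index arithmetic out of the per-frame loop is what makes real-time inference rates plausible.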

Abstract

This paper presents a complete, end-to-end on-device vision machine learning pipeline, comprising data acquisition, two-layer CNN training with Adam optimization, and real-time inference, executing entirely on a microcontroller-class device costing $15–40 USD. Unlike cloud-based workflows that require external infrastructure and conceal the computational pipeline from the practitioner, this system implements every step of the core ML lifecycle in approximately 1,750 lines of readable C++ that compiles in under one minute using the Arduino IDE, with no external ML dependencies. Running on the Seeed Studio ESP32-S3 XIAO ML Kit (8 MB PSRAM), the firmware achieves three-class 64×64 image classification in approximately 9 minutes per training run, with real-time inference at 6.3 FPS. Key contributions include: correct batch-level gradient accumulation; pre-computed resize lookup tables for inference; dual-format weight export for SD-free baked-in deployment; a three-tier weight priority system (SD binary > baked-in header > He initialization) resolved automatically at boot; a single-constant network reconfiguration interface; and PSRAM-aware memory management suited to microcontroller constraints. All source code and reference datasets are released under the MIT License at https://github.com/webmcu-ai/on-device-vision-ai
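The three-tier weight priority could take a shape like the following boot-time resolver: try the SD binary first, fall back to the baked-in header, and otherwise start from He-initialized random weights. The function names and boolean probes here are assumptions standing in for the firmware's real SD and header checks; `he_stddev` shows the standard He scaling, σ = √(2 / fan_in), for the final fallback.

```cpp
#include <cmath>

enum class WeightSource { SdBinary, BakedHeader, HeInit };

// Hypothetical boot-time resolution mirroring the stated priority:
// SD binary > baked-in header > He initialization.
WeightSource resolve_weights(bool sd_file_present, bool baked_header_present) {
    if (sd_file_present) return WeightSource::SdBinary;
    if (baked_header_present) return WeightSource::BakedHeader;
    return WeightSource::HeInit;  // fresh random weights, He-scaled
}

// Standard deviation for He initialization given a layer's fan-in.
float he_stddev(int fan_in) {
    return std::sqrt(2.0f / static_cast<float>(fan_in));
}
```

Resolving the source once at boot keeps the rest of the firmware indifferent to where the weights came from, which is what makes SD-free, baked-in deployment a drop-in option.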