I Ran Five Small Multimodal Models on a Jetson. The Fastest One Was Not the Best Baseline.
Dev.to / 6/18/2026
Key Points
- The author built WearEdge Pro, a wearable industrial edge AI runtime that outputs structured “action cards” (not chat) with audit trails, mode boundaries, and human confirmation for workflows like maintenance and safety.
- They benchmarked five compact multimodal models on a Jetson using the same image/text prompts and a fixed “gateway” budget (560 image tokens, plus an extra 1024-token pass for Qwen2.5-VL to improve grounding).
- Gemma 4 E2B showed the strongest overall behavior and was named the “best product baseline,” while Qwen2.5-VL-3B was the best challenger, with particularly strong changeover OCR and useful IQC defect scoring.
- SmolVLM2-2.2B was the fastest but often returned overly generic or placeholder-like fields that lacked grounded industrial guidance. InternVL3-2B proved too slow and too risky: it hit context failures at lower context settings and used unsafe-sounding wording even when it completed.
- Qwen2.5-Omni-3B ran cleanly, but the author suggests its biggest value may be in future audio/video-extended branches rather than as the immediate best baseline for this structured edge agent task.
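To make the “action card” idea and the placeholder-field problem concrete, here is a minimal sketch of what such a card and a grounding check might look like. Everything in it is an assumption for illustration: the summary does not show WearEdge Pro's actual schema, so the `ActionCard` fields, the allowed mode names, the placeholder list, and `validate_card` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical mode boundary; the summary only names workflows
# "like maintenance and safety".
ALLOWED_MODES = {"maintenance", "safety"}

# Hypothetical values treated as placeholder-like (ungrounded) output.
PLACEHOLDER_VALUES = {"", "n/a", "none", "unknown", "tbd", "placeholder"}


@dataclass
class ActionCard:
    """Illustrative structured output, not the author's real schema."""
    mode: str
    action: str
    evidence: str
    requires_confirmation: bool = True
    confirmed_by: Optional[str] = None
    audit_log: List[str] = field(default_factory=list)


def looks_like_placeholder(text: str) -> bool:
    """Return True when a field is empty or a known filler value."""
    return text.strip().lower() in PLACEHOLDER_VALUES


def validate_card(card: ActionCard) -> List[str]:
    """Collect reasons the card should not be acted on yet."""
    issues: List[str] = []
    if card.mode not in ALLOWED_MODES:
        issues.append(f"mode '{card.mode}' is outside the allowed boundary")
    for name in ("action", "evidence"):
        if looks_like_placeholder(getattr(card, name)):
            issues.append(f"field '{name}' looks like a placeholder")
    if card.requires_confirmation and card.confirmed_by is None:
        issues.append("awaiting human confirmation")
    # Record every validation pass so the card carries its own audit trail.
    card.audit_log.append(f"validated: {len(issues)} issue(s)")
    return issues
```

A card that names a concrete action, cites evidence, and carries a confirming operator passes with no issues, while a fast but generic response such as `ActionCard(mode="chat", action="TBD", evidence="")` is rejected on every check. That is one way the fastest model could still lose as a baseline.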