I Ran Five Small Multimodal Models on a Jetson. The Fastest One Was Not the Best Baseline.

Dev.to / 6/18/2026



Key Points

The author built WearEdge Pro, a wearable industrial edge AI runtime that outputs structured “action cards” (not chat) with audit trails, mode boundaries, and human confirmation for workflows like maintenance and safety.
They benchmarked five compact multimodal models on a Jetson using the same image and text prompts and a fixed “gateway” budget of 560 image tokens; Qwen2.5-VL also got an extra 1024-token pass to improve grounding.
Gemma 4 E2B showed the strongest overall behavior and was named the “best product baseline,” while Qwen2.5-VL-3B was the best challenger, with particularly strong changeover OCR and useful IQC defect scoring.
SmolVLM2-2.2B was the fastest, but it often returned overly generic or placeholder-like fields that lacked grounded industrial guidance. InternVL3-2B proved too slow and too risky: it hit context failures at lower context settings and used unsafe-sounding wording even when it completed.
Qwen2.5-Omni-3B ran cleanly, but the author suggests its biggest value may be in future audio/video-extended branches rather than as the immediate best baseline for this structured edge agent task.
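
The summary does not show what an “action card” looks like in WearEdge Pro, so the following is only a minimal sketch under stated assumptions. The `ActionCard` class, its field names, the `ALLOWED_MODES` set, the `PLACEHOLDER_MARKERS` list, and the `placeholder_fields` helper are all hypothetical and are not taken from the article. The sketch illustrates three ideas from the points above: a mode boundary, a human confirmation step that writes to an audit trail, and a simple check for the kind of placeholder-like fields reported for the fastest model.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical mode names, borrowed from workflows mentioned in the summary.
ALLOWED_MODES = {"maintenance", "safety", "changeover", "iqc"}

# Hypothetical markers for placeholder-like output; the article lists none.
PLACEHOLDER_MARKERS = {"", "n/a", "todo", "unknown", "placeholder", "..."}


@dataclass
class ActionCard:
    """Illustrative structured output; not the author's actual schema."""
    mode: str
    summary: str
    steps: List[str]
    requires_confirmation: bool = True
    audit_log: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Mode boundary: reject cards outside the known workflows.
        if self.mode not in ALLOWED_MODES:
            raise ValueError(f"mode {self.mode!r} is outside the allowed modes")

    def confirm(self, operator: str) -> None:
        # Human confirmation is recorded in the audit trail.
        self.audit_log.append(f"confirmed by {operator}")
        self.requires_confirmation = False


def is_placeholder(text: str) -> bool:
    """True if the text is empty or matches a known placeholder marker."""
    return text.strip().lower() in PLACEHOLDER_MARKERS


def placeholder_fields(card: ActionCard) -> List[str]:
    """Return the names of fields that look empty or placeholder-like."""
    flagged = []
    if is_placeholder(card.summary):
        flagged.append("summary")
    if not card.steps or any(is_placeholder(step) for step in card.steps):
        flagged.append("steps")
    return flagged
```

A card that passes `placeholder_fields` with an empty result is not necessarily well grounded; exact-match markers only catch the most obvious filler, and a real evaluation would still need a human or task-specific review.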

