I Ran Five Small Multimodal Models on a Jetson. The Fastest One Was Not the Best Baseline.

Dev.to / 6/18/2026

💬 Opinion | Developer Stack & Infrastructure | Signals & Early Trends | Tools & Practical Usage | Models & Research

Key Points

  • The author built WearEdge Pro, a wearable industrial edge AI runtime that outputs structured “action cards” (not chat) with audit trails, mode boundaries, and human confirmation for workflows like maintenance and safety.
  • They benchmarked five compact multimodal models on a Jetson using the same image/text prompts and a fixed “gateway” budget (560 image tokens, plus an extra 1024-token pass for Qwen2.5-VL to improve grounding).
  • Gemma 4 E2B showed the strongest overall behavior and was judged the “best product baseline,” while Qwen2.5-VL-3B was the best challenger, with particularly strong changeover OCR and useful IQC defect scoring.
  • SmolVLM2-2.2B was the fastest but often returned overly generic or placeholder-like fields that lacked grounded industrial guidance. InternVL3-2B proved too slow and too risky: it hit context failures at the lower context setting and used unsafe-sounding wording even when it completed.
  • Qwen2.5-Omni-3B ran cleanly, but the author suggests its biggest value may be in future audio/video-extended branches rather than as the immediate best baseline for this structured edge agent task.
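
The summary describes action cards with audit trails, mode boundaries, and human confirmation, and notes that some models filled fields with placeholder-like text, but it does not show WearEdge Pro's actual schema. The following is only a minimal Python sketch of what such a card could look like. Every name here (`ActionCard`, `ALLOWED_MODES`, `PLACEHOLDER_VALUES`, `validate`, `confirm`) is a hypothetical illustration, not the author's implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical mode boundary: the summary only names maintenance and safety.
ALLOWED_MODES = {"maintenance", "safety"}

# Hypothetical examples of generic, ungrounded field values.
PLACEHOLDER_VALUES = {"", "n/a", "none", "unknown", "todo", "tbd", "string"}


@dataclass
class ActionCard:
    mode: str
    action: str
    evidence: str
    confirmed_by: Optional[str] = None
    audit_trail: List[str] = field(default_factory=list)

    def log(self, event: str) -> None:
        # Append a UTC-timestamped entry to the audit trail.
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append(f"{stamp} {event}")


def looks_placeholder(value: str) -> bool:
    # True when a field is empty or matches a known generic value.
    return value.strip().lower() in PLACEHOLDER_VALUES


def validate(card: ActionCard) -> List[str]:
    # Collect every problem instead of stopping at the first one.
    problems = []
    if card.mode not in ALLOWED_MODES:
        problems.append(f"mode '{card.mode}' is outside the allowed modes")
    for name in ("action", "evidence"):
        if looks_placeholder(getattr(card, name)):
            problems.append(f"field '{name}' looks like a placeholder")
    return problems


def confirm(card: ActionCard, operator: str) -> None:
    # Human confirmation is refused for cards that fail validation.
    problems = validate(card)
    if problems:
        raise ValueError("; ".join(problems))
    card.confirmed_by = operator
    card.log(f"confirmed by {operator}")
```

In this sketch, a card whose fields are generic or whose mode falls outside the allowed set can never be confirmed, which is one way to keep a fast but ungrounded model from driving an action.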
