Hey everyone, I did a small personal benchmark on using local models to detect UI icons from application screenshots. English is not my first language, so sorry for any grammar mistakes! I just wanted to share what I found in case it helps someone doing similar stuff. Models tested (no quantization):
Approach: I feed the app screenshot into the LLM and ask it to recognize the UI icons and return the bbox_2d coordinates. After it gives me the coordinates, I use supervision to draw red bounding boxes on the image. Finally, I just check the results manually by eye. For the setup, I used the newest vLLM v0.19.1 doing offline inference. I set the starting temperature to 0 because I want the most confident output. If the model returns 0 icons, I gradually increase the temperature: 0 -> 0.3 -> 0.6 -> 0.9.

Overall Results: Overall, the dense models are much better than the MoE models for this task. My ranking: Qwen3.5 > Qwen3.6 ≈ Gemma4. Some specific findings:
Here are the detailed vLLM parameters: Has anyone else tried UI element detection with local models recently? Curious if you guys have any tricks for getting better bounding boxes.
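The post doesn't list its exact vLLM parameters, but an offline-inference setup along the lines it describes might look roughly like this (a hedged sketch: the model checkpoint name, image path, prompt wording, and token limits are placeholder assumptions, not the author's actual settings):

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint; the post does not name the exact model IDs it used.
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=8192)

# temperature=0 for the first, most confident pass; the post escalates
# to 0.3 / 0.6 / 0.9 only when zero icons come back.
params = SamplingParams(temperature=0.0, max_tokens=1024)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "file:///path/to/screenshot.png"}},
        {"type": "text",
         "text": "Detect all UI icons in this screenshot and return a JSON "
                 "list of {\"label\": ..., \"bbox_2d\": [x1, y1, x2, y2]}."},
    ],
}]

outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```

This is a configuration sketch, not a runnable benchmark: it needs a GPU, the model weights, and a real screenshot on disk.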
UI Icon Detection with Qwen3.5, Qwen3.6 and Gemma4
Reddit r/LocalLLaMA / 4/19/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The post describes a small benchmark for detecting UI icons (from application screenshots) using local multimodal LLMs that output bbox_2d coordinates.
- The pipeline feeds screenshots to the LLM, uses supervision to draw red bounding boxes, and then relies on manual visual checking of results.
- Using non-quantized local models with vLLM 0.19.1 in offline inference, the author finds that dense models outperform MoE models for this specific icon-detection task.
- The reported ranking is Qwen3.5 > Qwen3.6 ≈ Gemma4, with notable failures such as Gemma4 failing to detect any icons in a Cursor IDE screenshot and Qwen3.6 producing incorrect “giant icon” detections on Photoshop.
- The author also experiments with generation temperature (starting at 0 and increasing up to 0.9 if no icons are returned) to encourage more confident outputs and recovery when detections fail.
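The temperature-escalation retry described in the last point can be sketched as follows (a minimal sketch: `query_model` is a hypothetical stand-in for the actual vLLM call, and the JSON output shape is assumed from the post's mention of bbox_2d; the supervision drawing step is reduced to a comment):

```python
import json

# Temperatures tried in order: start at 0 for the most confident output,
# escalate only if the model returns no icons.
TEMPERATURES = [0.0, 0.3, 0.6, 0.9]

def parse_icons(raw: str) -> list[dict]:
    """Parse the model's raw text into a list of
    {"label": ..., "bbox_2d": [x1, y1, x2, y2]} dicts (assumed format)."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    return [it for it in items
            if isinstance(it, dict) and len(it.get("bbox_2d", [])) == 4]

def detect_icons(query_model, image) -> list[dict]:
    """Retry with increasing temperature until icons come back.

    `query_model(image, temperature)` is a hypothetical callable wrapping
    the vLLM inference; it returns the raw model output string.
    """
    for temp in TEMPERATURES:
        icons = parse_icons(query_model(image, temperature=temp))
        if icons:
            return icons  # first non-empty detection wins
    return []  # nothing found even at temperature 0.9

# The post then draws red boxes with supervision, along the lines of:
#   detections = sv.Detections(xyxy=np.array([i["bbox_2d"] for i in icons]))
#   annotated = box_annotator.annotate(scene=image, detections=detections)
```

The escalation stops at the first non-empty result, so a confident temperature-0 answer is never overwritten by a noisier high-temperature one.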