Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real.
Reddit r/LocalLLaMA / 5/3/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Hey guys, a couple of weeks ago I asked this sub for the hardest vision use cases you were dealing with, so I could test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side by side locally on vLLM (FP8 quants) using my custom GUI. If you look at the benchmarks, Qwen should win, but from my testing it's really the opposite. Looks like benchmaxing. I've attached a comparison of scores below.

Since official benchmarks are pretty much gamed at this point, I threw real-world, unoptimized junk at them: weird memes, complex GeoGuessr spots, ugly handwritten notes, shopping lists, bounding box requests, and dynamic gym videos. Here are the 7 biggest behavioral differences and quirks I found:

- Did Qwen 3.6 fix the "Overthinking" token burn?
- Bounding Boxes & Scaling: Qwen still fights instructions
- The Cultural Divide (Memes & GeoGuessr)
- Qwen 3.6 is an upgrade for video tracking
- AI video detection is still a coin toss
- Don't trust your inference engine's default visual token budget for Gemma
- Video Pipeline Friction: Gemma eats raw video, Qwen demands 2 FPS (see the sampling sketch below)

Resources:
Let me know how you've been using them so far.
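On that last point: Qwen-style video pipelines typically want a pre-sampled sequence of frames rather than the raw file Gemma accepts. A minimal sketch of 2 FPS frame sampling with OpenCV (the function name and the 2.0 default are illustrative, not from the post):

```python
import cv2

def sample_frames(video_path: str, target_fps: float = 2.0):
    """Pull roughly target_fps frames per second out of a video file."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))   # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV reads BGR; VLMs expect RGB
        idx += 1
    cap.release()
    return frames
```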
Key Points
- The author tested the newly dropped Qwen 3.6 against Gemma 4 using local vLLM FP8 runs on 27B/31B vision models, finding that real-world behavior can contradict benchmark rankings (a minimal harness sketch follows this list).
- In “overthinking” scenarios, Qwen 3.6 shows improved thinking token usage on simple prompts, but still collapses into long reasoning loops on obscure cases and may fail to produce final answers, while Gemma 4 stays more concise.
- For instruction-sensitive tasks like bounding boxes and polygon/segmentation-style outputs, Gemma 4 follows coordinate/format requirements (e.g., normalized 0–1 JSON) more reliably than Qwen, which often outputs unscaled 0–1000 coordinates in messy formats (a rescaling sketch follows this list).
- The tests suggest training-data cultural/regional bias: Gemma 4 handles European/Western visual knowledge better, while Qwen 3.6 performs relatively better for Asian-context content.
- The overall takeaway is that benchmark scores, inflated by "benchmaxing" (benchmark-focused optimization), are a poor predictor of practical performance, so stress-testing on messy, unoptimized, real tasks is important.
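To ground the test setup described in the first point, here is a minimal sketch of the kind of side-by-side run the author describes, using vLLM's offline multimodal API with FP8 quantization. The model ID, image file, and the chat template with the "<image>" placeholder are assumptions (the post names no exact checkpoints, and the placeholder token is model-specific):

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Model ID is a placeholder; the post doesn't name the exact checkpoint.
llm = LLM(model="Qwen/Qwen3.6-VL-31B-Instruct", quantization="fp8")

image = Image.open("geoguessr_spot.jpg")
params = SamplingParams(temperature=0.0, max_tokens=256)

# vLLM's offline multimodal API: pass the image alongside the prompt.
# The "<image>" placeholder is model-specific; check the model card.
outputs = llm.generate(
    {"prompt": "USER: <image>\nWhere was this photo taken? ASSISTANT:",
     "multi_modal_data": {"image": image}},
    params,
)
print(outputs[0].outputs[0].text)
```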
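On the bounding-box point, a minimal sketch of rescaling Qwen-style 0–1000 coordinates to pixel space. The `bbox_2d` and `label` field names and the JSON shape are assumptions based on common Qwen-VL output formats, not taken from the post:

```python
import json

def rescale_qwen_boxes(raw: str, width: int, height: int):
    """Map boxes from Qwen-style 0-1000 coordinate space onto a width x height image."""
    out = []
    for box in json.loads(raw):
        x1, y1, x2, y2 = box["bbox_2d"]
        out.append({
            "label": box.get("label", ""),
            # divide by 1000 to normalize, then scale to real pixel dimensions
            "bbox": [x1 / 1000 * width, y1 / 1000 * height,
                     x2 / 1000 * width, y2 / 1000 * height],
        })
    return out
```

Dropping the `* width` / `* height` factors leaves the normalized 0–1 form that, per the post, Gemma 4 emits directly when asked.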