Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real.
Reddit r/LocalLLaMA / 5/3/2026
💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Hey guys, a couple of weeks ago I asked this sub for the hardest vision use cases you were dealing with, so I could test the newly dropped Qwen 3.6 against Gemma 4. I finally finished running the gauntlet side by side locally on vLLM (FP8 quants) using my custom GUI. If you look at the benchmarks, Qwen should win, but from my testing it's really the opposite. Looks like benchmaxing. I've attached a comparison of scores below.

Since official benchmarks are pretty much gamed at this point, I threw real-world, unoptimized junk at them: weird memes, complex GeoGuessr spots, ugly handwritten notes, shopping lists, bounding box requests, and dynamic gym videos. Here are the 7 biggest behavioral differences and quirks I found:

- Did Qwen 3.6 fix the "Overthinking" token burn?
- Bounding Boxes & Scaling: Qwen still fights instructions
- The Cultural Divide (Memes & GeoGuessr)
- Qwen 3.6 is an upgrade for video tracking
- AI video detection is still a coin toss
- Don't trust your inference engine's default visual token budget for Gemma
- Video Pipeline Friction: Gemma eats raw video, Qwen demands 2 FPS (see the sampling sketch below)

Resources:
Let me know how you've been using them so far.
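On that last point: Qwen-style video pipelines typically want a pre-sampled sequence of frames rather than the raw file Gemma accepts. A minimal sketch of 2 FPS frame sampling with OpenCV (the function name and the 2.0 default are illustrative, not from the post):

```python
import cv2

def sample_frames(video_path: str, target_fps: float = 2.0):
    """Pull roughly target_fps frames per second out of a video file."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))   # keep every step-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV reads BGR; VLMs expect RGB
        idx += 1
    cap.release()
    return frames
```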
Key Points
- The author tested the newly dropped Qwen 3.6 against Gemma 4 using local vLLM FP8 runs on 27B/31B vision models, finding that real-world behavior can contradict benchmark rankings (a minimal harness sketch follows this list).
- In “overthinking” scenarios, Qwen 3.6 shows improved thinking token usage on simple prompts, but still collapses into long reasoning loops on obscure cases and may fail to produce final answers, while Gemma 4 stays more concise.
- For instruction-sensitive tasks like bounding boxes and polygon/segmentation-style outputs, Gemma 4 follows coordinate/format requirements (e.g., normalized 0–1 JSON) more reliably than Qwen, which often outputs unscaled 0–1000 coordinates in messy formats (a rescaling sketch follows this list).
- The tests suggest training-data cultural/regional bias: Gemma 4 handles European/Western visual knowledge better, while Qwen 3.6 performs relatively better for Asian-context content.
- The overall takeaway is that benchmark scores, inflated by "benchmaxing" (benchmark-focused optimization), are a poor predictor of practical performance, so stress-testing on messy, unoptimized, real tasks is important.
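To ground the test setup described in the first point, here is a minimal sketch of the kind of side-by-side run the author describes, using vLLM's offline multimodal API with FP8 quantization. The model ID, image file, and the chat template with the "<image>" placeholder are assumptions (the post names no exact checkpoints, and the placeholder token is model-specific):

```python
from vllm import LLM, SamplingParams
from PIL import Image

# Model ID is a placeholder; the post doesn't name the exact checkpoint.
llm = LLM(model="Qwen/Qwen3.6-VL-31B-Instruct", quantization="fp8")

image = Image.open("geoguessr_spot.jpg")
params = SamplingParams(temperature=0.0, max_tokens=256)

# vLLM's offline multimodal API: pass the image alongside the prompt.
# The "<image>" placeholder is model-specific; check the model card.
outputs = llm.generate(
    {"prompt": "USER: <image>\nWhere was this photo taken? ASSISTANT:",
     "multi_modal_data": {"image": image}},
    params,
)
print(outputs[0].outputs[0].text)
```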
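On the bounding-box point, a minimal sketch of rescaling Qwen-style 0–1000 coordinates to pixel space. The `bbox_2d` and `label` field names and the JSON shape are assumptions based on common Qwen-VL output formats, not taken from the post:

```python
import json

def rescale_qwen_boxes(raw: str, width: int, height: int):
    """Map boxes from Qwen-style 0-1000 coordinate space onto a width x height image."""
    out = []
    for box in json.loads(raw):
        x1, y1, x2, y2 = box["bbox_2d"]
        out.append({
            "label": box.get("label", ""),
            # divide by 1000 to normalize, then scale to real pixel dimensions
            "bbox": [x1 / 1000 * width, y1 / 1000 * height,
                     x2 / 1000 * width, y2 / 1000 * height],
        })
    return out
```

Dropping the `* width` / `* height` factors leaves the normalized 0–1 form that, per the post, Gemma 4 emits directly when asked.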