Hi guys, I've been running experiments on Qwen 3.5 Vision hard for a few weeks on vLLM + llama.cpp in Docker. A few things I found out:
1. Long-video OOM almost always comes down to three vLLM flags: `--max-model-len`, `--max-num-batched-tokens`, `--max-num-seqs`. A 1h45m video can hit 18k+ visual tokens and blow past the 16k default before inference even starts. Chunk at the application level (≤300s segments), free the KV cache between chunks, then do a second-pass summary; that way it runs even on low local resources.
2. Segment overlap matters. Naive chunking splits events at boundaries. Even 2 seconds of overlap recovers meaningful context, and 10s is better if your context budget allows it.
3. Preprocessing is the most underrated lever. 1 FPS + 360px height cut a 1m40s video from ~7s to ~3.5s inference with acceptable accuracy. Do it yourself rather than leaving it to vLLM: it takes longer otherwise, probably because the full-size video gets fed into the engine, and preprocessing time is a bigger fraction of total latency than most people assume. For images, 256px was the sweet spot (at 128px the model couldn't recognize cats).
4. Stable image vs. nightly. `vllm/vllm-openai:latest` had lower latency than the nightly build in my runs, despite nightly being recommended for Blackwell. Test both on your hardware before assuming newer = faster.
5. Structured outputs: wire in instructor. The 4B will produce malformed JSON even with explicit prompt instructions. Use instructor + a Pydantic schema with automatic retry if you're piping chunk results to downstream code.
6. Concurrency speedup is real. 2 parallel requests → ~24% faster. 10 concurrent sequences → ~70–78% throughput improvement depending on attention backend.
I put everything I used for testing in a repo if anybody is interested. It has Docker Compose configs for 0.8B / 4B / 27B-FP8 etc., benchmark results, and a Gradio app to test preprocessing and chunking parameters without writing any code.
Just `uv sync` and run: github.com/lukaLLM/Qwen_3_5_Vision_Setup_Dockers
Curious if anyone has found other ways to squeeze more juice out of it, or any interesting vision tasks you guys have been running?
Qwen 3.5 Vision on vLLM + llama.cpp — 6 things I found out after a few weeks of testing (preprocessing speedups, concurrency).
Reddit r/LocalLLaMA / 4/2/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The author reports that Qwen 3.5 Vision long-video runs are often OOM due to vLLM settings like `--max-model-len`, `--max-num-batched-tokens`, and `--max-num-seqs`, and recommends chunking videos into ≤300s segments while freeing the KV cache between chunks plus optional second-pass summarization.
- They emphasize that chunk segment overlap materially improves results, with even ~2 seconds of overlap recovering important context and ~10 seconds performing better when the context budget allows.
- The biggest practical speed lever is preprocessing: reducing frame rate (e.g., ~1 FPS) and image resolution (e.g., ~360px height; images around 256px) can cut inference time substantially compared with letting the engine ingest full-size video.
- Model/container choice matters for latency: the author found the stable `vllm/vllm-openai:latest` image sometimes outperformed nightly builds on their hardware, so they advise testing rather than assuming newer always runs faster.
- For downstream integration, the author recommends structured output validation by combining instructor + Pydantic schemas with automatic retry, since even the 4B model may output malformed JSON.
- They find concurrency improves throughput in practice (e.g., 2 parallel requests were ~24% faster, and 10 concurrent sequences increased throughput by ~70–78% depending on attention backend), and they provide a Docker-based repo and Gradio app to benchmark preprocessing/chunking parameters.
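The chunking scheme from the first two points (≤300s segments with a few seconds of overlap between consecutive segments) can be sketched in plain Python. The function name and signature below are illustrative, not taken from the author's repo:

```python
def chunk_spans(duration_s, chunk_s=300, overlap_s=10):
    """Split a video of duration_s seconds into (start, end) spans of at
    most chunk_s seconds, with overlap_s seconds shared between
    consecutive spans (values taken from the post's recommendations)."""
    spans, start = [], 0.0
    step = chunk_s - overlap_s  # advance less than a full chunk to overlap
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start += step
    return spans

# A 1h45m video (6300 s), the length the post flags as blowing the 16k default:
print(len(chunk_spans(6300)))  # → 22 spans
```

Each span would then be transcoded and sent to the model independently, freeing the KV cache in between as the post describes.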
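The preprocessing lever (downsampling to 1 FPS and 360px height before the engine ever sees the video) maps onto a standard ffmpeg filter chain. This helper only builds the command line; the paths are placeholders, and actually running it requires ffmpeg on PATH:

```python
def ffmpeg_preprocess_cmd(src, dst, fps=1, height=360):
    """Build an ffmpeg command that drops the frame rate to `fps` and
    scales frames to `height` px tall (width auto-computed, kept even
    via -2). Defaults mirror the settings reported in the post."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"fps={fps},scale=-2:{height}",
        dst,
    ]

print(ffmpeg_preprocess_cmd("clip.mp4", "clip_small.mp4"))
```

Run the result with `subprocess.run(cmd, check=True)` before sending the output file to the inference server; per the post, doing this yourself beats letting vLLM ingest the full-size video.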
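The instructor + Pydantic recommendation boils down to "parse, validate, retry on failure." Here is a stdlib-only stand-in for that loop — not instructor's actual API, just a minimal sketch of the behavior it automates, with a toy generator that fails once before producing valid JSON:

```python
import json

def call_with_retries(generate, validate, max_retries=3):
    """Minimal stand-in for instructor's retry loop: call the model,
    parse the JSON, validate it, and retry if either step fails."""
    last_err = None
    for _ in range(max_retries):
        raw = generate()
        try:
            return validate(json.loads(raw))
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err  # malformed output; ask the model again
    raise RuntimeError(f"no valid output after {max_retries} tries: {last_err}")

def check(obj):
    """Schema check standing in for a Pydantic model."""
    if not isinstance(obj.get("events"), list):
        raise ValueError("events must be a list")
    return obj

# Toy model: emits malformed JSON first (as the post says the 4B does),
# then a valid payload on the retry.
attempts = iter(['{"events": [}', '{"events": ["cat enters frame"]}'])
print(call_with_retries(lambda: next(attempts), check))
# → {'events': ['cat enters frame']}
```

With the real libraries, `instructor` patches an OpenAI-compatible client so `response_model=YourPydanticModel` plus `max_retries` handles all of this per request.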
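On the client side, the concurrency result just means firing several requests at once and letting vLLM's continuous batching absorb them. A toy asyncio sketch, with a sleep standing in for the real OpenAI-compatible API call (the throughput figures themselves come from the post's benchmarks, not this snippet):

```python
import asyncio
import time

async def fake_request(i, latency_s=0.05):
    """Stand-in for one chat-completions request to the vLLM server."""
    await asyncio.sleep(latency_s)
    return i

async def run_concurrent(n):
    # Issue all n requests at once; the server batches them internally.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

t0 = time.perf_counter()
results = asyncio.run(run_concurrent(10))
elapsed = time.perf_counter() - t0
# 10 concurrent fake requests finish in ~one request's latency,
# not 10x, which is the effect the post measures on real hardware.
print(results, round(elapsed, 3))
```

Swap `fake_request` for an `AsyncOpenAI` client call pointed at the vLLM endpoint to reproduce the author's 2- and 10-way concurrency tests.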
Related Articles

Black Hat USA
AI Business

Black Hat Asia
AI Business

Z.ai Launches GLM-5V-Turbo: A Native Multimodal Vision Coding Model Optimized for OpenClaw and High-Capacity Agentic Engineering Workflows Everywhere
MarkTechPost

How I Started Using AI Agents for End-to-End Testing (Autonoma AI)
Dev.to

How We Built an AI Coach That Understands PTSD — And Why It Matters
Dev.to