Hi guys, I've been running experiments on Qwen 3.5 Vision hard for a few weeks on vLLM + llama.cpp in Docker. A few things I found out:
1. Long-video OOM almost always comes down to three vLLM flags: `--max-model-len`, `--max-num-batched-tokens`, `--max-num-seqs`. A 1h45m video can hit 18k+ visual tokens and blow past the 16k default before inference even starts. Chunk at the application level (≤300s segments), free the KV cache between chunks, then do a second-pass summary; that way it runs even on low local resources.
2. Segment overlap matters. Naive chunking splits events at boundaries. Even 2 seconds of overlap recovers meaningful context, and 10s is better if your context budget allows it.
3. Preprocessing is the most underrated lever. 1 FPS + 360px height cut a 1m40s video from ~7s to ~3.5s inference with acceptable accuracy. Do it yourself rather than leaving it to vLLM: it takes longer otherwise, probably because the full-size video gets fed into the engine, and preprocessing time is a bigger fraction of total latency than most people assume. For images, 256px was the sweet spot (at 128px the model couldn't recognize cats).
4. Stable image vs. nightly. `vllm/vllm-openai:latest` had lower latency than the nightly build in my runs, despite nightly being recommended for Blackwell. Test both on your hardware before assuming newer = faster.
5. Structured outputs: wire in instructor. The 4B will produce malformed JSON even with explicit prompt instructions. Use instructor + a Pydantic schema with automatic retry if you're piping chunk results to downstream code.
6. Concurrency speedup is real. 2 parallel requests → ~24% faster. 10 concurrent sequences → ~70–78% throughput improvement depending on attention backend.
I put everything I used for testing in a repo if anybody is interested. It has Docker Compose configs for 0.8B / 4B / 27B-FP8 etc., benchmark results, and a Gradio app to test preprocessing and chunking parameters without writing any code.
Just `uv sync` and run: github.com/lukaLLM/Qwen_3_5_Vision_Setup_Dockers
Curious if anyone has found other ways to squeeze more juice out of it, or any interesting vision tasks you guys have been running?
Qwen 3.5 Vision on vLLM + llama.cpp — 6 things I found out after a few weeks of testing (preprocessing speedups, concurrency).
Reddit r/LocalLLaMA / 4/2/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The author reports that Qwen 3.5 Vision long-video runs are often OOM due to vLLM settings like `--max-model-len`, `--max-num-batched-tokens`, and `--max-num-seqs`, and recommends chunking videos into ≤300s segments while freeing the KV cache between chunks plus optional second-pass summarization.
- They emphasize that chunk segment overlap materially improves results, with even ~2 seconds of overlap recovering important context and ~10 seconds performing better when the context budget allows.
- The biggest practical speed lever is preprocessing: reducing frame rate (e.g., ~1 FPS) and image resolution (e.g., ~360px height; images around 256px) can cut inference time substantially compared with letting the engine ingest full-size video.
- Model/container choice matters for latency: the author found the stable `vllm/vllm-openai:latest` image sometimes outperformed nightly builds on their hardware, so they advise testing rather than assuming newer always runs faster.
- For downstream integration, the author recommends structured output validation by combining instructor + Pydantic schemas with automatic retry, since even the 4B model may output malformed JSON.
- They find concurrency improves throughput in practice (e.g., 2 parallel requests were ~24% faster, and 10 concurrent sequences increased throughput by ~70–78% depending on attention backend), and they provide a Docker-based repo and Gradio app to benchmark preprocessing/chunking parameters.
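The chunking scheme from the first two points (≤300s segments with a few seconds of overlap between consecutive segments) can be sketched in plain Python. The function name and signature below are illustrative, not taken from the author's repo:

```python
def chunk_spans(duration_s, chunk_s=300, overlap_s=10):
    """Split a video of duration_s seconds into (start, end) spans of at
    most chunk_s seconds, with overlap_s seconds shared between
    consecutive spans (values taken from the post's recommendations)."""
    spans, start = [], 0.0
    step = chunk_s - overlap_s  # advance less than a full chunk to overlap
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start += step
    return spans

# A 1h45m video (6300 s), the length the post flags as blowing the 16k default:
print(len(chunk_spans(6300)))  # → 22 spans
```

Each span would then be transcoded and sent to the model independently, freeing the KV cache in between as the post describes.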
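The preprocessing lever (downsampling to 1 FPS and 360px height before the engine ever sees the video) maps onto a standard ffmpeg filter chain. This helper only builds the command line; the paths are placeholders, and actually running it requires ffmpeg on PATH:

```python
def ffmpeg_preprocess_cmd(src, dst, fps=1, height=360):
    """Build an ffmpeg command that drops the frame rate to `fps` and
    scales frames to `height` px tall (width auto-computed, kept even
    via -2). Defaults mirror the settings reported in the post."""
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"fps={fps},scale=-2:{height}",
        dst,
    ]

print(ffmpeg_preprocess_cmd("clip.mp4", "clip_small.mp4"))
```

Run the result with `subprocess.run(cmd, check=True)` before sending the output file to the inference server; per the post, doing this yourself beats letting vLLM ingest the full-size video.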
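The instructor + Pydantic recommendation boils down to "parse, validate, retry on failure." Here is a stdlib-only stand-in for that loop — not instructor's actual API, just a minimal sketch of the behavior it automates, with a toy generator that fails once before producing valid JSON:

```python
import json

def call_with_retries(generate, validate, max_retries=3):
    """Minimal stand-in for instructor's retry loop: call the model,
    parse the JSON, validate it, and retry if either step fails."""
    last_err = None
    for _ in range(max_retries):
        raw = generate()
        try:
            return validate(json.loads(raw))
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err  # malformed output; ask the model again
    raise RuntimeError(f"no valid output after {max_retries} tries: {last_err}")

def check(obj):
    """Schema check standing in for a Pydantic model."""
    if not isinstance(obj.get("events"), list):
        raise ValueError("events must be a list")
    return obj

# Toy model: emits malformed JSON first (as the post says the 4B does),
# then a valid payload on the retry.
attempts = iter(['{"events": [}', '{"events": ["cat enters frame"]}'])
print(call_with_retries(lambda: next(attempts), check))
# → {'events': ['cat enters frame']}
```

With the real libraries, `instructor` patches an OpenAI-compatible client so `response_model=YourPydanticModel` plus `max_retries` handles all of this per request.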
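On the client side, the concurrency result just means firing several requests at once and letting vLLM's continuous batching absorb them. A toy asyncio sketch, with a sleep standing in for the real OpenAI-compatible API call (the throughput figures themselves come from the post's benchmarks, not this snippet):

```python
import asyncio
import time

async def fake_request(i, latency_s=0.05):
    """Stand-in for one chat-completions request to the vLLM server."""
    await asyncio.sleep(latency_s)
    return i

async def run_concurrent(n):
    # Issue all n requests at once; the server batches them internally.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))

t0 = time.perf_counter()
results = asyncio.run(run_concurrent(10))
elapsed = time.perf_counter() - t0
# 10 concurrent fake requests finish in ~one request's latency,
# not 10x, which is the effect the post measures on real hardware.
print(results, round(elapsed, 3))
```

Swap `fake_request` for an `AsyncOpenAI` client call pointed at the vLLM endpoint to reproduce the author's 2- and 10-way concurrency tests.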
Related Articles

Black Hat USA
AI Business

Black Hat Asia
AI Business

Z.ai Launches GLM-5V-Turbo: A Native Multimodal Vision Coding Model Optimized for OpenClaw and High-Capacity Agentic Engineering Workflows Everywhere
MarkTechPost

How I Started Using AI Agents for End-to-End Testing (Autonoma AI)
Dev.to

How We Built an AI Coach That Understands PTSD — And Why It Matters
Dev.to