Qwen 3.6 for Claude Code in 1L

Reddit r/LocalLLaMA / 4/17/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A Reddit user reports running Qwen 3.6 (35B A3B, Q4_K_M quantized) locally as the backend for Claude Code on a constrained setup (RTX 2000 Ada, 16 GB VRAM), sharing practical performance observations.
  • They mention a hardware thermal fix (a 3D-printed fan hanger) because the system “gets hot” during use.
  • The user credits a specific change/PR to llama.cpp (linked in the post) as necessary to make the integration work well with Claude Code.
  • They highlight that caching prompt prefixes (enabled by that llama.cpp change) significantly improves throughput and overall responsiveness for agentic tools like Claude Code.
  • Reported figures are roughly 400 t/s prompt processing and ~24 t/s generation, showing that the workflow can be made efficient even on modest local hardware (see the back-of-envelope arithmetic after this list).
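To put those throughput numbers in context, here is a rough back-of-envelope estimate. The speeds are the ones reported in the post; the token counts are hypothetical examples, not measurements. At ~400 t/s prompt processing, re-reading a long agentic context on every turn dominates latency, which is why reusing a cached prompt prefix matters so much.

```python
# Back-of-envelope latency estimate for one agentic coding turn.
# Throughput figures come from the post; token counts below are
# hypothetical examples chosen for illustration only.

PROMPT_TPS = 400   # reported prompt-processing speed (tokens/s)
GEN_TPS = 24       # reported generation speed (tokens/s)

def turn_latency(context_tokens: int, new_tokens: int, gen_tokens: int,
                 prefix_cached: bool) -> float:
    """Seconds for one turn: prompt processing + generation."""
    # With prefix caching, only newly appended tokens are processed;
    # without it, the whole accumulated context is re-processed each turn.
    prompt_tokens = new_tokens if prefix_cached else context_tokens + new_tokens
    return prompt_tokens / PROMPT_TPS + gen_tokens / GEN_TPS

# Example: 20k tokens of accumulated context, 500 new tokens, 300-token reply.
print(f"no cache:   {turn_latency(20_000, 500, 300, prefix_cached=False):.1f} s")
print(f"with cache: {turn_latency(20_000, 500, 300, prefix_cached=True):.1f} s")
```

In this example a turn drops from roughly 64 s to under 14 s once the shared prefix is cached, which is consistent with the poster's point that the caching change is what makes the workflow feel responsive.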

https://preview.redd.it/a96i13zyemvg1.png?width=374&format=png&auto=webp&s=d1850127462849eab4ff37a3e10159d092bcc994

I use a P3 Tiny Gen 2 with an RTX 2000 Ada (16 GB VRAM). It gets hot, so I modeled and printed a fan hanger to keep it cool. It's dumb, but it feels like Claude Code, just unlimited.
I did have to use the change in this PR to make llama.cpp work well with cc, though: https://github.com/ggml-org/llama.cpp/pull/21793/
Qwen 3.6 35B A3B, Q4_K_M Unsloth quant: 400 t/s prompt processing, 24 t/s generation. With the change to let prompt prefixes cache, I'm amazed at what these newfangled tools can generate. Have a great day folks, I just wanted to share my experience with someone <3
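For anyone reproducing a setup like this, the sketch below is one way to sanity-check that a local llama.cpp server is up before pointing an agentic client at it. It is not from the post: the host, port, and model name are assumptions, and it uses llama-server's OpenAI-compatible chat endpoint rather than whatever integration path the linked PR enables.

```python
# Minimal smoke test for a locally running llama-server instance.
# Assumes the server is listening on localhost:8080 and exposes the
# OpenAI-compatible /v1/chat/completions endpoint; adjust as needed.
import json
import urllib.request

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed host/port

payload = {
    "model": "qwen",  # placeholder; llama-server serves whichever model it loaded
    "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
    "max_tokens": 8,
    "temperature": 0,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req, timeout=60) as resp:
    body = json.load(resp)

# Print the assistant's reply to confirm the server is generating text.
print(body["choices"][0]["message"]["content"])
```

If this returns text, the server side is working, and any remaining friction is on the Claude Code integration side that the linked PR is about.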

submitted by /u/brickinthefloor