Running Qwen3.5-27B locally as the primary model in OpenCode

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A tester ran the Qwen3.5-27B LLM locally as the primary model for an agentic coding assistant (OpenCode/Codex-style workflow) to evaluate real coding and tool-calling performance.
  • On an NVIDIA RTX 4090 (24GB) using llama.cpp with a 4-bit quantized, 64K context setup, they reported about ~2,400 tok/s prefill and ~40 tok/s generation while using OpenCode over Tailscale from a MacBook.
  • The model performed surprisingly well for agentic tasks such as writing multiple Python scripts, making edits, debugging, testing, and executing code with correct tool calling.
  • Performance improved further when adding agent skills and using Context7 as an MCP server to pull up-to-date documentation, but it was not ideal for “vibe coding” with loose prompts.
  • The author emphasizes that achieving good agent behavior requires careful decisions around quantization, model/chat templates for tool calling, context size, and KV cache settings, and they published a step-by-step blog with practical gotchas.

This weekend I wanted to test how well a local LLM can work as the primary model for an agentic coding assistant like OpenCode or OpenAI Codex. I picked Qwen3.5-27B, a hybrid-architecture model that has been getting a lot of attention lately for its performance relative to its size, and ran it locally with OpenCode to see how far it could go.

I set it up on my NVIDIA RTX 4090 (24GB) workstation, serving the model via llama.cpp and using it from OpenCode on my MacBook (connected over Tailscale).

Setup:

  • RTX 4090 workstation running llama.cpp
  • OpenCode on my MacBook
  • 4-bit quantized model, 64K context size, ~22GB VRAM usage
  • ~2,400 tok/s prefill, ~40 tok/s generation
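For reference, a minimal llama-server invocation for this setup looks roughly like the following. The GGUF filename and quant variant (Q4_K_M) are placeholders, not my exact command; adjust to the file you downloaded and your llama.cpp build.

```shell
# Serve the model over llama.cpp's OpenAI-compatible HTTP API.
# -c 65536       : 64K context window
# -ngl 99        : offload all layers to the GPU (the 4090 here)
# --jinja        : apply the model's own chat template, needed for reliable tool calls
# --host 0.0.0.0 : listen on all interfaces so the MacBook can reach it over Tailscale
./llama-server \
  -m ./models/Qwen3.5-27B-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  --jinja \
  --host 0.0.0.0 \
  --port 8080
```

With this running, OpenCode just needs to be pointed at the workstation's `http://<host>:8080/v1` endpoint as an OpenAI-compatible provider.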

Based on my testing:

  • It works surprisingly well and makes correct tool calls for tasks like writing multiple Python scripts, making edits, debugging, testing, and executing code.
  • The performance improved noticeably when I used it with agent skills and added Context7 as an MCP server to fetch up-to-date documentation.
  • That said, this is definitely not the best setup for vibe coding with crude prompts and loose context. There, GPT-5.4 and Opus/Sonnet are naturally way ahead.
  • However, if you are willing to plan properly and provide the right context, it performs well.
  • It is much easier to set up with OpenCode than with Codex.
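If you want to sanity-check the endpoint from the MacBook before wiring up OpenCode, a plain OpenAI-style request over Tailscale is enough. The hostname below is a placeholder for your workstation's Tailscale name or IP:

```shell
# Quick end-to-end check of the OpenAI-compatible endpoint served by llama.cpp.
# Replace the hostname with your own machine's Tailscale name or IP.
curl -s http://my-workstation.tailnet:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Reply with OK."}]
      }'
```

If this returns a normal chat completion, any failures inside OpenCode are down to the agent configuration rather than the server.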

I would say setting up the whole workflow was a great learning experience in itself. It is one thing to use a local model as a chat assistant and another to use it with an agentic coding assistant, especially getting tool calling and correct agentic behavior working. You have to make a lot of decisions: the right quantization for your hardware, the best model in its size class, the correct chat template for tool calling, and the right context size and KV cache settings.
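On the quantization decision, a back-of-envelope check is enough to see why 4-bit is the sweet spot for a 27B model on 24 GB. The ~4.5 bits/weight figure is an assumption (a rough average for Q4_K_M-style quants); exact numbers vary by quant type:

```shell
# Rough VRAM estimate for the quantized weights alone (integer GB).
# Assumption: ~4.5 bits per weight on average for a Q4_K_M-style quant.
bits_per_weight_x10=45
weights_gb=$(( 27 * bits_per_weight_x10 / 10 / 8 ))
echo "approx weights: ${weights_gb} GB"
```

That leaves roughly 7-9 GB of the 4090's 24 GB for the 64K KV cache and compute buffers, which lines up with the ~22 GB total usage I saw.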

I also wrote a detailed blog post covering the full setup, step by step, along with all the gotchas and practical tips I learned.

Happy to answer any questions about the setup.

Blogpost: https://aayushgarg.dev/posts/2026-03-29-local-llm-opencode/

submitted by /u/garg-aayush