Running Qwen3.5-27B locally as the primary model in OpenCode

Reddit r/LocalLLaMA / 3/30/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A tester ran the Qwen3.5-27B LLM locally as the primary model for an agentic coding assistant (OpenCode/Codex-style workflow) to evaluate real coding and tool-calling performance.
  • On an NVIDIA RTX 4090 (24GB) using llama.cpp with a 4-bit quantized, 64K context setup, they reported about ~2,400 tok/s prefill and ~40 tok/s generation while using OpenCode over Tailscale from a MacBook.
  • The model performed surprisingly well for agentic tasks such as writing multiple Python scripts, making edits, debugging, testing, and executing code with correct tool calling.
  • Performance improved further when adding agent skills and using Context7 as an MCP server to pull up-to-date documentation, but it was not ideal for “vibe coding” with loose prompts.
  • The author emphasizes that achieving good agent behavior requires careful decisions around quantization, model/chat templates for tool calling, context size, and KV cache settings, and they published a step-by-step blog with practical gotchas.

This weekend I wanted to test how well a local LLM can work as the primary model for an agentic coding assistant like OpenCode or OpenAI Codex. I picked Qwen3.5-27B, a hybrid-architecture model that has been getting a lot of attention lately for its performance relative to its size, and ran it locally with OpenCode to see how far it could go.

I set it up on my NVIDIA RTX 4090 (24GB) workstation, serving the model via llama.cpp and using it from OpenCode on my MacBook (connected over Tailscale).

Setup:

  • RTX 4090 workstation running llama.cpp
  • OpenCode on my MacBook
  • 4-bit quantized model, 64K context size, ~22GB VRAM usage
  • ~2,400 tok/s prefill, ~40 tok/s generation
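For reference, a minimal llama-server invocation for this setup looks roughly like the following. The GGUF filename and quant variant (Q4_K_M) are placeholders, not my exact command; adjust to the file you downloaded and your llama.cpp build.

```shell
# Serve the model over llama.cpp's OpenAI-compatible HTTP API.
# -c 65536       : 64K context window
# -ngl 99        : offload all layers to the GPU (the 4090 here)
# --jinja        : apply the model's own chat template, needed for reliable tool calls
# --host 0.0.0.0 : listen on all interfaces so the MacBook can reach it over Tailscale
./llama-server \
  -m ./models/Qwen3.5-27B-Q4_K_M.gguf \
  -c 65536 \
  -ngl 99 \
  --jinja \
  --host 0.0.0.0 \
  --port 8080
```

With this running, OpenCode just needs to be pointed at the workstation's `http://<host>:8080/v1` endpoint as an OpenAI-compatible provider.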

Based on my testing:

  • It works surprisingly well and makes correct tool calls for tasks like writing multiple Python scripts, making edits, debugging, testing, and executing code.
  • The performance improved noticeably when I used it with agent skills and added Context7 as an MCP server to fetch up-to-date documentation.
  • That said, this is definitely not the best setup for vibe coding with crude prompts and loose context. There, GPT-5.4 and Opus/Sonnet are naturally way ahead.
  • However, if you are willing to plan properly and provide the right context, it performs well.
  • It is much easier to set up with OpenCode than with Codex.
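If you want to sanity-check the endpoint from the MacBook before wiring up OpenCode, a plain OpenAI-style request over Tailscale is enough. The hostname below is a placeholder for your workstation's Tailscale name or IP:

```shell
# Quick end-to-end check of the OpenAI-compatible endpoint served by llama.cpp.
# Replace the hostname with your own machine's Tailscale name or IP.
curl -s http://my-workstation.tailnet:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Reply with OK."}]
      }'
```

If this returns a normal chat completion, any failures inside OpenCode are down to the agent configuration rather than the server.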

I would say setting up the whole workflow was a great learning experience in itself. It is one thing to use a local model as a chat assistant and another to use it with an agentic coding assistant, especially getting tool calling and correct agentic behavior working. You have to make a lot of decisions: the right quantization for your hardware, the best model in its size class, the correct chat template for tool calling, and the right context size and KV cache settings.
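On the quantization decision, a back-of-envelope check is enough to see why 4-bit is the sweet spot for a 27B model on 24 GB. The ~4.5 bits/weight figure is an assumption (a rough average for Q4_K_M-style quants); exact numbers vary by quant type:

```shell
# Rough VRAM estimate for the quantized weights alone (integer GB).
# Assumption: ~4.5 bits per weight on average for a Q4_K_M-style quant.
bits_per_weight_x10=45
weights_gb=$(( 27 * bits_per_weight_x10 / 10 / 8 ))
echo "approx weights: ${weights_gb} GB"
```

That leaves roughly 7-9 GB of the 4090's 24 GB for the 64K KV cache and compute buffers, which lines up with the ~22 GB total usage I saw.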

I also wrote a detailed blog post covering the full setup, step by step, along with all the gotchas and practical tips I learned.

Happy to answer any questions about the setup.

Blogpost: https://aayushgarg.dev/posts/2026-03-29-local-llm-opencode/

submitted by /u/garg-aayush