Running Qwen3.6-35B-A3B Locally for a Coding Agent: My Setup & Working Config

Reddit r/LocalLLaMA / 4/22/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post details a local setup for running the Qwen3.6-35B-A3B model on a MacBook Pro (Apple M2 Max) with llama.cpp as the backend.
  • It describes how the pi coding agent connects to a local llama-server via an OpenAI-compatible API, including the specific configuration in ~/.pi/agent/models.json.
  • The author provides the exact llama-server startup command and explains the key parameters for context length (128K), output length (32K), and sampling controls (temperature, top-p, top-k, and repeat/presence penalties).
  • The setup relies on a Hugging Face GGUF quantization (UD-Q5_K_XL) to balance quality against disk/memory footprint (about 19 GB), making local coding-agent use workable.
  • Overall, it serves as a practical "working config" reference for developers who want to run an LLM-driven coding agent against locally hosted Qwen models.

Hardware

| Component | Details |
|-----------|---------|
| Machine   | MacBook Pro (Mac14,6) |
| Chip      | Apple M2 Max — 12-core CPU (8P + 4E) |
| Memory    | 64 GB unified memory |
| Storage   | 512 GB SSD |
| OS        | macOS 15.7 (Sequoia) |

AI Agent Setup

I'm using the pi coding agent as my primary development assistant. It's a local-first AI coding agent that connects to local models via llama.cpp.

Model: Qwen3.6-35B-A3B (running via llama.cpp)

How pi Connects to llama-server

The pi agent communicates with llama-server via the OpenAI-compatible API. Configuration lives in ~/.pi/agent/models.json:

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "ignored",
      "models": [
        {
          "id": "Qwen3.6-35B-A3B",
          "contextWindow": 131072,
          "maxTokens": 32768
        }
      ]
    }
  }
}
```
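As a quick sanity check that the file parses, the same structure can be loaded and inspected with stock Python. This is plain JSON handling, nothing pi-specific; the snippet inlines the config above rather than reading ~/.pi/agent/models.json:

```python
import json

# Inlined copy of the provider config from ~/.pi/agent/models.json
CONFIG = """
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://127.0.0.1:8080/v1",
      "api": "openai-completions",
      "apiKey": "ignored",
      "models": [
        {"id": "Qwen3.6-35B-A3B", "contextWindow": 131072, "maxTokens": 32768}
      ]
    }
  }
}
"""

config = json.loads(CONFIG)  # raises json.JSONDecodeError if malformed
model = config["providers"]["llama-cpp"]["models"][0]
print(model["id"], model["contextWindow"], model["maxTokens"])
```

Note that 131072 = 128 × 1024 and 32768 = 32 × 1024, matching the `-c` and `-n` values passed to llama-server.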

The Command

```bash
llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q5_K_XL \
  -c 131072 \
  -n 32768 \
  --no-context-shift \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --repeat-penalty 1.00 \
  --presence-penalty 0.00 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --batch-size 4096 \
  --ubatch-size 4096
```

Parameter Breakdown

| Flag | Value | Why |
|------|-------|-----|
| `-hf` | `unsloth/...:UD-Q5_K_XL` | Hugging Face model repo with unsloth's custom UD quantization — good quality/size tradeoff (~19 GB) |
| `-c` | 131072 (128K context) | This model supports a massive context window — set it high for long documents or extended conversations |
| `-n` | 32768 (32K output tokens) | Allows long single-turn generations without hitting the generation limit |
| `--no-context-shift` | Off | Prevents context shifting during generation — keeps long responses coherent |
| `--chat-template-kwargs` | `preserve_thinking: true` | Keeps the model's reasoning/thinking blocks intact in the output |
| `--batch-size` | 4096 | Logical batch size — higher = faster prompt processing, needs more memory |
| `--ubatch-size` | 4096 | Physical batch size — kept equal to the logical batch for consistency |

Sampling Parameters

The sampling parameters (--temp, --top-p, --top-k, --repeat-penalty, --presence-penalty) are taken directly from unsloth's recommended config for Qwen3.6. I use these as-is since they're the official recommendations from the model's creators and produce good results out of the box.
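If you want to hit llama-server from a script with these same settings (outside pi), the values map directly onto the OpenAI-style request body. A minimal sketch of such a payload, assuming the server started with the command above is listening on 127.0.0.1:8080; note that `top_k` and `repeat_penalty` are llama-server extensions rather than core OpenAI fields:

```python
# Sketch of an OpenAI-compatible chat request carrying the same sampling
# settings as the llama-server flags above. Send it with any HTTP client,
# e.g. requests.post("http://127.0.0.1:8080/v1/chat/completions", json=payload)
payload = {
    "model": "Qwen3.6-35B-A3B",
    "messages": [
        {"role": "user", "content": "Write a binary search in Python."}
    ],
    "temperature": 0.6,        # --temp 0.6
    "top_p": 0.95,             # --top-p 0.95
    "top_k": 20,               # --top-k 20 (llama-server extension)
    "repeat_penalty": 1.00,    # --repeat-penalty 1.00 (llama-server extension)
    "presence_penalty": 0.00,  # --presence-penalty 0.00
    "max_tokens": 32768,       # matches -n 32768
}
```

Pinning the values per-request like this keeps behavior consistent across clients, since llama-server generally lets request fields override the command-line defaults.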

submitted by /u/NoConcert8847