After long research into the best alternative to "Using a local LLM in OpenCode with llama.cpp" (I wanted a totally local environment for coding tasks), I found the article "How to connect Claude Code CLI to a local llama.cpp server", which covers how to disable telemetry and make Claude Code totally offline.
- Model: Qwen3.5 27B
- Quant: unsloth/UD-Q4_K_XL
- Inference engine: llama.cpp
- OS: Arch Linux
- Hardware: Strix Halo
I have split my setup into sessions, running an iterative cycle in which I improved Claude Code (CC) and llama.cpp model parameters.
First Session
As the guide stated, I used option 1 to disable telemetry.
~/.bashrc config:

```
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_API_KEY="not-set"
export ANTHROPIC_AUTH_TOKEN="not-set"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export DISABLE_AUTOUPDATER=1
export DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768
```

Spoiler: it's better to use ~/.claude/settings.json, it is more stable and controllable.
and in ~/.claude.json:

```
"hasCompletedOnboarding": true
```

llama.cpp config:
```
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-Q4_K_M.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 \
  --ctx-size 65536 \
  --n-gpu-layers 999 \
  --flash-attn on --jinja --threads 8 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

I am using Strix Halo, so I need to set ROCBLAS_USE_HIPBLASLT=1.
Research your concrete hardware to specialize the llama.cpp setup; everything else should be the same.
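Before pointing Claude Code at the local endpoint, it's worth confirming the server is actually up. A minimal sketch (assuming llama-server's `/health` endpoint, which returns 200 once the model is loaded):

```python
# Check that llama-server is reachable before launching Claude Code.
# Base URL matches the --port 8001 setup above.
import urllib.error
import urllib.request


def server_ready(base_url: str = "http://127.0.0.1:8001", timeout: float = 2.0) -> bool:
    """Return True if the llama.cpp server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("server up:", server_ready())
```

If this prints `False`, fix the server first; otherwise CC just errors out with opaque API failures.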
Results for 7 Runs:
| Run | Task Type | Duration | Gen Speed | Peak Context | Quality | Key Finding |
|---|---|---|---|---|---|---|
| 1 | File ops (ls, cat) | 1m44s | 9.71 t/s | 23K | Correct | Baseline: fast at low context |
| 2 | Git clone + code read | 2m31s | 9.56 t/s | 32.5K | Excellent | Tool chaining works well |
| 3 | 7-day plan + guide | 4m57s | 8.37 t/s | 37.9K | Excellent | Long-form generation quality |
| 4 | Skills assessment | 4m36s | 8.46 t/s | 40K | Very good | Web search broken (needs Anthropic) |
| 5 | Write Python script | 10m25s | 7.54 t/s | 60.4K | Good (7/10) | |
| 6 | Code review + fix | 9m29s | 7.42 t/s | 65,535 CRASH | Very good (8.5/10) | Context wall hit, no auto-compact |
| 7 | /compact command | ~10m | ~8.07 t/s | 66,680 (failed) | N/A | Output token limit too low for compaction |
Lessons
- Generation speed degrades ~24% across the context range: 9.71 t/s (23K) down to 7.42 t/s (65K).
- Claude Code's system prompt = 22,870 tokens (35% of the 65K budget).
- Auto-compaction was completely broken: Claude Code assumed a 200K context, so the 95% threshold = 190K. The 65K limit was hit at 33% of what Claude Code thought the window was.
- `/compact` needs output headroom: at 4096 max output tokens, the compaction summary can't fit. It needs 16K+.
- Web search is dead without Anthropic (Run 4): my solution is SearXNG via MCP, or if someone has a better solution, please suggest.
- LCP prefix caching works great: `sim_best = 0.980` means the system prompt is cached across turns.
- Code quality is solid, but instructions need precision: I plan to add a second reviewer agent to suggest fixes.
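The budget numbers in these lessons are easy to sanity-check. A quick back-of-envelope script, with the speeds and token counts copied from the runs above:

```python
# Sanity-check the context-budget numbers from the first session.
CTX = 65_536          # llama.cpp --ctx-size
SYS_PROMPT = 22_870   # measured Claude Code system prompt tokens
ASSUMED = 200_000     # context window Claude Code assumes by default

# Generation speed drop from 23K context to 65K context
drop = (9.71 - 7.42) / 9.71
print(f"speed degradation: {drop:.0%}")                  # -> 24%

# Share of the real window eaten by the system prompt
print(f"system prompt share: {SYS_PROMPT / CTX:.0%}")    # -> 35%

# Auto-compact fires at 95% of the *assumed* window (190,000 tokens),
# but the real server dies at 65,536, i.e. ~33% of the assumed window.
threshold = 0.95 * ASSUMED
print(f"crash point: {CTX / ASSUMED:.0%} of assumed window")  # -> 33%
print(f"threshold reachable: {threshold < CTX}")              # -> False
```

So the compaction threshold is simply unreachable on a 65K server, which is exactly why the Run 6 crash happened with no auto-compact.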
VRAM Consumed - 22GB
RAM Consumed (by CC) - 7GB (CC is super heavy)
Second Session
~/.claude/settings.json config:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
    "ANTHROPIC_MODEL": "qwen3.5-27b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3.5-27b",
    "ANTHROPIC_API_KEY": "sk-no-key-required",
    "ANTHROPIC_AUTH_TOKEN": "",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "DISABLE_COST_WARNINGS": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "32768",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "65536",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "90",
    "DISABLE_PROMPT_CACHING": "1",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "0",
    "CLAUDE_CODE_DISABLE_FAST_MODE": "1",
    "DISABLE_INTERLEAVED_THINKING": "1",
    "CLAUDE_CODE_MAX_RETRIES": "3",
    "CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",
    "DISABLE_TELEMETRY": "1",
    "CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY": "1",
    "ENABLE_TOOL_SEARCH": "auto",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_ERROR_REPORTING": "1",
    "DISABLE_FEEDBACK_COMMAND": "1"
  }
}
```

llama.cpp run:
```
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 \
  --ctx-size 65536 \
  --n-gpu-layers 999 \
  --flash-attn on \
  --jinja \
  --threads 8 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Then launch:

```
claude --model qwen3.5-27b --verbose
```
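Since settings.json turned out to be the more stable place for these flags, here is a small hedged sketch for applying env overrides without wiping settings you already have (the `~/.claude/settings.json` path is an assumption from my setup, adjust for yours):

```python
# Merge env overrides into Claude Code's settings.json without clobbering
# existing keys. The settings path is an assumption; adjust to your setup.
import json
from pathlib import Path


def merge_env(settings_path: Path, overrides: dict) -> dict:
    """Load settings.json (or start fresh), update its "env" map, write back."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    settings.setdefault("env", {}).update(overrides)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


# Example usage (writes to your real config, so try a copy first):
# merge_env(Path.home() / ".claude" / "settings.json",
#           {"ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
#            "ANTHROPIC_MODEL": "qwen3.5-27b"})
```

This way, flipping one flag between sessions is a two-line change instead of re-pasting the whole blob.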
VRAM Consumed - 22GB
RAM Consumed (by CC) - 7GB
Resource usage didn't change, but all the errors from the first session were fixed :)
Third Session (Vision)
To turn on vision for Qwen, you need to load the mmproj file, which is included with the GGUF.
setup:
```
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 \
  --ctx-size 65536 \
  --n-gpu-layers 999 \
  --flash-attn on \
  --jinja \
  --threads 8 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --mmproj models/Qwen3.5-27B-GGUF/mmproj-F32.gguf
```

and it only added 1-2 GB of RAM usage.
I tested with 8 images, and the vision quality was a WOW to me.
If you look at the Artificial Analysis Vision Benchmark, Qwen is on Claude 4.6 Opus level, which makes it superior for vision tasks.
My tests showed that it understands image context and handwritten diagrams really well.
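For reference, the same image tests can be driven directly against llama.cpp's OpenAI-compatible endpoint. A hedged sketch, assuming the model alias and port from my server command and a PNG input; with `--mmproj` loaded, the server accepts OpenAI-style `image_url` content parts with base64 data URLs:

```python
# Build a chat-completions request carrying one image plus a text prompt
# for llama.cpp's OpenAI-compatible /v1/chat/completions endpoint.
import base64
import json
import urllib.request


def build_vision_request(image_path: str, prompt: str,
                         base_url: str = "http://127.0.0.1:8001",
                         model: str = "qwen3.5-27b") -> urllib.request.Request:
    """Encode the image as a base64 data URL and wrap it in an OpenAI-style payload."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


# Send with: urllib.request.urlopen(build_vision_request("diagram.png", "Describe this"))
```

Handy for benchmarking vision quality in isolation, without Claude Code's system prompt eating into the context.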
Verdict
- The system prompt is too big and takes too much time to load, but only the first time; after that, caching does everything for you.
- CC is worth using with local models, and local models nowadays are good for coding tasks. I found it to be the most "offline" coding agent CLI compared to OpenCode; why should I use a less "performant" alternative when I can use SOTA :)
Future Experiments:
- I want to use a bigger Mixture of Experts model from the Qwen3.5 family, but will it give me 2x better performance for 2x the size?
- I want to try CC with the Zed editor and check how an offline Zed behaves with a local CC.
- How long will compaction hold the agent's reasoning, and how will quality degrade? With Codex or CC I had 10M-context chats with decent quality relative to their size.




