After long research into the best alternative to "Using a local LLM in OpenCode with llama.cpp" (I wanted a totally local environment for coding tasks), I found the article "How to connect Claude Code CLI to a local llama.cpp server", which covers how to disable telemetry and make Claude Code totally offline.
- Model: Qwen3.5 27B
- Quant: unsloth/UD-Q4_K_XL
- Inference engine: llama.cpp
- OS: Arch Linux
- Hardware: Strix Halo
I have split my setup into sessions, running an iterative cycle in which I improved Claude Code (CC) and llama.cpp model parameters.
First Session
As the guide stated, I used option 1 to disable telemetry.
~/.bashrc config:

```
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_API_KEY="not-set"
export ANTHROPIC_AUTH_TOKEN="not-set"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_ENABLE_TELEMETRY=0
export DISABLE_AUTOUPDATER=1
export DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=4096
export CLAUDE_CODE_AUTO_COMPACT_WINDOW=32768
```

Spoiler: it's better to use ~/.claude/settings.json, it is more stable and controllable.
and in ~/.claude.json:

```
"hasCompletedOnboarding": true
```

llama.cpp config:
```
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-Q4_K_M.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 \
  --ctx-size 65536 \
  --n-gpu-layers 999 \
  --flash-attn on --jinja --threads 8 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 \
  --cache-type-k q8_0 --cache-type-v q8_0
```

I am using Strix Halo, so I need to set ROCBLAS_USE_HIPBLASLT=1.
Research your concrete hardware to specialize the llama.cpp setup; everything else should be the same.
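Before pointing Claude Code at the local endpoint, it's worth confirming the server is actually up. A minimal sketch (assuming llama-server's `/health` endpoint, which returns 200 once the model is loaded):

```python
# Check that llama-server is reachable before launching Claude Code.
# Base URL matches the --port 8001 setup above.
import urllib.error
import urllib.request


def server_ready(base_url: str = "http://127.0.0.1:8001", timeout: float = 2.0) -> bool:
    """Return True if the llama.cpp server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("server up:", server_ready())
```

If this prints `False`, fix the server first; otherwise CC just errors out with opaque API failures.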
Results for 7 Runs:
| Run | Task Type | Duration | Gen Speed | Peak Context | Quality | Key Finding |
|---|---|---|---|---|---|---|
| 1 | File ops (ls, cat) | 1m44s | 9.71 t/s | 23K | Correct | Baseline: fast at low context |
| 2 | Git clone + code read | 2m31s | 9.56 t/s | 32.5K | Excellent | Tool chaining works well |
| 3 | 7-day plan + guide | 4m57s | 8.37 t/s | 37.9K | Excellent | Long-form generation quality |
| 4 | Skills assessment | 4m36s | 8.46 t/s | 40K | Very good | Web search broken (needs Anthropic) |
| 5 | Write Python script | 10m25s | 7.54 t/s | 60.4K | Good (7/10) | |
| 6 | Code review + fix | 9m29s | 7.42 t/s | 65,535 CRASH | Very good (8.5/10) | Context wall hit, no auto-compact |
| 7 | /compact command | ~10m | ~8.07 t/s | 66,680 (failed) | N/A | Output token limit too low for compaction |
Lessons
- Generation speed degrades ~24% across the context range: 9.71 t/s (23K) down to 7.42 t/s (65K).
- Claude Code's system prompt = 22,870 tokens (35% of the 65K budget).
- Auto-compaction was completely broken: Claude Code assumed a 200K context, so the 95% threshold = 190K. The 65K limit was hit at 33% of what Claude Code thought the window was.
- `/compact` needs output headroom: at 4096 max output tokens, the compaction summary can't fit. It needs 16K+.
- Web search is dead without Anthropic (Run 4): my solution is SearXNG via MCP, or if someone has a better solution, please suggest.
- LCP prefix caching works great: `sim_best = 0.980` means the system prompt is cached across turns.
- Code quality is solid, but instructions need precision: I plan to add a second reviewer agent to suggest fixes.
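The budget numbers in these lessons are easy to sanity-check. A quick back-of-envelope script, with the speeds and token counts copied from the runs above:

```python
# Sanity-check the context-budget numbers from the first session.
CTX = 65_536          # llama.cpp --ctx-size
SYS_PROMPT = 22_870   # measured Claude Code system prompt tokens
ASSUMED = 200_000     # context window Claude Code assumes by default

# Generation speed drop from 23K context to 65K context
drop = (9.71 - 7.42) / 9.71
print(f"speed degradation: {drop:.0%}")                  # -> 24%

# Share of the real window eaten by the system prompt
print(f"system prompt share: {SYS_PROMPT / CTX:.0%}")    # -> 35%

# Auto-compact fires at 95% of the *assumed* window (190,000 tokens),
# but the real server dies at 65,536, i.e. ~33% of the assumed window.
threshold = 0.95 * ASSUMED
print(f"crash point: {CTX / ASSUMED:.0%} of assumed window")  # -> 33%
print(f"threshold reachable: {threshold < CTX}")              # -> False
```

So the compaction threshold is simply unreachable on a 65K server, which is exactly why the Run 6 crash happened with no auto-compact.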
VRAM Consumed - 22GB
RAM Consumed (by CC) - 7GB (CC is super heavy)
Second Session
~/.claude/settings.json config:

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
    "ANTHROPIC_MODEL": "qwen3.5-27b",
    "ANTHROPIC_SMALL_FAST_MODEL": "qwen3.5-27b",
    "ANTHROPIC_API_KEY": "sk-no-key-required",
    "ANTHROPIC_AUTH_TOKEN": "",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "DISABLE_COST_WARNINGS": "1",
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
    "CLAUDE_CODE_DISABLE_1M_CONTEXT": "1",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "32768",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "65536",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "90",
    "DISABLE_PROMPT_CACHING": "1",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "MAX_THINKING_TOKENS": "0",
    "CLAUDE_CODE_DISABLE_FAST_MODE": "1",
    "DISABLE_INTERLEAVED_THINKING": "1",
    "CLAUDE_CODE_MAX_RETRIES": "3",
    "CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",
    "DISABLE_TELEMETRY": "1",
    "CLAUDE_CODE_MAX_TOOL_USE_CONCURRENCY": "1",
    "ENABLE_TOOL_SEARCH": "auto",
    "DISABLE_AUTOUPDATER": "1",
    "DISABLE_ERROR_REPORTING": "1",
    "DISABLE_FEEDBACK_COMMAND": "1"
  }
}
```

llama.cpp run:
```
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 \
  --ctx-size 65536 \
  --n-gpu-layers 999 \
  --flash-attn on \
  --jinja \
  --threads 8 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```

Then launch:

```
claude --model qwen3.5-27b --verbose
```
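Since settings.json turned out to be the more stable place for these flags, here is a small hedged sketch for applying env overrides without wiping settings you already have (the `~/.claude/settings.json` path is an assumption from my setup, adjust for yours):

```python
# Merge env overrides into Claude Code's settings.json without clobbering
# existing keys. The settings path is an assumption; adjust to your setup.
import json
from pathlib import Path


def merge_env(settings_path: Path, overrides: dict) -> dict:
    """Load settings.json (or start fresh), update its "env" map, write back."""
    settings = json.loads(settings_path.read_text()) if settings_path.exists() else {}
    settings.setdefault("env", {}).update(overrides)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


# Example usage (writes to your real config, so try a copy first):
# merge_env(Path.home() / ".claude" / "settings.json",
#           {"ANTHROPIC_BASE_URL": "http://127.0.0.1:8001",
#            "ANTHROPIC_MODEL": "qwen3.5-27b"})
```

This way, flipping one flag between sessions is a two-line change instead of re-pasting the whole blob.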
VRAM Consumed - 22GB
RAM Consumed (by CC) - 7GB
Resource usage didn't change, but all the errors from the first session were fixed :)
Third Session (Vision)
To turn on vision for Qwen, you need to load the mmproj file, which is included with the GGUF.
setup:
```
ROCBLAS_USE_HIPBLASLT=1 ./build/bin/llama-server \
  --model models/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  --alias "qwen3.5-27b" \
  --port 8001 \
  --ctx-size 65536 \
  --n-gpu-layers 999 \
  --flash-attn on \
  --jinja \
  --threads 8 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --mmproj models/Qwen3.5-27B-GGUF/mmproj-F32.gguf
```

and it only added 1-2 GB of RAM usage.
I tested with 8 images, and the vision quality was a WOW to me.
If you look at the Artificial Analysis Vision Benchmark, Qwen is on Claude 4.6 Opus level, which makes it superior for vision tasks.
My tests showed that it understands image context and handwritten diagrams really well.
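For reference, the same image tests can be driven directly against llama.cpp's OpenAI-compatible endpoint. A hedged sketch, assuming the model alias and port from my server command and a PNG input; with `--mmproj` loaded, the server accepts OpenAI-style `image_url` content parts with base64 data URLs:

```python
# Build a chat-completions request carrying one image plus a text prompt
# for llama.cpp's OpenAI-compatible /v1/chat/completions endpoint.
import base64
import json
import urllib.request


def build_vision_request(image_path: str, prompt: str,
                         base_url: str = "http://127.0.0.1:8001",
                         model: str = "qwen3.5-27b") -> urllib.request.Request:
    """Encode the image as a base64 data URL and wrap it in an OpenAI-style payload."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )


# Send with: urllib.request.urlopen(build_vision_request("diagram.png", "Describe this"))
```

Handy for benchmarking vision quality in isolation, without Claude Code's system prompt eating into the context.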
Verdict
- The system prompt is too big and takes too much time to load, but only the first time; after that, caching does everything for you.
- CC is worth using with local models, and local models nowadays are good for coding tasks. I found it to be the most "offline" coding agent CLI compared to OpenCode; why should I use a less "performant" alternative when I can use SOTA :)
Future Experiments:
- I want to use a bigger Mixture of Experts model from the Qwen3.5 family, but will it give me 2x better performance for 2x the size?
- I want to try CC with the Zed editor and check how an offline Zed behaves with a local CC.
- How long will compaction hold the agent's reasoning, and how will quality degrade? With Codex or CC I had 10M-context chats with decent quality relative to their size.




