Long post, but hopefully it helps somebody. This is a llama.cpp Vulkan server running on a single AMD R9700. The settings below are showing great results with a large ChatGPT-written prompt to generate a test website. I then ran a prompt to generate a full suite of Playwright tests.
I only had to nudge it once when creating the tests to tell it to fix one failing test at a time. The website was fully functional on first run.
I think I am done tweaking and testing models (until the next big release) and can get back to coding now...
llama-cpp | ========== LLAMA.CPP STARTUP COMMAND ==========

```
/app/llama-server -m /models/Qwen3.6-35B-A3B-UD-Q5_K_XL/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
  --ctx-size 262144 --threads 8 --threads-batch 8 --gpu-layers 99 --parallel 1 \
  --flash-attn on --batch-size 2048 --ubatch-size 1024 \
  --cache-type-k q8_0 --cache-type-v q8_0 --cache-ram 12000 --ctx-checkpoints 50 \
  --mmap --no-mmproj --kv-unified --reasoning off --reasoning-budget 0 --jinja \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
```

Settings for sampling come from https://huggingface.co/Qwen/Qwen3.6-35B-A3B under the "precise coding" section.
VS Code chatLanguageModels.json:

```json
{
  "name": "Sean Llama.cpp",
  "vendor": "customoai",
  "apiKey": "${input:chat.lm.secret.3c0c0f21}",
  "models": [
    {
      "id": "Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf",
      "name": "Qwen3.6-35B",
      "url": "https://llm.home.arpa/v1/chat/completions",
      "toolCalling": true,
      "vision": false,
      "maxInputTokens": 180000,
      "maxOutputTokens": 10000,
      "family": "Qwen3",
      "inputTokenCost": 0.0001,
      "outputTokenCost": 0.0001,
      "temperature": 0.6,
      "top_p": 0.95,
      "top_k": 20,
      "repeat_penalty": 1,
      "presence_penalty": 0,
      "frequency_penalty": 0,
      "systemMessage": "You are a precise coding assistant. Avoid repeating plans. Execute tasks directly. Do not restate intentions multiple times.",
      "timeout": 600000,
      "retry": { "enabled": true, "max_attempts": 2, "interval_ms": 1500 }
    }
  ]
}
```

ChatGPT-generated test prompt:
You are working in a clean Vite + React + TypeScript project. Your task is to build a medium-complexity local-first website called "Bike Shop Service Tracker." The app should help a small bike shop track incoming bike repair jobs. It should run entirely in the browser using localStorage. Do not use a backend, database, authentication, external API, router, Tailwind, shadcn, Redux, Zustand, or any complex setup.

Use only:
- React
- TypeScript
- plain CSS or CSS modules
- browser localStorage
- lucide-react only if already installed

Before implementing, inspect the existing project structure briefly. Then create a concise implementation plan. After the plan, implement the app directly. Do not get stuck repeating the plan.

Functional requirements:

1. Main layout
- Create a polished single-page dashboard.
- Header with app name: "Bike Shop Service Tracker."
- A summary area showing:
  - total open jobs
  - jobs due today
  - overdue jobs
  - completed jobs
- Main content split into:
  - job creation/edit form
  - job list and filters

2. Repair job data model
Each repair job should include:
- id
- customerName
- customerPhone
- bikeDescription
- serviceType
- priority: low, normal, high, urgent
- status: intake, waiting-parts, in-progress, ready, completed
- dueDate
- notes
- createdAt
- updatedAt

3. Seed data
- If localStorage is empty, create 6 realistic sample repair jobs.
- Include different statuses, priorities, due dates, and service types.
- Make at least one job overdue and one job due today.

4. Job form
- Allow creating a new job.
- Allow editing an existing job.
- Include basic validation:
  - customer name required
  - phone required
  - bike description required
  - service type required
  - due date required
- Show clear validation messages.
- Include buttons:
  - Save Job
  - Cancel Edit, when editing
  - Clear Form

5. Job list
- Display jobs as cards or table rows.
- Each job should show:
  - customer name
  - bike description
  - service type
  - priority
  - status
  - due date
  - notes preview
- Include actions:
  - edit
  - mark completed
  - delete

6. Filtering and sorting
Add controls for:
- search by customer name, phone, bike description, service type, or notes
- filter by status
- filter by priority
- sort by due date
- sort by priority
- sort by created date

7. UX behavior
- Persist all jobs to localStorage.
- Keep the UI responsive and usable on desktop and narrow screens.
- Use clear visual differences for priority and status.
- Completed jobs should look visually quieter than active jobs.
- Overdue jobs should be visually obvious.
- Due-today jobs should be highlighted differently from overdue jobs.
- Deleting a job should ask for confirmation.

8. Code organization
Create a reasonable file structure, for example:
- src/App.tsx
- src/App.css
- src/types.ts
- src/data/seedJobs.ts
- src/utils/storage.ts
- src/utils/jobFilters.ts
- src/components/JobForm.tsx
- src/components/JobList.tsx
- src/components/JobCard.tsx
- src/components/SummaryCards.tsx
- src/components/FilterBar.tsx
Use this structure only if it makes sense. Do not over-engineer.

9. Quality requirements
- TypeScript should compile cleanly.
- Avoid `any` unless there is a strong reason.
- Keep functions readable.
- Avoid huge components where possible.
- The app should run with: npm run dev

10. Final response
When finished, summarize:
- what files were created or changed
- how to run the app
- what features were implemented
- any limitations or follow-up improvements

Important behavior instructions:
- Do not only create todos.
- Do not repeatedly say "I need to implement this now."
- Do not repeatedly read the same files unless needed.
- Once you have enough context, start editing files.
- Prefer completing a working first version over designing a perfect architecture.
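For reference, the data model in section 2 maps naturally onto a TypeScript union-typed interface plus a couple of small helpers. A minimal sketch, assuming the field names from the prompt above; the helper names (`STORAGE_KEY`, `loadJobs`, `saveJobs`, `isOverdue`) and the ISO date convention are illustrative, not what the model actually generated:

```typescript
// Hypothetical src/types.ts + src/utils/storage.ts sketch.
export type Priority = "low" | "normal" | "high" | "urgent";
export type Status = "intake" | "waiting-parts" | "in-progress" | "ready" | "completed";

export interface RepairJob {
  id: string;
  customerName: string;
  customerPhone: string;
  bikeDescription: string;
  serviceType: string;
  priority: Priority;
  status: Status;
  dueDate: string;   // ISO date, e.g. "2025-06-01"
  notes: string;
  createdAt: string; // ISO timestamp
  updatedAt: string;
}

const STORAGE_KEY = "bike-shop-jobs"; // illustrative key name

export function loadJobs(): RepairJob[] {
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as RepairJob[]) : [];
}

export function saveJobs(jobs: RepairJob[]): void {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(jobs));
}

// ISO date strings compare correctly as plain strings, so "overdue" is a
// lexicographic comparison against today's date.
export function isOverdue(job: RepairJob, today: string): boolean {
  return job.status !== "completed" && job.dueDate < today;
}
```

Encoding dates as ISO strings keeps the localStorage round-trip trivial (no Date revival needed) and still sorts and compares correctly.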
I fed the llama.cpp logs back into local Qwen (a bit ironic), and it gave this summary of its own performance from a big Copilot/autopilot session that was very successful:
Prompt: analyze these llama.cpp logs and let me know how performance looks:
Based on the logs provided, here is a detailed performance analysis of your llama.cpp instance.
Executive Summary
- Overall Health: Good/Healthy. The server is functioning correctly, handling requests without crashing or hanging.
- Throughput: Moderate. You are seeing roughly 93–105 tokens per second (t/s) for generation. This is typical for mid-range consumer GPUs or efficient CPU inference, depending on the specific hardware and model (GGUF quantization) running.
- Prompt Processing: Variable. This is the biggest bottleneck. Some prompts are processed quickly via cache hits, while others force a full re-evaluation, leading to high latency spikes.
- Memory Usage: Efficient but Growing. VRAM/RAM usage is scaling with context length (up to ~723 MiB for 4 prompts), well within the 12GB limit.
Detailed Breakdown
1. Generation Speed (Eval Time)
This is the speed at which the model generates new text tokens.
- Performance: ~93–105 tokens per second.
- Log Evidence:
  - Task 460: 104.58 tokens per second
  - Task 776: 102.35 tokens per second
  - Task 2848: 101.73 tokens per second
  - Task 3079: 94.12 tokens per second
  - Task 4726: 93.38 tokens per second
- Analysis: This is consistent and stable. The slight drop in speed as context grows (from 104 to 93 t/s) is expected due to increased memory bandwidth pressure and attention calculation overhead as the context window fills up. This is a healthy slope.
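If you want to pull these figures out of the logs yourself rather than eyeballing them, a small extractor works; a sketch assuming llama-server's typical "… tokens per second)" timing lines (the exact log format may differ by version):

```typescript
// Extract every "N tokens per second" figure from llama-server log lines.
// The log line format is an assumption based on typical llama.cpp timing output.
const TPS_RE = /([\d.]+) tokens per second/;

export function extractTps(logLines: string[]): number[] {
  return logLines
    .map((line) => TPS_RE.exec(line))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => parseFloat(m[1]));
}
```

Feeding the whole log through this and plotting the series would make the gentle 104 → 93 t/s slope over the session obvious at a glance.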
2. Prompt Processing (Prefill Time)
This is the time taken to process the input prompt. This varies significantly in your logs.
- Scenario A: Cache Hit (Fast)
  - Task 2848: 856.58 ms for 2,098 tokens (~2,449 t/s).
  - Why: The log shows `restored context checkpoint`. The system found a previous state similar enough to resume from, avoiding full re-processing.
  - Performance: Excellent.
- Scenario B: Cache Miss / Full Re-eval (Slow)
  - Task 460: 68.01 ms for 27 tokens (fast because the prompt is tiny).
  - Task 776: 1944.26 ms for 5,783 tokens (~2,974 t/s).
  - Task 3079: 17307.06 ms for 38,012 tokens (~2,196 t/s).
  - Task 4726: 1153.57 ms for 1,652 tokens (~1,432 t/s).
  - Why: The log explicitly states `forcing full prompt re-processing due to lack of cache data`.
- Analysis:
  - Task 3079 is the most concerning. It took 17.3 seconds just to read the prompt before generating a single word. This is likely due to the prompt being very long (38k tokens) and a cache miss.
  - Task 4726 was slower than Task 2848 despite having a similar context size, likely because it didn't find a perfect cache match (`sim_best = 0.958` is good, but perhaps not enough to skip the entire prefill, or the checkpoint restoration had overhead).
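The per-task throughput numbers in both scenarios are just prompt tokens divided by prefill time; a one-liner to sanity-check them against the log figures:

```typescript
// Prefill throughput = prompt tokens / (prompt eval time in seconds), rounded.
export function prefillTps(tokens: number, ms: number): number {
  return Math.round(tokens / (ms / 1000));
}

// e.g. Task 3079: prefillTps(38012, 17307.06) gives roughly 2196 t/s,
// matching the ~2,196 t/s quoted in the analysis above.
```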
3. Context Management & Caching
- Cache Mechanism: You are using `n_ctx_slot = 262144` (256k context window) and `n_keep = 0`.
- Checkpointing: The system is actively creating checkpoints (`created context checkpoint X of 50`).
  - Each checkpoint is ~62.8 MiB.
  - Task 3079 created 6 checkpoints during processing.
- Issue Identified: The log `forcing full prompt re-processing` appears frequently when switching between very different context lengths or when the similarity threshold isn't met.
  - Task 776 (5,783 tokens) had `n_past = 23` from the previous slot. It checked checkpoints but failed to use them, forcing a full re-process. This suggests the previous context (Task 460, 27 tokens) was too different to be useful for Task 776.
  - Task 3079 (38,012 tokens) also failed to use previous checkpoints, likely because the previous context (Task 2848, ~7.8k tokens) was not similar enough or the position offsets didn't align well for the KV cache.
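The checkpoint numbers above imply a bounded memory cost: at ~62.8 MiB per checkpoint with a cap of 50 (`--ctx-checkpoints 50`), the worst case is about 3,140 MiB, comfortably inside the 12,000 MiB `--cache-ram` budget. A quick sanity check using the sizes reported in the logs:

```typescript
// Checkpoint memory budget, from the per-checkpoint size the log reported.
const CHECKPOINT_MIB = 62.8; // per-checkpoint size from the log
const MAX_CHECKPOINTS = 50;  // --ctx-checkpoints 50
const CACHE_RAM_MIB = 12000; // --cache-ram 12000

export function checkpointBudgetMib(count: number): number {
  return count * CHECKPOINT_MIB;
}

// Full cap: 50 * 62.8 = 3140 MiB, roughly a quarter of the cache-ram limit.
export const worstCaseFits = checkpointBudgetMib(MAX_CHECKPOINTS) < CACHE_RAM_MIB;
```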
4. Resource Utilization
- Memory:
- Total state size reached 723.154 MiB for 4 active prompts.
- Limit is 12,000 MiB.
- Status: Very low memory usage. You have plenty of headroom.
- Latency:
- Total response times vary from 3 seconds (short prompts) to 34.5 seconds (long prompt prefill + generation).
- The 34.5s delay is entirely due to the prompt processing of the 38k token input.
Recommendations for Improvement
- Optimize Context Window Usage:
  - If you are sending very long prompts (like the 38k token one), consider if you can truncate or summarize them. Processing 38k tokens takes ~17 seconds.
  - If these long prompts are frequent, consider increasing `--ctx-size` if you aren't already, but more importantly, ensure your application is reusing the same context slot for similar queries to maximize cache hits.
- Tune `--cache-reuse` (if applicable):
  - The logs show `sim_best` values. If you are seeing many "full prompt re-processing" events, you might experiment with changing the similarity threshold or cache reuse settings if your version of llama.cpp supports fine-tuning these. However, the default behavior is usually robust.
- Monitor "Slot" Reuse:
  - You are currently using `id 0` for all tasks (`slot get_availabl: id 0`). This is good for caching (keeping one long context alive), but it means requests are processed sequentially if they overlap. If you have many concurrent users, consider enabling multiple slots (`--parallel` or similar depending on your setup) to improve concurrency, though this will increase memory usage.
- Investigate the 38k Token Prompt:
  - Task 3079 is the outlier. Why is a prompt 38k tokens long?
  - If this is a chat history, consider implementing a "context compression" strategy in your application layer (e.g., summarizing old messages) before sending to the LLM.
  - If this is RAG (Retrieval Augmented Generation), ensure you are not retrieving more chunks than necessary.
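One simple version of that "context compression" idea is to keep the system message and drop the oldest turns until the estimated token count fits a budget. A rough sketch; the chars/4 token estimate is a crude approximation and every name here (`ChatMessage`, `trimHistory`, etc.) is illustrative, not part of any API in the post:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Very rough token estimate: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the system message(s) plus as many of the most recent messages as fit.
export function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let budget = maxTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break; // oldest messages beyond the budget are dropped
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```

Trimming client-side like this also plays well with the checkpoint cache above: the retained suffix of the conversation stays byte-identical between requests, which improves the odds of a cache hit.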
Conclusion
Your llama.cpp setup is performing normally. The generation speed is steady (~93–105 t/s). The main "performance hit" is the variable latency caused by prompt processing, especially for long contexts where cache misses force full re-evaluation. This is an algorithmic/usage-pattern issue, not a hardware bottleneck.
No errors or crashes detected. The system is stable.