Long post, but hopefully it helps somebody. This is a llama.cpp Vulkan server running on a single AMD R9700. The settings below are showing great results with a large ChatGPT-written prompt to generate a test website. I then ran a prompt to generate a full suite of Playwright tests.
I only had to nudge it once when creating the tests to tell it to fix one failing test at a time. The website was fully functional on first run.
I think I am done tweaking and testing models (until the next big release) and can get back to coding now...
llama-cpp | ========== LLAMA.CPP STARTUP COMMAND ==========

```
/app/llama-server -m /models/Qwen3.6-35B-A3B-UD-Q5_K_XL/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
  --ctx-size 262144 --threads 8 --threads-batch 8 --gpu-layers 99 --parallel 1 \
  --flash-attn on --batch-size 2048 --ubatch-size 1024 \
  --cache-type-k q8_0 --cache-type-v q8_0 --cache-ram 12000 --ctx-checkpoints 50 \
  --mmap --no-mmproj --kv-unified --reasoning off --reasoning-budget 0 --jinja \
  --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0 --repeat-penalty 1.0 --presence-penalty 0.0
```

Settings for sampling come from https://huggingface.co/Qwen/Qwen3.6-35B-A3B under the "precise coding" section.
VS Code chatLanguageModels.json:

```json
{
  "name": "Sean Llama.cpp",
  "vendor": "customoai",
  "apiKey": "${input:chat.lm.secret.3c0c0f21}",
  "models": [
    {
      "id": "Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf",
      "name": "Qwen3.6-35B",
      "url": "https://llm.home.arpa/v1/chat/completions",
      "toolCalling": true,
      "vision": false,
      "maxInputTokens": 180000,
      "maxOutputTokens": 10000,
      "family": "Qwen3",
      "inputTokenCost": 0.0001,
      "outputTokenCost": 0.0001,
      "temperature": 0.6,
      "top_p": 0.95,
      "top_k": 20,
      "repeat_penalty": 1,
      "presence_penalty": 0,
      "frequency_penalty": 0,
      "systemMessage": "You are a precise coding assistant. Avoid repeating plans. Execute tasks directly. Do not restate intentions multiple times.",
      "timeout": 600000,
      "retry": { "enabled": true, "max_attempts": 2, "interval_ms": 1500 }
    }
  ]
}
```

ChatGPT-generated test prompt:
You are working in a clean Vite + React + TypeScript project. Your task is to build a medium-complexity local-first website called "Bike Shop Service Tracker." The app should help a small bike shop track incoming bike repair jobs. It should run entirely in the browser using localStorage. Do not use a backend, database, authentication, external API, router, Tailwind, shadcn, Redux, Zustand, or any complex setup.

Use only:
- React
- TypeScript
- plain CSS or CSS modules
- browser localStorage
- lucide-react only if already installed

Before implementing, inspect the existing project structure briefly. Then create a concise implementation plan. After the plan, implement the app directly. Do not get stuck repeating the plan.

Functional requirements:

1. Main layout
- Create a polished single-page dashboard.
- Header with app name: "Bike Shop Service Tracker."
- A summary area showing:
  - total open jobs
  - jobs due today
  - overdue jobs
  - completed jobs
- Main content split into:
  - job creation/edit form
  - job list and filters

2. Repair job data model
Each repair job should include:
- id
- customerName
- customerPhone
- bikeDescription
- serviceType
- priority: low, normal, high, urgent
- status: intake, waiting-parts, in-progress, ready, completed
- dueDate
- notes
- createdAt
- updatedAt

3. Seed data
- If localStorage is empty, create 6 realistic sample repair jobs.
- Include different statuses, priorities, due dates, and service types.
- Make at least one job overdue and one job due today.

4. Job form
- Allow creating a new job.
- Allow editing an existing job.
- Include basic validation:
  - customer name required
  - phone required
  - bike description required
  - service type required
  - due date required
- Show clear validation messages.
- Include buttons:
  - Save Job
  - Cancel Edit, when editing
  - Clear Form

5. Job list
- Display jobs as cards or table rows.
- Each job should show:
  - customer name
  - bike description
  - service type
  - priority
  - status
  - due date
  - notes preview
- Include actions:
  - edit
  - mark completed
  - delete

6. Filtering and sorting
Add controls for:
- search by customer name, phone, bike description, service type, or notes
- filter by status
- filter by priority
- sort by due date
- sort by priority
- sort by created date

7. UX behavior
- Persist all jobs to localStorage.
- Keep the UI responsive and usable on desktop and narrow screens.
- Use clear visual differences for priority and status.
- Completed jobs should look visually quieter than active jobs.
- Overdue jobs should be visually obvious.
- Due-today jobs should be highlighted differently from overdue jobs.
- Deleting a job should ask for confirmation.

8. Code organization
Create a reasonable file structure, for example:
- src/App.tsx
- src/App.css
- src/types.ts
- src/data/seedJobs.ts
- src/utils/storage.ts
- src/utils/jobFilters.ts
- src/components/JobForm.tsx
- src/components/JobList.tsx
- src/components/JobCard.tsx
- src/components/SummaryCards.tsx
- src/components/FilterBar.tsx
Use this structure only if it makes sense. Do not over-engineer.

9. Quality requirements
- TypeScript should compile cleanly.
- Avoid `any` unless there is a strong reason.
- Keep functions readable.
- Avoid huge components where possible.
- The app should run with: npm run dev

10. Final response
When finished, summarize:
- what files were created or changed
- how to run the app
- what features were implemented
- any limitations or follow-up improvements

Important behavior instructions:
- Do not only create todos.
- Do not repeatedly say "I need to implement this now."
- Do not repeatedly read the same files unless needed.
- Once you have enough context, start editing files.
- Prefer completing a working first version over designing a perfect architecture.
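For reference, the data model in section 2 maps naturally onto a TypeScript union-typed interface plus a couple of small helpers. A minimal sketch, assuming the field names from the prompt above; the helper names (`STORAGE_KEY`, `loadJobs`, `saveJobs`, `isOverdue`) and the ISO date convention are illustrative, not what the model actually generated:

```typescript
// Hypothetical src/types.ts + src/utils/storage.ts sketch.
export type Priority = "low" | "normal" | "high" | "urgent";
export type Status = "intake" | "waiting-parts" | "in-progress" | "ready" | "completed";

export interface RepairJob {
  id: string;
  customerName: string;
  customerPhone: string;
  bikeDescription: string;
  serviceType: string;
  priority: Priority;
  status: Status;
  dueDate: string;   // ISO date, e.g. "2025-06-01"
  notes: string;
  createdAt: string; // ISO timestamp
  updatedAt: string;
}

const STORAGE_KEY = "bike-shop-jobs"; // illustrative key name

export function loadJobs(): RepairJob[] {
  const raw = localStorage.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as RepairJob[]) : [];
}

export function saveJobs(jobs: RepairJob[]): void {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(jobs));
}

// ISO date strings compare correctly as plain strings, so "overdue" is a
// lexicographic comparison against today's date.
export function isOverdue(job: RepairJob, today: string): boolean {
  return job.status !== "completed" && job.dueDate < today;
}
```

Encoding dates as ISO strings keeps the localStorage round-trip trivial (no Date revival needed) and still sorts and compares correctly.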
I fed the llama.cpp logs back into local Qwen (a bit ironic), and it gave this summary of its own performance from a big Copilot/autopilot session that was very successful:
Prompt: analyze these llama.cpp logs and let me know how performance looks:
Based on the logs provided, here is a detailed performance analysis of your llama.cpp instance.
Executive Summary
- Overall Health: Good/Healthy. The server is functioning correctly, handling requests without crashing or hanging.
- Throughput: Moderate. You are seeing roughly 93–105 tokens per second (t/s) for generation. This is typical for mid-range consumer GPUs or efficient CPU inference, depending on the specific hardware and model (GGUF quantization) running.
- Prompt Processing: Variable. This is the biggest bottleneck. Some prompts are processed quickly via cache hits, while others force a full re-evaluation, leading to high latency spikes.
- Memory Usage: Efficient but Growing. VRAM/RAM usage is scaling with context length (up to ~723 MiB for 4 prompts), well within the 12GB limit.
Detailed Breakdown
1. Generation Speed (Eval Time)
This is the speed at which the model generates new text tokens.
- Performance: ~93–105 tokens per second.
- Log Evidence:
  - Task 460: 104.58 tokens per second
  - Task 776: 102.35 tokens per second
  - Task 2848: 101.73 tokens per second
  - Task 3079: 94.12 tokens per second
  - Task 4726: 93.38 tokens per second
- Analysis: This is consistent and stable. The slight drop in speed as context grows (from 104 to 93 t/s) is expected due to increased memory bandwidth pressure and attention calculation overhead as the context window fills up. This is a healthy slope.
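If you want to pull these figures out of the logs yourself rather than eyeballing them, a small extractor works; a sketch assuming llama-server's typical "… tokens per second)" timing lines (the exact log format may differ by version):

```typescript
// Extract every "N tokens per second" figure from llama-server log lines.
// The log line format is an assumption based on typical llama.cpp timing output.
const TPS_RE = /([\d.]+) tokens per second/;

export function extractTps(logLines: string[]): number[] {
  return logLines
    .map((line) => TPS_RE.exec(line))
    .filter((m): m is RegExpExecArray => m !== null)
    .map((m) => parseFloat(m[1]));
}
```

Feeding the whole log through this and plotting the series would make the gentle 104 → 93 t/s slope over the session obvious at a glance.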
2. Prompt Processing (Prefill Time)
This is the time taken to process the input prompt. This varies significantly in your logs.
- Scenario A: Cache Hit (Fast)
  - Task 2848: 856.58 ms for 2,098 tokens (~2,449 t/s).
  - Why: The log shows `restored context checkpoint`. The system found a previous state similar enough to resume from, avoiding full re-processing.
  - Performance: Excellent.
- Scenario B: Cache Miss / Full Re-eval (Slow)
  - Task 460: 68.01 ms for 27 tokens (fast because the prompt is tiny).
  - Task 776: 1944.26 ms for 5,783 tokens (~2,974 t/s).
  - Task 3079: 17307.06 ms for 38,012 tokens (~2,196 t/s).
  - Task 4726: 1153.57 ms for 1,652 tokens (~1,432 t/s).
  - Why: The log explicitly states `forcing full prompt re-processing due to lack of cache data`.
- Analysis:
  - Task 3079 is the most concerning. It took 17.3 seconds just to read the prompt before generating a single word. This is likely due to the prompt being very long (38k tokens) and a cache miss.
  - Task 4726 was slower than Task 2848 despite having a similar context size, likely because it didn't find a perfect cache match (`sim_best = 0.958` is good, but perhaps not enough to skip the entire prefill, or the checkpoint restoration had overhead).
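The per-task throughput numbers in both scenarios are just prompt tokens divided by prefill time; a one-liner to sanity-check them against the log figures:

```typescript
// Prefill throughput = prompt tokens / (prompt eval time in seconds), rounded.
export function prefillTps(tokens: number, ms: number): number {
  return Math.round(tokens / (ms / 1000));
}

// e.g. Task 3079: prefillTps(38012, 17307.06) gives roughly 2196 t/s,
// matching the ~2,196 t/s quoted in the analysis above.
```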
3. Context Management & Caching
- Cache Mechanism: You are using `n_ctx_slot = 262144` (256k context window) and `n_keep = 0`.
- Checkpointing: The system is actively creating checkpoints (`created context checkpoint X of 50`).
  - Each checkpoint is ~62.8 MiB.
  - Task 3079 created 6 checkpoints during processing.
- Issue Identified: The log `forcing full prompt re-processing` appears frequently when switching between very different context lengths or when the similarity threshold isn't met.
  - Task 776 (5,783 tokens) had `n_past = 23` from the previous slot. It checked checkpoints but failed to use them, forcing a full re-process. This suggests the previous context (Task 460, 27 tokens) was too different to be useful for Task 776.
  - Task 3079 (38,012 tokens) also failed to use previous checkpoints, likely because the previous context (Task 2848, ~7.8k tokens) was not similar enough or the position offsets didn't align well for the KV cache.
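The checkpoint numbers above imply a bounded memory cost: at ~62.8 MiB per checkpoint with a cap of 50 (`--ctx-checkpoints 50`), the worst case is about 3,140 MiB, comfortably inside the 12,000 MiB `--cache-ram` budget. A quick sanity check using the sizes reported in the logs:

```typescript
// Checkpoint memory budget, from the per-checkpoint size the log reported.
const CHECKPOINT_MIB = 62.8; // per-checkpoint size from the log
const MAX_CHECKPOINTS = 50;  // --ctx-checkpoints 50
const CACHE_RAM_MIB = 12000; // --cache-ram 12000

export function checkpointBudgetMib(count: number): number {
  return count * CHECKPOINT_MIB;
}

// Full cap: 50 * 62.8 = 3140 MiB, roughly a quarter of the cache-ram limit.
export const worstCaseFits = checkpointBudgetMib(MAX_CHECKPOINTS) < CACHE_RAM_MIB;
```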
4. Resource Utilization
- Memory:
- Total state size reached 723.154 MiB for 4 active prompts.
- Limit is 12,000 MiB.
- Status: Very low memory usage. You have plenty of headroom.
- Latency:
- Total response times vary from 3 seconds (short prompts) to 34.5 seconds (long prompt prefill + generation).
- The 34.5s delay is entirely due to the prompt processing of the 38k token input.
Recommendations for Improvement
- Optimize Context Window Usage:
  - If you are sending very long prompts (like the 38k token one), consider if you can truncate or summarize them. Processing 38k tokens takes ~17 seconds.
  - If these long prompts are frequent, consider increasing `--ctx-size` if you aren't already, but more importantly, ensure your application is reusing the same context slot for similar queries to maximize cache hits.
- Tune `--cache-reuse` (if applicable):
  - The logs show `sim_best` values. If you are seeing many "full prompt re-processing" events, you might experiment with changing the similarity threshold or cache reuse settings if your version of llama.cpp supports fine-tuning these. However, the default behavior is usually robust.
- Monitor "Slot" Reuse:
  - You are currently using `id 0` for all tasks (`slot get_availabl: id 0`). This is good for caching (keeping one long context alive), but it means requests are processed sequentially if they overlap. If you have many concurrent users, consider enabling multiple slots (`--parallel` or similar depending on your setup) to improve concurrency, though this will increase memory usage.
- Investigate the 38k Token Prompt:
  - Task 3079 is the outlier. Why is a prompt 38k tokens long?
  - If this is a chat history, consider implementing a "context compression" strategy in your application layer (e.g., summarizing old messages) before sending to the LLM.
  - If this is RAG (Retrieval Augmented Generation), ensure you are not retrieving more chunks than necessary.
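One simple version of that "context compression" idea is to keep the system message and drop the oldest turns until the estimated token count fits a budget. A rough sketch; the chars/4 token estimate is a crude approximation and every name here (`ChatMessage`, `trimHistory`, etc.) is illustrative, not part of any API in the post:

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Very rough token estimate: ~4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the system message(s) plus as many of the most recent messages as fit.
export function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  let budget = maxTokens - system.reduce((n, m) => n + estimateTokens(m.content), 0);
  const kept: ChatMessage[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = estimateTokens(rest[i].content);
    if (cost > budget) break; // oldest messages beyond the budget are dropped
    budget -= cost;
    kept.unshift(rest[i]);
  }
  return [...system, ...kept];
}
```

Trimming client-side like this also plays well with the checkpoint cache above: the retained suffix of the conversation stays byte-identical between requests, which improves the odds of a cache hit.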
Conclusion
Your llama.cpp setup is performing normally. The generation speed is steady (~93–105 t/s). The main "performance hit" is the variable latency caused by prompt processing, especially for long contexts where cache misses force full re-evaluation. This is an algorithmic/usage-pattern issue, not a hardware bottleneck.
No errors or crashes detected. The system is stable.