Pushing a 5-Year-Old 6GB VRAM Laptop to Its Limits: Qwen3.6-35B-A3B

Reddit r/LocalLLaMA / 5/4/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A contributor documents how they ran the Qwen3.6-35B-A3B model locally on an Asus ROG Zephyrus G14 (2020) with only a 6GB VRAM RTX 2060 Max-Q, reaching usable performance of roughly 23 tokens/second plugged in and 10+ tokens/second on battery.
  • The setup uses two local inference commands (llama-server and lm-server-tq) with tuned parameters: long context (up to 128k in the provided “Tom's fork” configuration) plus performance and memory controls such as quantized KV-cache types, CPU thread pinning, and offloading MoE experts to the CPU.
  • The author highlights that the model is practical for agent workflows, stating it is very good to use with “pi agent,” suggesting real-world usability beyond pure experimentation.
  • They provide full “localmaxxing” documentation in a blog post and invite feedback on further improvements, positioning the write-up as a reproducible performance-tuning guide for constrained hardware.
  • Overall, the post emphasizes that open models have improved enough to make even heavily memory-limited laptops viable for running sizable LLMs with careful optimization.

For the past few weeks, I have been trying to get this model working on my hardware. It still feels incredible how much better open models have become. I couldn't have gotten this model to work on my 5yo laptop if not for this sub and its amazing people. The model is actually usable at ~23 t/s...even getting 10+ t/s when unplugged! It is very good to use with pi agent.

If you think this setup can be improved, I'd love to know more...

I've documented my full localmaxxing journey in a blog post here; someone might find it helpful.

TL;DR

Laptop: Asus ROG Zephyrus G14 2020

CPU: Ryzen 7 (8c/16t) @ 2900 MHz (boost disabled)

Mem: 24GB DDR4-3200 RAM

GPU: RTX 2060 Max-Q 6GB VRAM

General:

#!/bin/bash
llama-server \
  -m ~/dev/models/Qwen3.6-35B-A3B-APEX-GGUF/Qwen3.6-35B-A3B-APEX-I-Compact.gguf \
  -mm ~/dev/models/Qwen3.6-35B-A3B-GGUF/mmproj-F16.gguf \
  --no-mmproj-offload \
  -a Qwen3.6-35B-A3B-APEX-64k \
  --host 0.0.0.0 --port 8000 \
  --fit off -fa on \
  --ctx-size 65536 \
  --threads 8 --threads-batch 12 \
  --cpu-range 0-7 --cpu-strict 1 \
  --cpu-range-batch 0-11 --cpu-strict-batch 1 \
  --numa isolate \
  --prio 2 \
  --no-mmap --parallel 1 --jinja \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ubatch-size 1024 --batch-size 2048 \
  --n-cpu-moe 36 \
  --cache-reuse 256 \
  --ctx-checkpoints 8 \
  --metrics \
  --cache-ram 4096 \
  --spec-type ngram-mod \
  --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 48
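(Not part of the original post, but once the server is up, a quick smoke test is handy. Stock llama-server exposes an OpenAI-compatible chat endpoint, so a curl like the sketch below, where the prompt is just a placeholder and the model field echoes the -a alias, should confirm the server answers on port 8000. Recent llama-server builds also include a timings block in the response, which is a quick way to sanity-check the ~23 t/s figure, though that can vary by build.)

#!/bin/bash
# Minimal smoke test for the server started above. Assumes it is reachable on
# localhost:8000 and that the standard llama-server OpenAI-compatible
# /v1/chat/completions endpoint is available.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3.6-35B-A3B-APEX-64k",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'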

Long Context (Tom's fork):

#!/bin/bash
lm-server-tq \
  -m ~/dev/models/Qwen3.6-35B-A3B-APEX-GGUF/Qwen3.6-35B-A3B-APEX-I-Compact.gguf \
  -a Qwen3.6-35B-A3B-APEX-128k \
  --host 0.0.0.0 --port 8000 \
  --fit off -fa on \
  --ctx-size 131072 \
  --threads 8 --threads-batch 12 \
  --cpu-range 0-7 --cpu-strict 1 \
  --cpu-range-batch 0-11 --cpu-strict-batch 1 \
  --numa isolate \
  --prio 2 \
  --no-mmap --parallel 1 --jinja \
  --cache-type-k turbo3 --cache-type-v turbo4 \
  --ubatch-size 1024 --batch-size 2048 \
  --n-cpu-moe 36 \
  --cache-reuse 256 \
  --ctx-checkpoints 8 \
  --metrics \
  --cache-ram 4096 \
  --spec-type ngram-mod \
  --spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 12 --spec-ngram-mod-n-max 48
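(Also not from the post: since both commands pass --metrics, stock llama-server exposes Prometheus-style counters at /metrics, which is an easy way to watch generation speed and KV-cache usage during a long-context run. Whether Tom's fork keeps the same endpoint and metric names is an assumption.)

#!/bin/bash
# Rough monitoring loop: scrape the /metrics endpoint enabled by --metrics and
# keep only token- and KV-cache-related counters. Endpoint path and metric
# names follow stock llama-server; the fork may differ.
watch -n 5 'curl -s http://localhost:8000/metrics | grep -E "tokens|kv_cache"'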
submitted by /u/abhinand05