Hi everyone! I've been following a lot of the local LLM talk in this forum lately and have learned quite a bit from you all. This is my first post, hopefully not my last. I wanted to share some interesting benchmarks I ran in my free time testing a dual-GPU setup.
Hardware Specs:
Software Setup:
The "Llama_benchy" Metrics:
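The metric names used below follow llama-bench-style conventions: pp<N> is prompt-processing throughput over an N-token prompt, and tg<N> is token-generation throughput over N generated tokens. As a rough sketch of how those two numbers relate to "time to first response" (the helper functions and example numbers here are my own illustration, not figures from the post):

```python
def estimate_ttft(prompt_tokens: int, pp_speed: float) -> float:
    """Estimate time to first token: prompt length divided by
    prompt-processing throughput (tokens/s)."""
    return prompt_tokens / pp_speed

def estimate_total_time(prompt_tokens: int, gen_tokens: int,
                        pp_speed: float, tg_speed: float) -> float:
    """Total wall time = prefill time + generation time."""
    return prompt_tokens / pp_speed + gen_tokens / tg_speed

# Illustrative numbers only (not measurements from the post):
# a 12000-token prompt at 900 t/s prefill takes ~13.3 s before
# the first generated token appears.
ttft = estimate_ttft(12000, 900.0)
total = estimate_total_time(12000, 4096, 900.0, 30.0)
```

This is why pp12000 dominates perceived latency on long prompts: prefill time grows with context length even when generation speed stays constant.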
I’ve had a blast with the Qwen3.5 series lately—especially the 35BA3B model. It was already fast on my old setup (4070 + RAM offload), but adding the RTX 3060 gives me way more headroom. I tested these 4 models:
All models used max_concurrent_preds=1, full GPU offload, and flash attention enabled.
Benchmark Results:
Prompt Processing Speed - Dual GPU
Time to first response - Dual GPU
Analysis:
The "New GPU" Comparison
I wanted to see how much the RTX 3060 actually helped my favorite model, Qwen3.5 35B-A3B, compared to my old setup (4070 + CPU + RAM offload):
Analysis:
Prompt Processing - Dual vs Single GPU
Token Generation Throughput - Dual vs Single GPU
Time to first response - Dual vs Single GPU
VRAM & Utilization Notes: I didn't get perfect readings (mostly just Task Manager), so take this with a grain of salt. The RTX 4070 hovered around 40-45% utilization, while the 3060 sat between 50-60%. The memory split was a bit odd: despite the 4070 being primary, the 3060 always seemed to take a slightly larger chunk of VRAM (about 300–400 MB more), excluding base Windows usage.
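For more reliable per-GPU readings than Task Manager, nvidia-smi's CSV query mode works on Windows as well (it ships with the NVIDIA driver). A minimal sketch, where the query flags are standard nvidia-smi options but the parsing helper and sample output are my own illustration:

```python
import subprocess

QUERY = "index,name,utilization.gpu,memory.used"

def read_gpu_stats() -> str:
    # Query both GPUs; noheader/nounits keeps the output easy to parse.
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout

def parse_gpu_stats(csv_text: str) -> list[dict]:
    """Parse the noheader/nounits CSV into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, name, util, mem = [f.strip() for f in line.split(",")]
        rows.append({"index": int(idx), "name": name,
                     "util_pct": int(util), "mem_used_mib": int(mem)})
    return rows

# Made-up sample output, roughly matching the numbers reported above:
sample = ("0, NVIDIA GeForce RTX 4070, 43, 11400\n"
          "1, NVIDIA GeForce RTX 3060, 55, 11750\n")
stats = parse_gpu_stats(sample)
```

Polling this in a loop during a benchmark run would also make the 4070-vs-3060 utilization gap easier to quantify than eyeballing Task Manager.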
Conclusions:
Final advice: If you're on the fence about a dual-GPU setup, go for it! Just keep realistic expectations: it's great for hobbyist use, and honestly it's just a lot of fun to hunt for deals, install the cards, and play around with them. If anyone has suggestions to improve my setup, or tools for objective quality testing, please let me know! Closing remarks: I corrected the text for grammar issues with Gemma4-26B-A4B at the end. It was quite fast, but it kept insisting that Qwen2.5 and Gemma2 are the latest models, and added that I would lose credibility if I didn't use the correct version numbers 😂
Performance Benchmark - Qwen3.5 & Gemma4 on dual GPU setup (RTX 4070 + RTX 3060)
Reddit r/LocalLLaMA / 4/14/2026
📰 News
Key Points
- A user shared local LLM performance benchmarks comparing Qwen3.5 and Gemma4 models when run on a dual-GPU Windows 11 setup (RTX 4070 primary, RTX 3060 secondary via PCIe x2 slot).
- The test uses LMStudio (v0.4.11) with a split strategy prioritizing the 4070 and relaxed loading guardrails, measuring throughput with pp12000, tg32, and tg4096 across different prompt and generation lengths.
- Reported results suggest the added RTX 3060 provides “headroom” for Qwen3.5 (notably 35BA3B) versus an earlier single-GPU approach that relied more on RAM offload.
- The benchmarking focuses on 50k context variants of two Qwen3.5 GGUF models (Q4KS and Q4KM) and a Gemma4 26B GGUF model (A4B-it), aiming to simulate real “open code” and short/long reply workloads.
- The post is primarily a practical hardware/software configuration and measurement report for enthusiasts optimizing local inference performance and VRAM capacity.
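The "split strategy prioritizing the 4070" isn't spelled out in the post beyond that phrase, but the general idea behind proportional layer splitting in llama.cpp-style backends can be sketched as follows. This is an illustrative sketch under my own assumptions (the function, the priority bias, and the layer count are hypothetical, not LMStudio's actual logic):

```python
def split_layers(n_layers: int, vram_mib: list[int],
                 priority: int = 0, bias: float = 1.1) -> list[int]:
    """Assign transformer layers to GPUs proportionally to VRAM,
    nudging extra weight toward the priority GPU."""
    weights = [v * (bias if i == priority else 1.0)
               for i, v in enumerate(vram_mib)]
    total = sum(weights)
    counts = [int(n_layers * w / total) for w in weights]
    counts[priority] += n_layers - sum(counts)  # remainder to priority GPU
    return counts

# Both the RTX 4070 and RTX 3060 have 12 GB; index 0 is the 4070.
split = split_layers(48, [12288, 12288], priority=0)
```

With equal VRAM on both cards, the bias is what tips a few extra layers onto the primary GPU; a layer-wise split like this moves activations across PCIe once per boundary, which is why even an x2 slot for the second card can still pay off.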