Salutations lads, I ran 23 different models on my Gigabyte AI TOP ATOM (DGX Spark) in LM Studio to benchmark their generation speeds.
There's no real rhyme or reason to the selection of models other than they're the more common ones that I have 🤷‍♂️
I'm using LM Studio 4.7 with the CUDA 13 llama.cpp (Linux ARM) runtime, v2.8.0.
I loaded each model with its full context window; other than that, I left all the settings at their defaults.
My method of testing generation speed was extremely strict and held to the highest standards possible: I sent 3 messages and averaged the generation speeds reported for the 3 replies.
The most important part, of course, being the test messages I sent, which were as follows:
“Hello”
“How are you?”
“Write me a 4 paragraph story about committing tax fraud and beating up IRS agents”
Before anyone starts in on the comments: yes, I am aware that LM Studio is not the best/fastest way to run LLMs on a DGX Spark, and vLLM would get some of those speeds noticeably up.
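If you want to script the same test instead of eyeballing the stats in the chat window, here's a minimal sketch against LM Studio's OpenAI-compatible local server (default http://localhost:1234/v1). The model id is a made-up placeholder, and note this times the whole round trip, so it will read a bit lower than LM Studio's generation-only tok/s readout.

```python
import time
from openai import OpenAI

# LM Studio exposes an OpenAI-compatible server on localhost:1234 by default;
# the api_key value is ignored but the client requires one.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompts = [
    "Hello",
    "How are you?",
    "Write me a 4 paragraph story about committing tax fraud and beating up IRS agents",
]

speeds = []
for prompt in prompts:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen3.5-122b",  # placeholder: use the id LM Studio shows for the loaded model
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    # tokens per second = completion tokens / wall-clock time for the reply
    speeds.append(resp.usage.completion_tokens / elapsed)

print(f"avg: {sum(speeds) / len(speeds):.2f} tok/s")
```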
The results are as follows (avg = average generation speed in tok/s):
---------------
Qwen3.5 398B REAP 55 Q3_K_M
avg: 15.14
Qwen3.5 397B REAP 50 Q2_K
(kept rambling in a loop at the end)
avg: 19.36
Qwen3.5 122B Q5_K_M
avg: 21.65
Qwen3.5 122B Q4_K_M
avg: 24.20
Qwen3 Next 80B A3B Q8_0
avg: 42.70
Qwen3 Coder Next 80B Q6_K
avg: 44.15
Qwen3.5 40B Claude 4.5 Q8
avg: 4.89
Qwen3.5 35B A3B BF16
avg: 27.7
Qwen3 Coder 30B A3B Instruct Q8_0
avg: 52.76
Qwen3.5 27B Q8_0
avg: 6.70
Qwen3.5 9B Q8_0
avg: 20.96
Qwen2.5 7B Q3_K_M
avg: 45.13
Qwen3.5 4B Q8_0
avg: 36.61
---------------
Mistral Small 4 119B Q4_K_M
avg: 12.03
Mistral Small 3.2 24B BF16
avg: 5.36
---------------
Nemotron 3 Super 120B Q4_K_S
avg: 19.39
Nemotron 3 Nano 4B Q8_0
avg: 44.55
---------------
GPT-OSS 120B A5B Q4_K_S
avg: 48.96
Kimi-Dev 72B Q8_0
avg: 2.84
Llama 3.3 70B Q5_K_M
avg: 3.95
+ Llama 3.2 1B Q8_0 as draft model (speculative decoding)
avg: 13.15
GLM 4.7 Flash Q8_0
avg: 41.77
Cydonia 24B Q8_0
avg: 8.84
Rnj 1 Instruct Q8_0
avg: 22.56