I am trying to optimize the Qwen 3.6 35B A3B sampling parameters, but I am having a hard time figuring out a good benchmark to do it with.
As to why I believe the recommended settings may not be optimal: one reason is that Qwen recommends the same ones for both 3.5 and 3.6, yet when I upgraded to 3.6 with everything else identical (even the same quant), 3.6 was getting stuck in tool-call loops in some programmed daily tasks where 3.5 was not, and the fix was bumping the temperature up. Another is that their numbers are round, typical values, which suggests no extensive tuning was done.
I am also quite suspicious of the min_p=0.0 recommendation actually being optimal. A small positive min_p would likely allow relaxing the other samplers, making the setup less restrictive towards plausible tokens while still cutting off the implausible tail harder than the current configs do.
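To make the min_p point concrete, here is a toy sketch of the rule as it is commonly implemented (keep tokens whose probability is at least min_p times the top token's probability); the distribution and token names are made up for illustration:

```python
import math

def min_p_filter(logprobs, min_p):
    """Keep tokens whose probability is >= min_p * p_max.

    `logprobs` maps token -> logprob; returns the surviving token set.
    Illustrative only: real samplers apply this to logits and renormalize.
    """
    probs = {t: math.exp(lp) for t, lp in logprobs.items()}
    p_max = max(probs.values())
    return {t for t, p in probs.items() if p >= min_p * p_max}

# A peaked distribution: a modest min_p prunes only the implausible tail,
# whereas top_p/top_k would need tight settings to achieve the same.
dist = {"the": math.log(0.70), "a": math.log(0.29), "zxq": math.log(0.01)}
print(min_p_filter(dist, 0.1))  # {'the', 'a'} -- 'zxq' falls below 0.07
```

The threshold scales with the top token's probability, which is why min_p adapts to peaked vs. flat distributions in a way a fixed top_p cutoff does not.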
I have tried GSM8K and the metabench subset of GSM8K, IFEval and GPQA diamond.
GSM8K and IFEval are too saturated.
The metabench subset of GSM8K is not saturated, but has at least 20% run-to-run variance.
GPQA Diamond is better behaved but still has at least 2.5% run-to-run variance, and each run on my 3090 takes almost 3 h, so to get a clean signal I would likely need about 10 runs per setting.
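The "10 runs per setting" figure can be sanity-checked with a rough two-sample power calculation; this is a back-of-the-envelope normal approximation (ignoring power against type II error), not a rigorous design:

```python
import math

def runs_needed(sigma, delta, z=1.96):
    # Runs per configuration so that a two-sample comparison with per-run
    # standard deviation `sigma` (in score points) can resolve a score gap
    # `delta` at ~95% confidence. Rough rule: n = 2 * (z * sigma / delta)^2.
    return math.ceil(2 * (z * sigma / delta) ** 2)

# ~2.5 points of run-to-run spread, hoping to resolve a 1.5-point gap:
print(runs_needed(2.5, 1.5))  # 22 runs per setting
```

By this crude estimate, resolving differences smaller than the run-to-run spread actually needs more than 10 runs, which only makes the budget problem below worse.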
My plan was to do a 10-point univariate search per parameter, centered on the midpoint of Qwen's recommended ranges, except for min_p, where they recommend 0.0.
Then use that to set the ranges of a grid search with 3 values per parameter: the univariate optimum plus the two points on either side where the score has fallen by 50% of its total drop over the sweep.
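The bracketing step above can be sketched as a simple scan over the univariate sweep results; the function name and linear-scan approach are my own choices, not anything from an existing tool:

```python
def bracket(xs, scores):
    """Return (low, best, high) grid points from a univariate sweep.

    The outer points are the outermost sweep values whose score is still
    above the halfway mark between the sweep's best and worst scores
    (plain linear scan, no interpolation).
    """
    best_i = max(range(len(scores)), key=scores.__getitem__)
    cutoff = min(scores) + 0.5 * (max(scores) - min(scores))
    lo = next(xs[i] for i in range(best_i + 1) if scores[i] >= cutoff)
    hi = next(xs[i] for i in range(len(xs) - 1, best_i - 1, -1)
              if scores[i] >= cutoff)
    return lo, xs[best_i], hi

# e.g. a temperature sweep peaking at 0.7:
print(bracket([0.5, 0.6, 0.7, 0.8, 0.9], [1, 3, 5, 4, 2]))  # (0.6, 0.7, 0.8)
```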
Then, starting from the optimal cell, run Optuna to try to squeeze out the last bit.
The problem is that with temperature, top_p, top_k and min_p alone, the first phase is 40 points (more if the optima land too far off center, since extra runs would be needed), the second is 81 (3^4 cells), and the third, who knows?
So the first two phases alone on my GPU are a solid 5 months of compute, and the next Qwen will likely be out by then.
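A back-of-the-envelope check of that budget, using the figures above:

```python
# 40 univariate points + 81 grid cells, ~10 repeats each to tame the
# ~2.5% run-to-run variance, ~3 h per GPQA Diamond run on a 3090.
settings = 40 + 81
repeats = 10
hours = settings * repeats * 3
print(hours, round(hours / (24 * 30), 1))  # 3630 h, ~5.0 months
```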
There was a previous 3.5 thread, but it was mostly vibes about which settings might be better: https://old.reddit.com/r/LocalLLaMA/comments/1ryb028/qwen35_best_parameters_collection/
Maybe there just isn't a quick, low-variance benchmark that would discern between configurations. To actually benchmark sampling differences you can't use logprob-based benchmarks (or at least I don't know a way), so you need generative benchmarks, and there are fewer of those and they are much slower.
Also, the sampling itself introduces variance, and it may well be that once sampling is involved you simply need a ton of questions to average that out.
So I'm leaving this here in case someone knows a better set of benchmarks that would complete in a reasonable time on my 3090, or a better way to evaluate, or happens to be compute-rich and wants to squeeze the last drop out of Qwen.
