I am trying to optimize the Qwen 3.6 35B A3B sampling parameters, but I am having a hard time figuring out a good benchmark to do it with.
As to why I believe the recommended settings may not be optimal: one reason is that Qwen recommends the same ones for both 3.5 and 3.6, yet when I upgraded to 3.6 with everything else identical (even the same quant), 3.6 was getting stuck in tool-call loops in some programmed daily tasks where 3.5 was not, and the fix was bumping the temperature up. Another is that their numbers are round, typical values, which suggests no extensive tuning was done.
I am also quite suspicious of the min_p=0.0 recommendation actually being optimal. A small positive min_p would likely allow relaxing the other samplers, making the setup less restrictive towards plausible tokens while still cutting off the implausible tail harder than the current configs do.
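To make the min_p point concrete, here is a toy sketch of the rule as it is commonly implemented (keep tokens whose probability is at least min_p times the top token's probability); the distribution and token names are made up for illustration:

```python
import math

def min_p_filter(logprobs, min_p):
    """Keep tokens whose probability is >= min_p * p_max.

    `logprobs` maps token -> logprob; returns the surviving token set.
    Illustrative only: real samplers apply this to logits and renormalize.
    """
    probs = {t: math.exp(lp) for t, lp in logprobs.items()}
    p_max = max(probs.values())
    return {t for t, p in probs.items() if p >= min_p * p_max}

# A peaked distribution: a modest min_p prunes only the implausible tail,
# whereas top_p/top_k would need tight settings to achieve the same.
dist = {"the": math.log(0.70), "a": math.log(0.29), "zxq": math.log(0.01)}
print(min_p_filter(dist, 0.1))  # {'the', 'a'} -- 'zxq' falls below 0.07
```

The threshold scales with the top token's probability, which is why min_p adapts to peaked vs. flat distributions in a way a fixed top_p cutoff does not.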
I have tried GSM8K and the metabench subset of GSM8K, IFEval and GPQA diamond.
GSM8K and IFEval are too saturated.
The metabench subset of GSM8K is not saturated, but has at least 20% run-to-run variance.
GPQA Diamond is better behaved but still has at least 2.5% run-to-run variance, and each run on my 3090 takes almost 3 h, so to get a clean signal I would likely need about 10 runs per setting.
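The "10 runs per setting" figure can be sanity-checked with a rough two-sample power calculation; this is a back-of-the-envelope normal approximation (ignoring power against type II error), not a rigorous design:

```python
import math

def runs_needed(sigma, delta, z=1.96):
    # Runs per configuration so that a two-sample comparison with per-run
    # standard deviation `sigma` (in score points) can resolve a score gap
    # `delta` at ~95% confidence. Rough rule: n = 2 * (z * sigma / delta)^2.
    return math.ceil(2 * (z * sigma / delta) ** 2)

# ~2.5 points of run-to-run spread, hoping to resolve a 1.5-point gap:
print(runs_needed(2.5, 1.5))  # 22 runs per setting
```

By this crude estimate, resolving differences smaller than the run-to-run spread actually needs more than 10 runs, which only makes the budget problem below worse.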
My plan was to do a 10-point univariate search per parameter, centered on the midpoint of Qwen's recommended ranges, except for min_p, where they recommend 0.0.
Then use that to set the ranges of a grid search with 3 values per parameter: the univariate optimum plus the two points on either side where the score has fallen by 50% of its total drop over the sweep.
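The bracketing step above can be sketched as a simple scan over the univariate sweep results; the function name and linear-scan approach are my own choices, not anything from an existing tool:

```python
def bracket(xs, scores):
    """Return (low, best, high) grid points from a univariate sweep.

    The outer points are the outermost sweep values whose score is still
    above the halfway mark between the sweep's best and worst scores
    (plain linear scan, no interpolation).
    """
    best_i = max(range(len(scores)), key=scores.__getitem__)
    cutoff = min(scores) + 0.5 * (max(scores) - min(scores))
    lo = next(xs[i] for i in range(best_i + 1) if scores[i] >= cutoff)
    hi = next(xs[i] for i in range(len(xs) - 1, best_i - 1, -1)
              if scores[i] >= cutoff)
    return lo, xs[best_i], hi

# e.g. a temperature sweep peaking at 0.7:
print(bracket([0.5, 0.6, 0.7, 0.8, 0.9], [1, 3, 5, 4, 2]))  # (0.6, 0.7, 0.8)
```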
Then, starting from the optimal cell, run Optuna to try to squeeze out the last bit.
The problem is that with temperature, top_p, top_k and min_p alone, the first phase is 40 points (more if the optima land too far off center, since extra runs would be needed), the second is 81 (3^4 cells), and the third, who knows?
So the first two phases alone on my GPU are a solid 5 months of compute, and the next Qwen will likely be out by then.
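A back-of-the-envelope check of that budget, using the figures above:

```python
# 40 univariate points + 81 grid cells, ~10 repeats each to tame the
# ~2.5% run-to-run variance, ~3 h per GPQA Diamond run on a 3090.
settings = 40 + 81
repeats = 10
hours = settings * repeats * 3
print(hours, round(hours / (24 * 30), 1))  # 3630 h, ~5.0 months
```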
There was a previous 3.5 thread, but it was mostly vibes about which settings might be better: https://old.reddit.com/r/LocalLLaMA/comments/1ryb028/qwen35_best_parameters_collection/
Maybe there just isn't a quick, low-variance benchmark that would discern between configurations. To actually benchmark sampling differences you can't use logprob-based benchmarks (or at least I don't know a way), so you need generative benchmarks, and there are fewer of those and they are much slower.
Also, the sampling itself introduces variance, and it may well be that once sampling is involved you simply need a ton of questions to average that out.
So I'm leaving this here in case someone knows a better set of benchmarks that would complete in a reasonable time on my 3090, or a better way to evaluate, or happens to be compute-rich and wants to squeeze the last drop out of Qwen.
