MiniMax-M2.7 vs Qwen3.5-122B-A10B for 96GB VRAM full offload?!

Reddit r/LocalLLaMA / 4/13/2026


Key Points

  • The author compares two open-weight GGUF quantized LLMs—MiniMax-M2.7 (IQ2_KS) and Qwen3.5-122B-A10B (IQ5_KS)—on a 96GB VRAM full-offload dual-A6000 setup using local inference tooling.
  • Based on a quick EvalPlus/HumanEval run, Qwen3.5-122B-A10B achieves a higher pass@1 score (0.494) than MiniMax-M2.7 (0.220), with similar overall evaluation time.
  • In practical local coding (“vibecoding”), the author reports Qwen3.5 delivers better inference speed, code quality, and overall user experience for their workflow.
  • The post emphasizes that performance can vary and the benchmark harness may not perfectly reflect expected settings, but the author’s overall conclusion is that Qwen3.5 remains preferable for their 96GB offload use case today.
tl;dr

For 96GB VRAM full-offload rigs, I'd probably choose Qwen3.5-122B-A10B over MiniMax-M2.7 today. Curious what y'all's experience is.

Quants Tested

  • ubergarm/MiniMax-M2.7-GGUF IQ2_KS 69.800 GiB (2.622 BPW)
  • ubergarm/Qwen3.5-122B-A10B-GGUF IQ5_KS 77.341 GiB (5.441 BPW)

Rambling Details

It's amazing that we now have multiple open-weight LLMs that work pretty well for local vibecoding! Both quants were tested and work well enough with opencode configured to enable/disable thinking dynamically (really speeds up generating a 5-word thread title lol).

Thanks to Wendell of level1techs I have access to a rig with 96GB VRAM for benchmarking and making GGUF quants. My daily driver has been Qwen3.5-122B fully offloaded across the 2x A6000 GPUs (each kind of like a 3090 but with 48GB VRAM). Now with the new MiniMax-M2.7 quants, I had to decide whether a larger but more heavily quantized model would be better.

Like all complex questions, the answer is usually, "it depends"!

But at least for my purposes, it seems like Qwen3.5-122B-A10B is still on top for inference speed, code quality, and general quality of life.

Here is some data to back up this opinion:

humaneval benchmark

I vibe-coded a quick EvalPlus Python client and threw the 164-problem HumanEval benchmark at both quants running on the ik_llama.cpp llama-server.
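Roughly what the client does per problem (a sketch, not my actual harness; the endpoint URL, port, prompt wrapper, and `max_tokens` cap are assumptions — llama-server exposes an OpenAI-compatible chat completions API):

```python
import json

# Assumed default llama-server address/port; adjust to your setup.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

def build_request(problem_prompt: str) -> dict:
    """Build one chat-completion payload with the sampling settings I used."""
    return {
        "messages": [
            {"role": "user",
             "content": "Complete this Python function:\n" + problem_prompt},
        ],
        "temperature": 1.0,   # per MiniMax's model card
        "top_p": 0.95,
        "max_tokens": 1024,   # arbitrary cap for a single solution
    }

payload = build_request("def add(a, b):\n    ...")
print(json.dumps(payload)[:40])
```

The client then POSTs one such payload per HumanEval problem and hands the completions to EvalPlus for scoring.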

| Metric | MiniMax-M2.7 IQ2_KS | Qwen3.5-122B-A10B IQ5_KS |
|---|---|---|
| pass@1 (base) | 0.220 | 0.494 |
| pass@1 (base+extra) | 0.220 | 0.482 |
| Eval time | 32:48 | 31:20 |
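For reference, pass@1 with one sample per problem is just the fraction of the 164 problems whose generated solution passes the tests. The raw pass counts below are my back-of-envelope guesses that would reproduce the scores, not numbers from the run logs:

```python
# pass@1 with a single sample per problem reduces to passed / total.
def pass_at_1(passed: int, total: int = 164) -> float:
    return round(passed / total, 3)

# Hypothetical raw counts consistent with the table's scores:
print(pass_at_1(36))   # ~0.220  (MiniMax-M2.7, base)
print(pass_at_1(81))   # ~0.494  (Qwen3.5, base)
print(pass_at_1(79))   # ~0.482  (Qwen3.5, base+extra)
```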

This used temperature=1.0 and top_p=0.95, as suggested by MiniMax's model card. To be fair, this was a quickly vibe-coded client test harness, so maybe something is off. Not sure what the results should even look like haha... But Qwen3.5 got a higher score!

inference speed

I ran llama-sweep-bench on the same build of ik_llama.cpp, using a command similar to the llama-server one I used for evaluation, filling up most of the 96GB VRAM. While MiniMax-M2.7 could go out further, I got tired of waiting and hit Ctrl-C on the test. You get the point.
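A dry-run sketch of what such an invocation looks like — the model path, context size, and exact flag choices here are illustrative, not my actual command (the flags mirror the usual llama.cpp/ik_llama.cpp ones: `-m` model, `-c` context, `-ngl` GPU layers, `-fa` flash attention):

```shell
# Hypothetical local quant path; substitute your own GGUF file.
MODEL=./Qwen3.5-122B-A10B-IQ5_KS.gguf

CMD="./build/bin/llama-sweep-bench -m $MODEL -c 32768 -ngl 99 -fa"
echo "$CMD"   # dry run: print the command instead of launching the sweep
```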

https://preview.redd.it/4t0gcl7y4uug1.png?width=2087&format=png&auto=webp&s=ea2db24e196c0e132efcf101aed8db205fd62b87

quality of life

MiniMax-M2.7 does support some self-speculative decoding, whereas Qwen3.5 does not (recurrent model). However, MiniMax requires a fairly heavily quantized kv-cache just to fit even 160k of context.

Qwen3.5-122B runs with mmproj loaded for image processing and supports full 256k unquantized kv-cache which is just nice.
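For intuition on why an unquantized 256k cache is painful for a full-attention model: a standard kv-cache grows linearly with context, roughly 2 (K and V) × layers × kv-heads × head-dim × context × bytes-per-element, while recurrent layers keep a fixed-size state regardless of context length. The architecture numbers below are made up purely to show the scale of the formula, not either model's real config:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> int:
    # One K and one V tensor per attention layer, growing with context.
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem)

# Hypothetical dense-attention model: 48 layers, 8 kv-heads of dim 128,
# f16 cache (2 bytes/elem) at 256k context.
gib = kv_cache_bytes(48, 8, 128, 256 * 1024, 2) / 2**30
print(f"{gib:.1f} GiB")  # prints "48.0 GiB"
```

At that scale you can see why long contexts push full-attention models toward q8/q4 kv-cache, and why a recurrent architecture sidesteps the problem.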

Conclusion

I'm hungry, it's dinner time.

submitted by /u/VoidAlchemy