Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible)

Reddit r/LocalLLaMA / 3/24/2026

💬 OpinionDeveloper Stack & InfrastructureTools & Practical UsageModels & Research

共有:

Key Points

The post introduces oQ, a data-driven mixed-precision quantization method for Apple Silicon designed to improve quality over uniform bit-width quantization commonly used with mlx-lm.
Instead of using fixed rules, oQ calibrates each layer to estimate quantization sensitivity and then allocates precision (bits) dynamically where it matters most.
It aims to reduce user friction by working as a universal approach compatible with models beyond any single inference server, while fitting into the mlx-lm ecosystem (mlx-lm compatible).
The author notes that critical layers vary by model architecture and that oQ accounts for this using calibration datasets and configurable precision floors.
Preliminary benchmarks on Qwen3.5-35B-A3B suggest sizable accuracy gains for oQ at low bit settings compared with mlx-lm’s standard uniform quantization.

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible)

One of the things i found most frustrating while using mlx-lm was the quality of models quantized with a single uniform bit width. Sure, mlx-lm supports various quantization options, but for most users, downloading a full-precision model and quantizing it yourself is a real barrier. (Even if someone tells you it's easy. The fear of the CLI is real.)

So i started thinking. Quantization should not be exclusive to any particular inference server. The mlx-lm platform already provides a solid foundation, and on top of that, users should be able to use any model they want, on any server they prefer, regardless of who quantized it.

That thinking led me to build oQ: oMLX Universal Dynamic Quantization.

oQ is a data-driven mixed-precision quantization system for Apple Silicon. Instead of assigning bits by fixed rules or tensor type, oQ measures each layer's actual quantization sensitivity through calibration and allocates bits where the data says they matter most.

Not every model shares the same architecture. Are the first and last layers really always the most important? (Okay, in most cases they are. But not always.) Different model structures have different critical layers, and the minimum precision floor varies too. oQ uses calibration datasets to perform sensitivity-driven allocation, identifying which layers are critical and which ones can tolerate lower precision.

I'll keep the technical details brief here. If you want to dig deeper, check out the full documentation: oQ Quantization

At least for now, i think i've found the daily-use quantization i was looking for. Everyone has their own favorite quantization approach, but if you haven't found yours yet, or if you're still using the default mlx-lm quant, i'd recommend giving oQ a try.

Benchmarks (Qwen3.5-35B-A3B)

Benchmark	Samples	2-bit mlx-lm	2-bit oQ	3-bit mlx-lm	3-bit oQ	4-bit mlx-lm	4-bit oQ
MMLU	300	14.0%	64.0%	76.3%	85.0%	79.7%	83.3%
TRUTHFULQA	300	17.0%	80.0%	81.7%	86.7%	87.7%	88.0%
HUMANEVAL	164 (full)	0.0%	78.0%	84.8%	86.6%	87.2%	85.4%
MBPP	300	0.3%	63.3%	69.0%	72.0%	71.7%	74.3%

You can quantize models from Github (omlx.ai), and the output works with any inference server. Try it in oMLX, or load the pre-quantized models straight into whatever you're already using, whether that's LM Studio or anything else: https://huggingface.co/Jundot/models

submitted by /u/cryingneko
[link] [comments]

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 3/24DailyView insight →

Interactive Web Visualization of GPT-2

Reddit r/artificial

From infrastructure to AI: how Alibaba Cloud powers the global ambitions of Chinese companies

SCMP Tech

[R] Causal self-attention as a probabilistic model over embeddings

Reddit r/MachineLearning

The 5 software development trends that actually matter in 2026 (and what they mean for your startup)

Dev.to

InVideo AI Review: Fast Finished

Dev.to

Introducing oQ: data-driven mixed-precision quantization for Apple Silicon (mlx-lm compatible)

Key Points

Benchmarks (Qwen3.5-35B-A3B)

💡 Insights using this article

Related Articles

Interactive Web Visualization of GPT-2

From infrastructure to AI: how Alibaba Cloud powers the global ambitions of Chinese companies

[R] Causal self-attention as a probabilistic model over embeddings

The 5 software development trends that actually matter in 2026 (and what they mean for your startup)

InVideo AI Review: Fast Finished

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer