Hey everyone,
Yesterday I shared some static embedding models I'd been working on using model2vec + tokenlearn. Since then I've been grinding on improvements and ended up with something I think is pretty cool: a full family of models ranging from 125MB down to 700KB, all drop-in compatible with model2vec and sentence-transformers.
The lineup:
| Model | Avg (25 tasks MTEB) | Size | Speed (CPU) |
|---|---|---|---|
| potion-mxbai-2m-512d | 72.13 | ~125MB | ~16K sent/s |
| potion-mxbai-256d-v2 | 70.98 | 7.5MB | ~15K sent/s |
| potion-mxbai-128d-v2 | 69.83 | 3.9MB | ~18K sent/s |
| potion-mxbai-micro | 68.12 | 0.7MB | ~18K sent/s |
Evaluated on 25 MTEB tasks (10 STS, 12 Classification, 3 PairClassification), English subsets only. Note: sent/s is sentences per second on my i7-9750H.
These are NOT transformers! They're pure lookup tables, with no neural network forward pass at inference: tokenize, look up embeddings, mean pool. The whole thing runs in numpy.
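To make that concrete, here's a minimal sketch of the whole inference path. The vocab, embedding table, and whitespace "tokenizer" below are toy stand-ins I made up for illustration; the real models use a proper subword tokenizer and trained embeddings, but the shape of the computation is the same.

```python
import numpy as np

# Toy vocab + embedding table (hypothetical; real models are much bigger).
vocab = {"hello": 0, "world": 1, "static": 2, "[UNK]": 3}
emb_table = np.random.default_rng(42).normal(size=(4, 8)).astype(np.float32)

def encode(sentence: str) -> np.ndarray:
    # Whitespace split stands in for the real subword tokenizer.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sentence.lower().split()]
    # Look up one row per token, then mean pool into a single sentence vector.
    return emb_table[ids].mean(axis=0)

vec = encode("hello static world")  # one 8-dim vector, no forward pass
```

That's the entire "model": two array lookups and a mean. Everything expensive happened at training time.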
For context, all-MiniLM-L6-v2 scores 74.65 avg at ~80MB and ~200 sent/s on the same benchmark. So the 256D model gets ~95% of MiniLM's quality at ~10x smaller and ~75x faster.
The 700KB micro model is the one I'm most excited about. It uses vocabulary quantization (clustering 29K token embeddings down to 2K centroids) and still scores 68.12 on the same 25-task suite.
But why?
Fair question. To be clear, it's a semi-niche use case, but:
- **Edge/embedded/WASM:** try loading a 400MB ONNX model in a browser extension or on an ESP32. These just work anywhere you can run numpy, and writing a custom lib probably isn't that difficult either.
- **Batch processing millions of docs:** when you're embedding your entire corpus, 15K sent/s on a single CPU core means you can chew through 50M documents in about an hour. No GPU scheduling, no batching headaches.
- **Cost:** these run on literally anything; reuse any e-waste as an embedding server! (Another project I plan to share here soon is a custom FPGA built to do this with one of these models!)
- **Startup time:** transformer models take seconds to load; these load in milliseconds. Great if you're doing one-off embeddings in a CLI tool or serverless function.
- **Prototyping:** sometimes you just want semantic search working in 3 lines of code without thinking about infrastructure. Install model2vec, load the model, done. I've personally already found plenty of use for the larger model for that exact reason.
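For the prototyping case, the search side is just cosine similarity over the embedding matrix. A minimal sketch in plain numpy; the random vectors below stand in for whatever `model.encode(...)` would give you:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in corpus embeddings (in practice: model.encode(corpus)).
corpus_emb = rng.normal(size=(4, 8)).astype(np.float32)
# Fake a query that is semantically "close" to document 2.
query_emb = corpus_emb[2] + 0.01 * rng.normal(size=8).astype(np.float32)

def cosine_top_k(query, corpus, k=2):
    # Normalize rows, then one matrix-vector product gives all similarities.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

idx, scores = cosine_top_k(query_emb, corpus_emb)
```

No index structure, no ANN library; at these speeds, brute force is fine until the corpus gets genuinely large.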
How to use them:
```python
from model2vec import StaticModel

# Pick your size
model = StaticModel.from_pretrained("blobbybob/potion-mxbai-256d-v2")
# or the tiny one
model = StaticModel.from_pretrained("blobbybob/potion-mxbai-micro")

embeddings = model.encode(["your text here"])
```
All models are on HuggingFace under blobbybob. Built on top of MinishLab's model2vec and tokenlearn; great projects, check them out if you haven't.
Happy to answer questions. Still have a few ideas on the backlog, but wanted to share where things are at.