Hey everyone,
Yesterday I shared some static embedding models I'd been working on using model2vec + tokenlearn. Since then I've been grinding on improvements and ended up with something I think is pretty cool: a full family of models ranging from 125MB down to 700KB, all drop-in compatible with model2vec and sentence-transformers.
The lineup:
| Model | Avg (25 tasks MTEB) | Size | Speed (CPU) |
|---|---|---|---|
| potion-mxbai-2m-512d | 72.13 | ~125MB | ~16K sent/s |
| potion-mxbai-256d-v2 | 70.98 | 7.5MB | ~15K sent/s |
| potion-mxbai-128d-v2 | 69.83 | 3.9MB | ~18K sent/s |
| potion-mxbai-micro | 68.12 | 0.7MB | ~18K sent/s |
Evaluated on 25 MTEB tasks (10 STS, 12 Classification, 3 PairClassification), English subsets only. Note: sent/s is sentences per second on my i7-9750H.
These are NOT transformers! They're pure lookup tables, with no neural network forward pass at inference: tokenize, look up embeddings, mean pool. The whole thing runs in numpy.
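To make that concrete, here's a minimal sketch of the whole inference path. The vocab, embedding table, and whitespace "tokenizer" below are toy stand-ins I made up for illustration; the real models use a proper subword tokenizer and trained embeddings, but the shape of the computation is the same.

```python
import numpy as np

# Toy vocab + embedding table (hypothetical; real models are much bigger).
vocab = {"hello": 0, "world": 1, "static": 2, "[UNK]": 3}
emb_table = np.random.default_rng(42).normal(size=(4, 8)).astype(np.float32)

def encode(sentence: str) -> np.ndarray:
    # Whitespace split stands in for the real subword tokenizer.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sentence.lower().split()]
    # Look up one row per token, then mean pool into a single sentence vector.
    return emb_table[ids].mean(axis=0)

vec = encode("hello static world")  # one 8-dim vector, no forward pass
```

That's the entire "model": two array lookups and a mean. Everything expensive happened at training time.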
For context, all-MiniLM-L6-v2 scores 74.65 avg at ~80MB and ~200 sent/s on the same benchmark. So the 256D model gets ~95% of MiniLM's quality at ~10x smaller and ~75x faster.
The 700KB micro model is the one I'm most excited about. It uses vocabulary quantization (clustering 29K token embeddings down to 2K centroids) and still scores 68.12 on the same 25-task suite.
But why?
Fair question. To be clear, it's a semi-niche use case, but:
- **Edge/embedded/WASM:** try loading a 400MB ONNX model in a browser extension or on an ESP32. These just work anywhere you can run numpy, and writing a custom lib probably isn't that difficult either.
- **Batch processing millions of docs:** when you're embedding your entire corpus, 15K sent/s on a single CPU core means you can chew through 50M documents in about an hour. No GPU scheduling, no batching headaches.
- **Cost:** these run on literally anything; reuse any e-waste as an embedding server! (Another project I plan to share here soon is a custom FPGA built to do this with one of these models!)
- **Startup time:** transformer models take seconds to load; these load in milliseconds. Great if you're doing one-off embeddings in a CLI tool or serverless function.
- **Prototyping:** sometimes you just want semantic search working in 3 lines of code without thinking about infrastructure. Install model2vec, load the model, done. I've personally already found plenty of use for the larger model for that exact reason.
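For the prototyping case, the search side is just cosine similarity over the embedding matrix. A minimal sketch in plain numpy; the random vectors below stand in for whatever `model.encode(...)` would give you:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in corpus embeddings (in practice: model.encode(corpus)).
corpus_emb = rng.normal(size=(4, 8)).astype(np.float32)
# Fake a query that is semantically "close" to document 2.
query_emb = corpus_emb[2] + 0.01 * rng.normal(size=8).astype(np.float32)

def cosine_top_k(query, corpus, k=2):
    # Normalize rows, then one matrix-vector product gives all similarities.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q
    top = np.argsort(-sims)[:k]
    return top, sims[top]

idx, scores = cosine_top_k(query_emb, corpus_emb)
```

No index structure, no ANN library; at these speeds, brute force is fine until the corpus gets genuinely large.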
How to use them:
```python
from model2vec import StaticModel

# Pick your size
model = StaticModel.from_pretrained("blobbybob/potion-mxbai-256d-v2")
# or the tiny one
model = StaticModel.from_pretrained("blobbybob/potion-mxbai-micro")

embeddings = model.encode(["your text here"])
```
All models are on HuggingFace under blobbybob. Built on top of MinishLab's model2vec and tokenlearn; great projects, check them out if you haven't.
Happy to answer questions. Still have a few ideas on the backlog, but wanted to share where things are at.