I've been running embedding model evals for a while now, and Microsoft just dropped a new model in its Harrier family. harrier-27b hit #1 on binary MTEB at launch, which is not nothing. So I put it through the same graded evaluation pipeline I use for everything else: 24 datasets, three independent LLM judges, and continuous relevance scores from 0 to 10 rather than binary pass/fail.
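For anyone unfamiliar with graded (as opposed to binary) NDCG@10, here's a minimal sketch. It assumes the judges' 0-10 scores have already been averaged into a single gain per document and uses linear gain; some implementations use exponential gain (2^rel - 1) instead, and my pipeline's exact aggregation isn't shown here.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k over graded relevance scores (0-10), given in ranked order.

    DCG discounts each document's gain by log2 of its rank; dividing by the
    ideal DCG (the same documents sorted best-first) normalizes to [0, 1].
    """
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

A perfectly ordered ranking scores 1.0; burying the high-graded documents lower in the list pulls the score down, which is exactly why graded scores discriminate better than pass/fail at the top of the ranking.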
The global numbers
| Model | NDCG@10 | Recall@100 |
|---|---|---|
| zembed-1 | 0.701 | 0.750 |
| voyage-4 | 0.699 | 0.731 |
| harrier-27b | 0.699 | 0.728 |
On NDCG@10, it's basically a three-way tie at the top. harrier-27b is legitimately competitive, and I won't pretend otherwise. But NDCG@10 isn't the whole story, especially in RAG pipelines.
The number that actually matters operationally is Recall@100: whether a relevant document even survives to your reranker. Your reranker can reorder whatever the embedder surfaces, but it cannot conjure up a document the embedder dropped. zembed-1 leads by +2.2 points over harrier-27b here, and that gap compounds downstream.
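Recall@k itself is just set overlap once you've decided what counts as relevant. A minimal sketch; note that with graded 0-10 judgments you'd first binarize at some threshold, and the threshold is whatever your pipeline picks, not something fixed by the metric:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

Any relevant document missing from the top-k list is invisible to every downstream stage, which is the whole argument for weighting this metric in RAG.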
Where reranking amplifies the recall advantage
When I stacked each embedder with a reranker, the recall-to-precision conversion rates told an even clearer story:
| Method | Top-10 lift range |
|---|---|
| harrier-27b + reranker | +4.2% to +4.4% |
| voyage-4 + reranker | +4.5% to +4.9% |
| zembed-1 + reranker | +5.2% to +6.6% |
zembed-1 consistently extracts more signal from the reranking step because it hands the reranker a better candidate pool to begin with. harrier-27b's ceiling is lower at every threshold tested.
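The exact lift metric behind that table isn't spelled out above, so here's one reasonable way to measure it: the percentage-point change in precision@10 when a reranker reorders the embedder's candidate pool. Function names are mine, and this is a sketch of the idea, not my pipeline verbatim.

```python
def precision_at_10(ranking, relevant):
    """Fraction of the top 10 results that are relevant."""
    return sum(1 for doc in ranking[:10] if doc in relevant) / 10

def rerank_lift(embed_ranking, reranked_ranking, relevant):
    """Percentage-point change in precision@10 after reranking the same pool.

    Both rankings must cover the same candidate set: the reranker can only
    promote documents the embedder already retrieved.
    """
    before = precision_at_10(embed_ranking, relevant)
    after = precision_at_10(reranked_ranking, relevant)
    return 100 * (after - before)
```

Because both rankings draw from the same pool, the lift is capped by how many relevant documents the embedder surfaced in the first place. That is the mechanism behind zembed-1's larger lift range.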
harrier-27b vs voyage-4: the real fight for second place
I expected harrier-27b with its 27B parameters and #1 MTEB debut to comfortably displace voyage-4 from the #2 spot. It didn't.
They're dead even on NDCG@10 at 0.699. voyage-4 edges ahead on Recall@100 (0.731 vs 0.728) and wins 12 datasets to harrier's 11 in the head-to-head.
What actually differentiates them is deployment: voyage-4 is API-only and proprietary, while harrier-27b is MIT-licensed and self-hostable. If you need open weights with no API dependency, harrier-27b wins that argument regardless of the quality tie. If your workload skews multilingual, harrier also has a real edge: it was trained across 94 languages with GPT-5 synthetic data, and it shows on non-English reranking tasks.
Dataset-by-dataset: harrier-27b vs zembed-1
I went dataset by dataset across the full 24. zembed-1 beats harrier-27b on 14 of them. The pattern is telling:
- zembed-1 dominates on instruction retrieval (Core17, News21, Robust04), tasks that require parsed query intent rather than keyword overlap, and on legal and medical corpora (LegalBench, CovidRetrieval, TRECCOVID).
- harrier-27b shows genuine strength on multilingual reranking: RuBQReranking (Russian) and TwitterHjerne (Danish). If your use case is multilingual and reranking-heavy, this is worth knowing.
Among the three top models, zembed-1 takes 1st place on 11 of 23 datasets vs. 6 each for voyage-4 and harrier-27b. It's not just the average that's better: it's the most consistently top-ranked model.
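Counting first-place finishes is a trivially small computation, but writing it down makes the tie-handling explicit. A sketch with hypothetical names; ties go to whichever model `max()` sees first, so break them upstream if your scores can collide:

```python
from collections import Counter

def first_place_counts(scores):
    """scores: {dataset: {model: metric}} -> first-place finishes per model."""
    wins = Counter()
    for per_model in scores.values():
        wins[max(per_model, key=per_model.get)] += 1
    return wins
```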
The efficiency problem
harrier-27b: 27B parameters, 5,376-dimensional vectors. zembed-1: 4B parameters, 2,560-dimensional vectors.
That's roughly 7x the compute and 2x the storage for 0.2 points worse NDCG@10 and 2.2 points worse Recall@100. In a batch job, maybe you absorb that. In a real-time RAG system, you're paying a serious penalty for strictly worse results.
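The storage math is easy to check yourself. A sketch for a flat fp32 index on a hypothetical 10M-document corpus (the corpus size is my example, and real deployments often quantize or compress below this):

```python
def index_size_gb(n_docs, dim, bytes_per_dim=4):
    """Raw vector storage for a flat fp32 index, before compression."""
    return n_docs * dim * bytes_per_dim / 1e9

# Hypothetical 10M-document corpus:
harrier_gb = index_size_gb(10_000_000, 5376)  # ~215 GB
zembed_gb = index_size_gb(10_000_000, 2560)   # ~102 GB
```

Dimensionality also taxes every query: distance computations and reranker I/O scale linearly with vector width, so the 2,560 vs. 5,376 gap is paid on every search, not just at indexing time.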
My take
harrier-27b is a legitimate top-three model and the strongest new entrant since voyage-4. For multilingual workloads or teams that need self-hostable open weights, it's worth serious evaluation, and it's genuinely competitive with voyage-4 on those terms.
But it doesn't change the leaderboard. zembed-1 wins 14 of 24 datasets head-to-head, leads on Recall@100, and does it at a fraction of the compute.