I compared harrier-27b vs voyage-4 vs zembed-1 across 24 datasets. 27B parameters

Reddit r/LocalLLaMA / 4/11/2026


Key Points

  • The author evaluated embedding models Harrier-27B, Voyage-4, and ZEmbed-1 on 24 datasets using continuous 0–10 relevance scoring and three independent LLM judges, finding them nearly tied on NDCG@10 (≈0.699–0.701).
  • Despite the NDCG tie, ZEmbed-1 leads operationally on Recall@100 (0.750 vs. 0.731 for Voyage-4 and 0.728 for Harrier-27B), which matters because retrievers cannot recover documents the embedder never surfaces.
  • In stacked retrieval+reranking experiments, ZEmbed-1 shows the largest recall-to-precision improvement from reranking (+5.2% to +6.6%), while Harrier-27B trails (+4.2% to +4.4%) and Voyage-4 sits in between (+4.5% to +4.9%).
  • The “real fight” between Harrier-27B and Voyage-4 comes down to deployment: Voyage-4 is API-only and proprietary, whereas Harrier-27B is MIT-licensed and self-hostable, making it the better choice when open weights are required.
  • The results suggest Harrier-27B’s quality is competitive but may have a lower ceiling than ZEmbed-1 (and narrowly behind Voyage-4) for RAG-style retrieval pipelines where recall drives downstream gains.

I've been running embedding model evals for a while now, and Microsoft's Harrier family dropped a new model. btw harrier-27b hit #1 on binary MTEB at launch. That's not nothing. So I put it through the same graded evaluation pipeline I use for everything else - 24 datasets, three independent LLM judges, continuous relevance scores 0–10. No binary pass/fail.
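For anyone who wants to replicate the scoring setup, NDCG@10 over graded 0–10 labels looks roughly like this. This is my own minimal sketch, not the author's actual pipeline: `ndcg_at_k`, the judge-averaging step, and the data layout are all assumptions.

```python
import math

def ndcg_at_k(judge_scores, ranking, k=10):
    """NDCG@k with continuous 0-10 relevance grades.

    judge_scores: {doc_id: [score_judge1, score_judge2, score_judge3]}
    ranking: doc_ids in the order the embedder retrieved them.
    """
    # Average the three independent judges into one graded label per doc.
    rel = {d: sum(s) / len(s) for d, s in judge_scores.items()}
    # Discounted cumulative gain over the top-k retrieved docs.
    dcg = sum(rel.get(d, 0.0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:k]))
    # Ideal DCG: the best possible ordering of the graded labels.
    ideal = sorted(rel.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

The point of continuous grades is exactly what shows up here: a 9/10 doc at rank 1 and a 2/10 doc at rank 2 are scored differently, where binary pass/fail would treat them the same.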

The global numbers

Model         NDCG@10   Recall@100
zembed-1      0.701     0.750
voyage-4      0.699     0.731
harrier-27b   0.699     0.728

On NDCG@10, it's basically a three-way tie at the top. harrier-27b is legitimately competitive; I won't pretend otherwise. But NDCG@10 isn't the whole story, especially in RAG pipelines.

The number that actually matters operationally is Recall@100. That's whether a relevant document even survives to your reranker. Your reranker can reorder whatever the embedder surfaces, but it cannot conjure up a document the embedder dropped. zembed-1 leads by +2.2 points over harrier-27b here. That gap compounds downstream.
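The "survives to your reranker" framing is just recall over the candidate pool. A minimal sketch (the function name and signature are mine, not from the post):

```python
def recall_at_k(relevant, retrieved, k=100):
    """Fraction of relevant docs that survive into the top-k candidate pool.

    Anything outside this pool is invisible to every later stage:
    a reranker can reorder the k candidates but never re-add a miss.
    """
    pool = set(retrieved[:k])
    return len(pool & set(relevant)) / len(relevant) if relevant else 0.0
```

A 0.750 vs 0.728 gap means that out of every 1,000 relevant documents, zembed-1 hands about 22 more of them to the reranker than harrier-27b does.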

Where reranking amplifies the recall advantage

When I stacked each embedder with a reranker, the recall-to-precision conversion rates told an even clearer story:

Method                   Top-10 lift range
harrier-27b + reranker   +4.2% to +4.4%
voyage-4 + reranker      +4.5% to +4.9%
zembed-1 + reranker      +5.2% to +6.6%

zembed-1 consistently extracts more signal from the reranking step because it hands the reranker a better candidate pool to begin with. harrier-27b's ceiling is lower at every threshold tested.
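To make "recall-to-precision conversion" concrete, here's one plausible way to compute a top-10 lift: relative improvement in Precision@10 when the same candidate pool is reranked. This is my interpretation of the metric, not the author's exact definition.

```python
def top10_lift(base_ranking, reranked, relevant, k=10):
    """Relative Precision@10 improvement from reranking the same pool."""
    def precision(ranking):
        return len(set(ranking[:k]) & set(relevant)) / k
    before, after = precision(base_ranking), precision(reranked)
    return (after - before) / before if before else float("inf")
```

Under this reading, zembed-1's larger lift is unsurprising: a reranker can only promote relevant docs that are already somewhere in the pool, so a higher-recall pool leaves more headroom for the reranker to exploit.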

harrier-27b vs voyage-4: the real fight for second place

I expected harrier-27b with its 27B parameters and #1 MTEB debut to comfortably displace voyage-4 from the #2 spot. It didn't.

They're dead even on NDCG@10 at 0.699. voyage-4 edges ahead on Recall@100 (0.731 vs 0.728) and wins 12 datasets to harrier's 11 in the head-to-head.

What actually differentiates them is deployment: voyage-4 is API-only and proprietary, harrier-27b is MIT-licensed and self-hostable. If you need open weights with no API dependency, harrier-27b wins that argument regardless of the quality tie. If your workload skews multilingual, harrier also has a real edge: it was trained across 94 languages with GPT-5 synthetic data, and it shows on non-English reranking tasks.

Dataset-by-dataset: harrier-27b vs zembed-1

I went dataset by dataset across the full 24. zembed-1 beats harrier-27b on 14 of them. The pattern is telling:

  • zembed-1 dominates on instruction retrieval (Core17, News21, Robust04), tasks that require parsed query intent rather than keyword overlap, and on legal and medical corpora (LegalBench, CovidRetrieval, TRECCOVID).
  • harrier-27b shows genuine strength on multilingual reranking: RuBQReranking (Russian) and TwitterHjerne (Danish). If your use case is multilingual and reranking-heavy, this is worth knowing.

Among the three top models, zembed-1 takes 1st place on 11 of 23 datasets vs. 6 each for voyage-4 and harrier-27b. It's not just that its average is better; it's the most consistently top-ranked model.

The efficiency problem

harrier-27b: 27B parameters, 5,376-dimensional vectors. zembed-1: 4B parameters, 2,560-dimensional vectors.

That's roughly 7x the parameters and 2x the storage for 0.002 worse NDCG@10 and 2.2 points worse Recall@100. In a batch job, maybe you absorb that. In a real-time RAG system, you're paying a serious penalty for strictly worse results.
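Back-of-envelope on the storage side, assuming plain float32 vectors and no ANN-index overhead (both assumptions mine):

```python
def index_bytes(n_docs, dims, bytes_per_float=4):
    """Raw float32 vector storage for a dense index (no ANN overhead)."""
    return n_docs * dims * bytes_per_float

# Per million documents indexed:
harrier = index_bytes(1_000_000, 5376)  # ~21.5 GB
zembed = index_bytes(1_000_000, 2560)   # ~10.2 GB
```

The dimension ratio (5376 / 2560 = 2.1x) is what drives the storage gap, and it also shows up as higher per-query dot-product cost at search time; the 27B-vs-4B parameter ratio (~6.75x) is what drives embedding-time compute.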

My take

harrier-27b is a legitimate top-three model, and the strongest new entrant since voyage-4. For multilingual workloads or teams that need self-hostable open weights, it's worth serious evaluation, and it's genuinely competitive with voyage-4 on those terms.

But it doesn't change the leaderboard. zembed-1 wins 14 of 24 datasets head-to-head, leads on Recall@100, and does it at a fraction of the compute.

submitted by /u/Veronildo