[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking

Reddit r/LocalLLaMA / 4/11/2026


Key Points

  • Most embedding models can’t be naively truncated for Matryoshka-style dimension dropping, since straight truncation sharply degrades similarity and retrieval quality.
  • The article shows a simple alternative: fit PCA on a sample of your embeddings, rotate vectors into the PCA basis, then truncate—achieving much higher cosine similarity (e.g., 256D after PCA reaches ~0.974 vs ~0.467 without PCA), without any retraining.
  • It combines PCA rotation/truncation with scalar quantization (including an orthogonal rotation to make coordinates more Gaussian) to reach large compression, reporting up to ~27x smaller embeddings while retaining usable retrieval performance.
  • In a benchmark on 2.4M embeddings, PCA-quantized variants yield moderate single-stage recall@10, but adding standard 5x oversampling plus exact reranking boosts performance dramatically, reaching ~99.4% recall@10.
  • A key finding is that cosine similarity can “mislead” at high compression: one setting has lower cosine yet higher recall, implying that ranking quality and reconstruction metrics diverge under aggressive compression.

Most embedding models (BGE-M3, E5, ada-002, Cohere) weren't trained with Matryoshka losses, so you can't just drop trailing dimensions. We tried: truncating BGE-M3 from 1024 to 256 dims gives 0.467 cosine similarity. Unusable.

The fix is embarrassingly simple. Fit PCA on a sample of your embeddings (~5K vectors is enough), then rotate all vectors into the principal-component basis before truncating. PCA orders the new coordinates by explained variance, so truncation now discards the least informative directions instead of arbitrary ones.
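Under the hood this is one SVD plus one matrix multiply. A minimal numpy sketch (random data stands in for real BGE-M3 embeddings; the post doesn't publish its exact fitting code, so treat the shapes and sample size as illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for real 1024-dim model embeddings (e.g. BGE-M3 output).
sample = rng.standard_normal((5000, 1024)).astype(np.float32)   # PCA fit sample
corpus = rng.standard_normal((20000, 1024)).astype(np.float32)  # full corpus

# "Fit PCA" = SVD of the centered sample: the rows of Vt are the principal
# directions, ordered by explained variance (largest singular value first).
mean = sample.mean(axis=0)
_, _, Vt = np.linalg.svd(sample - mean, full_matrices=False)

def rotate_truncate(x, k=256):
    # Rotate into the PCA basis, then keep only the top-k variance directions.
    return (x - mean) @ Vt[:k].T

truncated = rotate_truncate(corpus)
print(truncated.shape)  # (20000, 256)
```

Queries get the same `rotate_truncate` treatment, so cosine similarities are computed entirely in the 256-dim rotated space.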

Result: PCA truncation to 256 dims gives 0.974 cosine similarity. That's a 109% improvement from a one-line linear transformation with no retraining.

The compression pipeline

Stack PCA dimension reduction with scalar quantization (3 bits per coordinate, using the PolarQuant rotation trick from Zandieh et al., ICLR 2026):

  1. PCA rotate + truncate to 384 dims (from 1024)
  2. Random orthogonal rotation (makes coordinates ~Gaussian)
  3. Lloyd-Max 3-bit quantization + bit-packing
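The three steps can be sketched end to end in numpy. Two hedges: a plain random orthogonal rotation stands in for the PolarQuant rotation, and the 3-bit codebook uses the standard Lloyd-Max levels for a unit Gaussian (approximate values from Max's 1960 tables), not the paper's exact quantizer:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 1024, 384
sample = rng.standard_normal((5000, D)).astype(np.float32)  # stand-in embeddings
x = rng.standard_normal((1000, D)).astype(np.float32)

# 1) PCA rotate + truncate to K dims via SVD of the centered fit sample.
mean = sample.mean(axis=0)
_, _, Vt = np.linalg.svd(sample - mean, full_matrices=False)
x_pca = (x - mean) @ Vt[:K].T

# 2) Random orthogonal rotation (QR of a Gaussian matrix) to make the
#    coordinate distribution more Gaussian. (Stand-in for PolarQuant's rotation.)
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
x_rot = x_pca @ Q

# 3) 3-bit Lloyd-Max quantization with the unit-Gaussian codebook;
#    a per-vector scale (stored as one float32) normalizes coordinate variance.
levels = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152],
                  dtype=np.float32)
edges = (levels[:-1] + levels[1:]) / 2
scale = x_rot.std(axis=1, keepdims=True)
codes = np.digitize(x_rot / scale, edges).astype(np.uint8)  # values 0..7

# Bit-pack: 3 bits per coordinate -> K*3/8 = 144 payload bytes per vector.
bits = ((codes[:, :, None] >> np.array([2, 1, 0])) & 1).astype(np.uint8)
packed = np.packbits(bits.reshape(codes.shape[0], -1), axis=1)
print(packed.shape[1] + 4)  # 144 bytes + 4-byte scale = 148 bytes per vector
```

384 dims at 3 bits is 144 bytes, plus the 4-byte scale: the 148 bytes per embedding quoted below.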

Result: 27x compression (4096 bytes → 148 bytes per embedding).

The recall numbers (this is the part that matters)

We benchmarked on a 2.4M-vector cross-civilizational ethics corpus (BGE-M3 embeddings). Here's what we found:

Method                        Compression  Recall@10
Scalar int8                   4x           97.2%
TurboQuant 3-bit              10.6x        83.8%
PCA-384 + TQ3                 27.7x        77.0%
PCA-256 + TQ3                 41.0x        78.2%
Binary quantization           32x          66.6%
Product quantization (M=16)   256x         41.4%

77% single-stage recall isn't great. But with standard 5x oversampling plus exact reranking (fetch 50 candidates, rescore with the original vectors), it jumps to 99.4% recall@10. We verified this on 50K production embeddings, not synthetic data.

For comparison, TQ3 alone goes from 81% to 100% with the same reranking trick. The reranking cost is negligible — you're rescoring 50 vectors, not 2.4M.
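The two-stage scheme is a few lines. A toy sketch (int8 vectors stand in for the real compressed index; `search` and its parameters are illustrative, not the post's API):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 2000, 128
db = rng.standard_normal((N, D)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# Crude stand-in for a lossy index: int8-quantized copies of the vectors.
db_q = np.round(db * 127).astype(np.int8)

def search(query, k=10, oversample=5):
    # Stage 1: approximate scores against the compressed vectors,
    # over-fetching k * oversample candidates.
    approx = db_q.astype(np.float32) @ query
    cand = np.argpartition(-approx, k * oversample)[: k * oversample]
    # Stage 2: exact rerank of the small candidate set with original vectors.
    exact = db[cand] @ query
    return cand[np.argsort(-exact)][:k]

q = rng.standard_normal(D).astype(np.float32)
q /= np.linalg.norm(q)
top10 = search(q)
print(top10.shape)  # (10,)
```

Stage 2 touches only 50 full-precision vectors per query, which is why the reranking cost is negligible next to scanning 2.4M compressed ones.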

The surprising finding: cosine similarity lies to you

This was the most interesting part of the paper. Look at these two rows:

  • PCA-384 + TQ3: 0.979 cosine similarity, 76.4% recall@10
  • PCA-256 + TQ3: 0.963 cosine similarity, 78.2% recall@10

PCA-256 has lower cosine similarity but higher recall. The per-vector reconstruction fidelity metric diverges from the ranking quality metric at high compression. Small perturbations distributed across many vectors can swap the order of closely-ranked items even when each individual vector looks good.

Takeaway: If you're evaluating embedding compression for retrieval, report recall@k, not just cosine similarity. We almost made this mistake ourselves — the cosine numbers made PCA-384 look better than PCA-256, but recall tells the opposite story.
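Measuring both metrics side by side is cheap, so there's no excuse to report only one. A small evaluation helper (my own sketch, not the paper's benchmark code; brute-force search, so use a query subsample at scale):

```python
import numpy as np

def eval_compression(db, db_approx, queries, k=10):
    """Return (mean cosine to originals, recall@k) for a compressed index."""
    # Reconstruction metric: per-vector cosine between original and compressed.
    cos = np.mean(np.sum(db * db_approx, axis=1)
                  / (np.linalg.norm(db, axis=1) * np.linalg.norm(db_approx, axis=1)))
    # Ranking metric: overlap of brute-force top-k lists.
    hits = 0
    for q in queries:
        true_top = set(np.argsort(-(db @ q))[:k].tolist())
        approx_top = set(np.argsort(-(db_approx @ q))[:k].tolist())
        hits += len(true_top & approx_top)
    return cos, hits / (k * len(queries))

rng = np.random.default_rng(4)
db = rng.standard_normal((1000, 64)).astype(np.float32)
queries = rng.standard_normal((20, 64)).astype(np.float32)
cos, recall = eval_compression(db, np.round(db * 10) / 10, queries)
```

If the two numbers rank your candidate configurations differently, trust recall@k: it is the quantity retrieval users actually experience.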

What doesn't work

  • Naive truncation of non-Matryoshka models. Just dropping dims is catastrophic (0.467 cosine at 50% dims, 0.333 at 25% dims). The information is distributed roughly uniformly — you need PCA to concentrate it.
  • Product quantization at the same compression range. PQ (M=16 K=256) gets 256x compression but only 41% recall. PCA-128 + TQ3 gets 79x compression at 79% recall — strictly dominates PQ in the 30-80x range.
  • Relying on cosine similarity to evaluate compression quality. We keep repeating this because it's the easiest trap to fall into.

Two bonus findings from the implementation work

Learned codebooks: The standard Lloyd-Max quantization assumes rotated coordinates are Gaussian. They're not — the tails are heavier. Training a codebook on your actual rotated data (just 1D k-means, 50 iterations) reduces quantization error by 22% at the same 3 bits. Works consistently across models.
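1D k-means is a few lines of Lloyd's algorithm. A sketch (heavy-tailed Student-t samples stand in for real rotated coordinates; the quantile init and iteration count are my choices, and the Gaussian baseline uses approximate Lloyd-Max levels):

```python
import numpy as np

def learn_codebook_1d(values, bits=3, iters=50):
    """Lloyd's algorithm in 1D: fit 2**bits levels to the empirical distribution."""
    levels = np.quantile(values, np.linspace(0.05, 0.95, 2 ** bits))  # init
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2        # cell boundaries: midpoints
        assign = np.digitize(values, edges)
        for j in range(len(levels)):                  # move each level to its
            sel = values[assign == j]                 # cell's centroid
            if sel.size:
                levels[j] = sel.mean()
    return np.sort(levels)

rng = np.random.default_rng(2)
# Heavy-tailed stand-in for rotated coordinates (Student-t, df=5).
coords = rng.standard_t(5, size=200_000).astype(np.float32)
learned = learn_codebook_1d(coords)

# Compare quantization MSE against fixed unit-Gaussian Lloyd-Max levels.
gauss = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def mse(v, levels):
    edges = (levels[:-1] + levels[1:]) / 2
    return np.mean((v - levels[np.digitize(v, edges)]) ** 2)

print(mse(coords, learned) < mse(coords, gauss))  # learned wins on heavy tails
```

The win comes entirely from the outer levels migrating outward to cover the heavy tails that a Gaussian codebook clips.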

Asymmetric K/V allocation for KV caches: Keys are more sensitive to quantization than values because softmax amplifies errors in K. Using 4-bit keys / 2-bit values gives 0.995 key cosine similarity at the same storage as uniform 3-bit. Free quality win on the dimension that matters.
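The storage math works because (4 + 2) / 2 = 3 bits per coordinate on average. A toy illustration of the trade with a plain uniform quantizer (not the post's KV-cache scheme; `quantize` and its per-vector range scaling are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
keys = rng.standard_normal((1000, 128)).astype(np.float32)  # stand-in K vectors

def quantize(x, bits):
    """Uniform scalar quantization to 2**bits levels over each vector's range."""
    lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
    steps = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * steps) / steps * (hi - lo) + lo

def mean_cos(a, b):
    return np.mean(np.sum(a * b, axis=1)
                   / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)))

# 4-bit keys + 2-bit values average 3 bits/coord -- same storage as uniform
# 3-bit, but the error budget shifts toward the softmax-sensitive keys.
print(mean_cos(keys, quantize(keys, 4)), mean_cos(keys, quantize(keys, 3)))
```

Each extra bit roughly quadruples the per-coordinate precision, so moving one bit from V to K buys the keys most of the fidelity the values give up.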

The paper is under review at IEEE TAI. Code: https://github.com/ahb-sjsu/turboquant-pro (pip install turboquant-pro)

Happy to discuss the methodology or the cosine-vs-recall finding — that's the part I think has the broadest implications beyond our specific use case.

submitted by /u/ahbond