[R] PCA rotation makes non-Matryoshka embeddings truncatable — 27x compression at 99% recall with reranking

Reddit r/LocalLLaMA / 4/11/2026


Key Points

  • Most embedding models can’t be naively truncated for Matryoshka-style dimension dropping, since straight truncation sharply degrades similarity and retrieval quality.
  • The article shows a simple alternative: fit PCA on a sample of your embeddings, rotate vectors into the PCA basis, then truncate—achieving much higher cosine similarity (e.g., 256D after PCA reaches ~0.974 vs ~0.467 without PCA), without any retraining.
  • It combines PCA rotation/truncation with scalar quantization (including an orthogonal rotation to make coordinates more Gaussian) to reach large compression, reporting up to ~27x smaller embeddings while retaining usable retrieval performance.
  • In a benchmark on 2.4M embeddings, PCA-quantized variants yield moderate single-stage recall@10, but adding standard 5x oversampling plus exact reranking boosts performance dramatically, reaching ~99.4% recall@10.
  • A key finding is that cosine similarity can “mislead” at high compression: one setting has lower cosine yet higher recall, implying that ranking quality and reconstruction metrics diverge under aggressive compression.

Most embedding models (BGE-M3, E5, ada-002, Cohere) weren't trained with Matryoshka losses, so you can't just drop trailing dimensions. We tried: truncating BGE-M3 from 1024 to 256 dims gives 0.467 cosine similarity. Unusable.

The fix is embarrassingly simple. Fit PCA on a sample of your embeddings (~5K vectors is enough), then rotate all vectors into the principal-component basis before truncating. PCA orders the new coordinates by explained variance, so truncation now discards the least informative directions instead of arbitrary ones.
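Under the hood this is one SVD plus one matrix multiply. A minimal numpy sketch (random data stands in for real BGE-M3 embeddings; the post doesn't publish its exact fitting code, so treat the shapes and sample size as illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for real 1024-dim model embeddings (e.g. BGE-M3 output).
sample = rng.standard_normal((5000, 1024)).astype(np.float32)   # PCA fit sample
corpus = rng.standard_normal((20000, 1024)).astype(np.float32)  # full corpus

# "Fit PCA" = SVD of the centered sample: the rows of Vt are the principal
# directions, ordered by explained variance (largest singular value first).
mean = sample.mean(axis=0)
_, _, Vt = np.linalg.svd(sample - mean, full_matrices=False)

def rotate_truncate(x, k=256):
    # Rotate into the PCA basis, then keep only the top-k variance directions.
    return (x - mean) @ Vt[:k].T

truncated = rotate_truncate(corpus)
print(truncated.shape)  # (20000, 256)
```

Queries get the same `rotate_truncate` treatment, so cosine similarities are computed entirely in the 256-dim rotated space.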

Result: PCA truncation to 256 dims gives 0.974 cosine similarity. That's a 109% improvement from a one-line linear transformation with no retraining.

The compression pipeline

Stack PCA dimension reduction with scalar quantization (3 bits per coordinate, using the PolarQuant rotation trick from Zandieh et al., ICLR 2026):

  1. PCA rotate + truncate to 384 dims (from 1024)
  2. Random orthogonal rotation (makes coordinates ~Gaussian)
  3. Lloyd-Max 3-bit quantization + bit-packing
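The three steps can be sketched end to end in numpy. Two hedges: a plain random orthogonal rotation stands in for the PolarQuant rotation, and the 3-bit codebook uses the standard Lloyd-Max levels for a unit Gaussian (approximate values from Max's 1960 tables), not the paper's exact quantizer:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 1024, 384
sample = rng.standard_normal((5000, D)).astype(np.float32)  # stand-in embeddings
x = rng.standard_normal((1000, D)).astype(np.float32)

# 1) PCA rotate + truncate to K dims via SVD of the centered fit sample.
mean = sample.mean(axis=0)
_, _, Vt = np.linalg.svd(sample - mean, full_matrices=False)
x_pca = (x - mean) @ Vt[:K].T

# 2) Random orthogonal rotation (QR of a Gaussian matrix) to make the
#    coordinate distribution more Gaussian. (Stand-in for PolarQuant's rotation.)
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
x_rot = x_pca @ Q

# 3) 3-bit Lloyd-Max quantization with the unit-Gaussian codebook;
#    a per-vector scale (stored as one float32) normalizes coordinate variance.
levels = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152],
                  dtype=np.float32)
edges = (levels[:-1] + levels[1:]) / 2
scale = x_rot.std(axis=1, keepdims=True)
codes = np.digitize(x_rot / scale, edges).astype(np.uint8)  # values 0..7

# Bit-pack: 3 bits per coordinate -> K*3/8 = 144 payload bytes per vector.
bits = ((codes[:, :, None] >> np.array([2, 1, 0])) & 1).astype(np.uint8)
packed = np.packbits(bits.reshape(codes.shape[0], -1), axis=1)
print(packed.shape[1] + 4)  # 144 bytes + 4-byte scale = 148 bytes per vector
```

384 dims at 3 bits is 144 bytes, plus the 4-byte scale: the 148 bytes per embedding quoted below.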

Result: 27x compression (4096 bytes → 148 bytes per embedding).

The recall numbers (this is the part that matters)

We benchmarked on a 2.4M-vector cross-civilizational ethics corpus (BGE-M3 embeddings). Here's what we found:

Method                        Compression  Recall@10
Scalar int8                   4x           97.2%
TurboQuant 3-bit              10.6x        83.8%
PCA-384 + TQ3                 27.7x        77.0%
PCA-256 + TQ3                 41.0x        78.2%
Binary quantization           32x          66.6%
Product quantization (M=16)   256x         41.4%

77% single-stage recall isn't great. But with standard 5x oversampling plus exact reranking (fetch 50 candidates, rescore with the original vectors), it jumps to 99.4% recall@10. We verified this on 50K production embeddings, not synthetic data.

For comparison, TQ3 alone goes from 81% to 100% with the same reranking trick. The reranking cost is negligible — you're rescoring 50 vectors, not 2.4M.
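The two-stage scheme is a few lines. A toy sketch (int8 vectors stand in for the real compressed index; `search` and its parameters are illustrative, not the post's API):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 2000, 128
db = rng.standard_normal((N, D)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)

# Crude stand-in for a lossy index: int8-quantized copies of the vectors.
db_q = np.round(db * 127).astype(np.int8)

def search(query, k=10, oversample=5):
    # Stage 1: approximate scores against the compressed vectors,
    # over-fetching k * oversample candidates.
    approx = db_q.astype(np.float32) @ query
    cand = np.argpartition(-approx, k * oversample)[: k * oversample]
    # Stage 2: exact rerank of the small candidate set with original vectors.
    exact = db[cand] @ query
    return cand[np.argsort(-exact)][:k]

q = rng.standard_normal(D).astype(np.float32)
q /= np.linalg.norm(q)
top10 = search(q)
print(top10.shape)  # (10,)
```

Stage 2 touches only 50 full-precision vectors per query, which is why the reranking cost is negligible next to scanning 2.4M compressed ones.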

The surprising finding: cosine similarity lies to you

This was the most interesting part of the paper. Look at these two rows:

  • PCA-384 + TQ3: 0.979 cosine similarity, 76.4% recall@10
  • PCA-256 + TQ3: 0.963 cosine similarity, 78.2% recall@10

PCA-256 has lower cosine similarity but higher recall. The per-vector reconstruction fidelity metric diverges from the ranking quality metric at high compression. Small perturbations distributed across many vectors can swap the order of closely-ranked items even when each individual vector looks good.

Takeaway: If you're evaluating embedding compression for retrieval, report recall@k, not just cosine similarity. We almost made this mistake ourselves — the cosine numbers made PCA-384 look better than PCA-256, but recall tells the opposite story.
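Measuring both metrics side by side is cheap, so there's no excuse to report only one. A small evaluation helper (my own sketch, not the paper's benchmark code; brute-force search, so use a query subsample at scale):

```python
import numpy as np

def eval_compression(db, db_approx, queries, k=10):
    """Return (mean cosine to originals, recall@k) for a compressed index."""
    # Reconstruction metric: per-vector cosine between original and compressed.
    cos = np.mean(np.sum(db * db_approx, axis=1)
                  / (np.linalg.norm(db, axis=1) * np.linalg.norm(db_approx, axis=1)))
    # Ranking metric: overlap of brute-force top-k lists.
    hits = 0
    for q in queries:
        true_top = set(np.argsort(-(db @ q))[:k].tolist())
        approx_top = set(np.argsort(-(db_approx @ q))[:k].tolist())
        hits += len(true_top & approx_top)
    return cos, hits / (k * len(queries))

rng = np.random.default_rng(4)
db = rng.standard_normal((1000, 64)).astype(np.float32)
queries = rng.standard_normal((20, 64)).astype(np.float32)
cos, recall = eval_compression(db, np.round(db * 10) / 10, queries)
```

If the two numbers rank your candidate configurations differently, trust recall@k: it is the quantity retrieval users actually experience.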

What doesn't work

  • Naive truncation of non-Matryoshka models. Just dropping dims is catastrophic (0.467 cosine at 50% dims, 0.333 at 25% dims). The information is distributed roughly uniformly — you need PCA to concentrate it.
  • Product quantization at the same compression range. PQ (M=16 K=256) gets 256x compression but only 41% recall. PCA-128 + TQ3 gets 79x compression at 79% recall — strictly dominates PQ in the 30-80x range.
  • Relying on cosine similarity to evaluate compression quality. We keep repeating this because it's the easiest trap to fall into.

Two bonus findings from the implementation work

Learned codebooks: The standard Lloyd-Max quantization assumes rotated coordinates are Gaussian. They're not — the tails are heavier. Training a codebook on your actual rotated data (just 1D k-means, 50 iterations) reduces quantization error by 22% at the same 3 bits. Works consistently across models.
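1D k-means is a few lines of Lloyd's algorithm. A sketch (heavy-tailed Student-t samples stand in for real rotated coordinates; the quantile init and iteration count are my choices, and the Gaussian baseline uses approximate Lloyd-Max levels):

```python
import numpy as np

def learn_codebook_1d(values, bits=3, iters=50):
    """Lloyd's algorithm in 1D: fit 2**bits levels to the empirical distribution."""
    levels = np.quantile(values, np.linspace(0.05, 0.95, 2 ** bits))  # init
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2        # cell boundaries: midpoints
        assign = np.digitize(values, edges)
        for j in range(len(levels)):                  # move each level to its
            sel = values[assign == j]                 # cell's centroid
            if sel.size:
                levels[j] = sel.mean()
    return np.sort(levels)

rng = np.random.default_rng(2)
# Heavy-tailed stand-in for rotated coordinates (Student-t, df=5).
coords = rng.standard_t(5, size=200_000).astype(np.float32)
learned = learn_codebook_1d(coords)

# Compare quantization MSE against fixed unit-Gaussian Lloyd-Max levels.
gauss = np.array([-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152])

def mse(v, levels):
    edges = (levels[:-1] + levels[1:]) / 2
    return np.mean((v - levels[np.digitize(v, edges)]) ** 2)

print(mse(coords, learned) < mse(coords, gauss))  # learned wins on heavy tails
```

The win comes entirely from the outer levels migrating outward to cover the heavy tails that a Gaussian codebook clips.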

Asymmetric K/V allocation for KV caches: Keys are more sensitive to quantization than values because softmax amplifies errors in K. Using 4-bit keys / 2-bit values gives 0.995 key cosine similarity at the same storage as uniform 3-bit. Free quality win on the dimension that matters.
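The storage math works because (4 + 2) / 2 = 3 bits per coordinate on average. A toy illustration of the trade with a plain uniform quantizer (not the post's KV-cache scheme; `quantize` and its per-vector range scaling are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
keys = rng.standard_normal((1000, 128)).astype(np.float32)  # stand-in K vectors

def quantize(x, bits):
    """Uniform scalar quantization to 2**bits levels over each vector's range."""
    lo, hi = x.min(axis=1, keepdims=True), x.max(axis=1, keepdims=True)
    steps = 2 ** bits - 1
    return np.round((x - lo) / (hi - lo) * steps) / steps * (hi - lo) + lo

def mean_cos(a, b):
    return np.mean(np.sum(a * b, axis=1)
                   / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)))

# 4-bit keys + 2-bit values average 3 bits/coord -- same storage as uniform
# 3-bit, but the error budget shifts toward the softmax-sensitive keys.
print(mean_cos(keys, quantize(keys, 4)), mean_cos(keys, quantize(keys, 3)))
```

Each extra bit roughly quadruples the per-coordinate precision, so moving one bit from V to K buys the keys most of the fidelity the values give up.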

The paper is under review at IEEE TAI. Code: https://github.com/ahb-sjsu/turboquant-pro (pip install turboquant-pro)

Happy to discuss the methodology or the cosine-vs-recall finding — that's the part I think has the broadest implications beyond our specific use case.

submitted by /u/ahbond