ArcFace embeddings quantized to 16-bit pgvector HALFVEC? [D]

Reddit r/MachineLearning / 4/12/2026

💬 Opinion · Developer Stack & Infrastructure · Ideas & Deep Analysis · Tools & Practical Usage

Key Points

  • The post argues that storing 512-d face embeddings as 32-bit floats in pgvector can exceed PostgreSQL TOAST inline thresholds, causing extra I/O due to out-of-line storage.
  • It suggests that using a HALFVEC (16-bit quantization) could cut storage in half and improve read efficiency by keeping embeddings inline rather than in TOAST.
  • The author questions whether 32-bit precision is necessary for ArcFace embeddings, noting training losses often separate identities significantly in embedding space.
  • The post implies that 16-bit quantization might have negligible impact on face-similarity quality, potentially affecting only very small decimal-level differences, but asks whether that assumption is correct.
  • It concludes with a request for confirmation on whether 16-bit pgvector quantization for ArcFace is a standard practice or whether there are important missing considerations (accuracy/metric effects).

512-dim face embeddings as 32-bit floats are 2048 bytes, plus a 4-8 byte header, putting them just a hair over PostgreSQL's TOAST threshold (2040 bytes). That means by default PostgreSQL always dumps them into a TOAST table instead of keeping them inline (result: double the I/O, because it has to look up a pointer and do another read).
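A quick back-of-envelope check of that claim (the header size is an assumption; pgvector's exact on-disk layout and the effective TOAST threshold vary by version and table settings):

```python
# Approximate on-disk sizes for a 512-d pgvector value.
# VECTOR_HEADER is assumed: a 4-byte varlena header plus dim/unused fields.
DIM = 512
VECTOR_HEADER = 8
TOAST_THRESHOLD = 2040  # threshold cited in the post (~2 kB by default)

vector_bytes = VECTOR_HEADER + DIM * 4   # float32 elements
halfvec_bytes = VECTOR_HEADER + DIM * 2  # float16 elements

print(vector_bytes, vector_bytes > TOAST_THRESHOLD)    # 2056 True
print(halfvec_bytes, halfvec_bytes > TOAST_THRESHOLD)  # 1032 False
```

So the float32 version lands just over the line while the float16 version sits comfortably under it, which is the whole inline-vs-TOAST argument.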

Obviously HNSW bypasses this issue entirely, but I'm wondering whether 32-bit precision for ArcFace embeddings even makes a difference. The loss functions these models are trained with tend to push same-identity and different-identity faces pretty far apart in embedding space. So it should be fine to quantize these to 16 bits; if my math maths, that's not going to make a difference in real-world situations (if you translate it to a normalized 0.0-100.0 "face similarity" score, we're talking differences somewhere around the third decimal place, so 0.001 or so).
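A quick numpy sketch to sanity-check the precision claim, with random unit vectors standing in for real L2-normalized ArcFace embeddings (real embeddings aren't Gaussian, so treat this as a rough upper-bound intuition, not a benchmark):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random 512-d unit vectors as stand-ins for normalized face embeddings.
a = rng.standard_normal(512).astype(np.float32)
b = rng.standard_normal(512).astype(np.float32)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

def cosine(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

full = cosine(a, b)
# Round-trip through float16 to simulate HALFVEC storage.
half = cosine(a.astype(np.float16).astype(np.float32),
              b.astype(np.float16).astype(np.float32))

# float16 keeps ~11 bits of mantissa, so the similarity shift is tiny,
# typically well below 1e-3 for unit vectors at this dimensionality.
print(abs(full - half))
```

That shift is orders of magnitude smaller than the typical gap between same-identity and different-identity cosine scores, which is the intuition behind the third-decimal-place estimate above.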

A HALFVEC would be half the storage and half the I/O ops, because the embeddings would get stored inline rather than spilled out to TOAST, and would get picked up in the same page read.

Does this sound right? Is this a pretty standard way to quantize ArcFace embeddings or am I missing something?

submitted by /u/dangerousdotnet