LVFace performance vs. ArcFace/ResNet

Reddit r/MachineLearning / 3/29/2026

Key Points

The post asks for real-world benchmarks on ByteDance’s LVFace (ICCV 2025) as a replacement for an InsightFace-style face recognition pipeline using SCRFD detection plus ArcFace/ResNet embeddings.
The author is mainly concerned about production tradeoffs: LVFace’s ViT backbone may be slower than r50 ArcFace and could increase VRAM requirements for high-concurrency batching.
A key motivation is improved facial discrimination under occlusions such as masks, since ArcFace reportedly may produce misleading embeddings by focusing on masked regions rather than the most informative facial areas.
The author also seeks evidence that LVFace’s reported challenge performance (e.g., Masked Face Recognition) translates to better field recall, especially for large-scale searches against million+ identity galleries with attention to false positives or embedding drift.

I’m looking at swapping my current face recognition stack for LVFace (the ByteDance paper from ICCV 2025) and wanted to see if anyone has real-world benchmarks yet.

Currently, I’m running a standard InsightFace-style pipeline: SCRFD (det_10g) feeding into the Buffalo_L (ArcFace) models. It’s reliable, and I've tuned it to run quickly and with predictable VRAM usage in a long-running environment. LVFace, however, uses a Vision Transformer (ViT) backbone instead of the usual ResNet/CNN setup, and it supposedly took 1st place in the MFR-Ongoing challenge.

In particular, I'm interested in better facial discrimination and recall on partially occluded (e.g. mask-wearing) faces. ArcFace tends to get confused by masks: it will happily compute nonsense embeddings for the masked part of the face rather than say "Oh, that's a mask, let me focus more on the peri-orbital region and give that more weight in the embedding".

LVFace supposedly solves this. I've done some small-scale testing, but I'm wondering if anyone has tried using it in production. If you’ve tested it, I’m curious about:

Inference Speed: ViTs can be heavy—how much slower is it compared to the r50 Buffalo model in practice?
VRAM Usage: Is the footprint manageable for high-concurrency batching?
Masks/Occlusions: It won the Masked Face Recognition challenge, but does that actually translate to better field performance for you?
Recall at Scale: Any issues with embedding drift or false positives when searching against a million+ identity gallery?
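
For anyone who wants to post comparable numbers, here is a rough sketch of how the speed and recall questions could be measured the same way for both models. Everything in it is an assumption for illustration: `latency_ms`, `tar_at_far`, and `cosine` are made-up helper names, and `embed` is a placeholder for whatever callable wraps your ArcFace or LVFace model (no LVFace or InsightFace API is assumed). It uses only the standard library and does not measure VRAM.

```python
import math
import statistics
import time
from typing import Callable, List, Sequence, Tuple

Embedding = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def latency_ms(embed: Callable[[list], List[Embedding]], batch: list,
               warmup: int = 3, runs: int = 20) -> dict:
    """Time repeated calls of a batched embedder; report median and max in ms.

    `embed` is a hypothetical wrapper around the model under test. On a GPU it
    should block until the result is ready, otherwise asynchronous execution
    makes the timings look better than they are.
    """
    for _ in range(warmup):  # discard warm-up calls (lazy init, caches)
        embed(batch)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        embed(batch)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


def tar_at_far(genuine: Sequence[float], impostor: Sequence[float],
               far: float) -> Tuple[float, float]:
    """True accept rate at a threshold that keeps the false accept rate <= far.

    `genuine` holds similarity scores for same-identity pairs, `impostor` for
    different-identity pairs. A pair is accepted when its score is strictly
    above the threshold.
    """
    ranked = sorted(impostor, reverse=True)
    allowed = int(far * len(ranked))  # impostor scores allowed above threshold
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    accepted = sum(1 for s in genuine if s > threshold)
    return accepted / len(genuine), threshold


if __name__ == "__main__":
    # Stand-in embedder so the script runs without any model installed.
    def fake_embed(batch: list) -> List[Embedding]:
        return [[1.0, 0.0] for _ in batch]

    print(latency_ms(fake_embed, batch=list(range(32))))
    print(tar_at_far([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.1], far=0.25))
```

Running the same masked and unmasked pair lists through both models and comparing `tar_at_far` at a strict `far` would show whether the challenge result carries over to your own data. For the million+ gallery case, the `far` value needs to be very small, which means the impostor list has to be large enough for the threshold to be meaningful.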
