FaceLiVTv2: An Improved Hybrid Architecture for Efficient Mobile Face Recognition

arXiv cs.CV / 4/13/2026


Key Points

  • The paper introduces FaceLiVTv2, a lightweight hybrid CNN–Transformer architecture aimed at improving the accuracy–efficiency trade-off for mobile and edge face recognition under tight latency, memory, and energy constraints.
  • FaceLiVTv2’s key innovation is Lite MHLA, which replaces a heavier multi-layer attention design with multi-head linear token projections and affine rescale transformations to reduce redundancy while maintaining diversity across attention heads.
  • The model integrates Lite MHLA into a unified RepMix block to coordinate global–local feature interactions and uses global depthwise convolution for adaptive spatial aggregation during embedding generation.
  • Experiments on benchmarks including LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show consistent accuracy improvements over existing lightweight methods while boosting runtime efficiency.
  • Reported performance gains include a 22% reduction in mobile inference latency vs. FaceLiVTv1 and up to 30.8% speedups over GhostFaceNets, with additional 20–41% latency improvements over EdgeFace and KANFace while retaining higher recognition accuracy.
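The Lite MHLA mechanism described above combines multi-head linear token projections with per-head affine rescaling. As the paper's exact formulation is not reproduced here, the following is only a generic sketch of that idea: a multi-head *linear* attention (kernelized, so cost grows linearly in token count) followed by a per-head affine rescale. All names, the ELU+1 feature map, and the random stand-in weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lite_mhla_sketch(x, num_heads=4, seed=0):
    # Illustrative multi-head *linear* attention with per-head affine
    # rescaling -- a generic sketch of the mechanism, NOT the authors'
    # actual Lite MHLA code.
    n, d = x.shape                      # n tokens, d channels
    assert d % num_heads == 0
    hd = d // num_heads                 # per-head dimension
    rng = np.random.default_rng(seed)
    # Linear token projections for Q, K, V (random weights stand in
    # for learned parameters).
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Per-head affine rescale (gamma, beta); identity-initialized here.
    gamma = np.ones((num_heads, hd))
    beta = np.zeros((num_heads, hd))
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))  # ELU(t)+1 feature map
    out = np.empty_like(x)
    for h in range(num_heads):
        s = slice(h * hd, (h + 1) * hd)
        pq, pk, vh = phi(q[:, s]), phi(k[:, s]), v[:, s]
        # Linear attention: associate (K^T V) first, so the cost is
        # O(n * hd^2) rather than the quadratic O(n^2 * hd) of softmax
        # attention -- the source of the mobile-latency savings.
        kv = pk.T @ vh                          # (hd, hd)
        denom = pq @ pk.sum(axis=0)             # (n,) normalizer, always > 0
        attn = (pq @ kv) / denom[:, None]       # (n, hd)
        out[:, s] = gamma[h] * attn + beta[h]   # affine rescale per head
    return out
```

The per-head `gamma`/`beta` rescale is one plausible reading of the "affine rescale transformations" that keep the heads from collapsing into redundant representations while staying essentially free at inference time.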

Abstract

Lightweight face recognition is increasingly important for deployment on edge and mobile devices, where strict constraints on latency, memory, and energy consumption must be met alongside reliable accuracy. Although recent hybrid CNN–Transformer architectures have advanced global context modeling, striking an effective balance between recognition performance and computational efficiency remains an open challenge. In this work, we present FaceLiVTv2, an improved version of our FaceLiVT hybrid architecture designed for efficient global–local feature interaction in mobile face recognition. At its core is Lite MHLA, a lightweight global token interaction module that replaces the original multi-layer attention design with multi-head linear token projections and affine rescale transformations, reducing redundancy while preserving representational diversity across heads. We further integrate Lite MHLA into a unified RepMix block that coordinates local and global feature interactions and adopts global depthwise convolution for adaptive spatial aggregation in the embedding stage. Under our experimental setup, results on LFW, CA-LFW, CP-LFW, CFP-FP, AgeDB-30, and IJB show that FaceLiVTv2 consistently improves the accuracy–efficiency trade-off over existing lightweight methods. Notably, FaceLiVTv2 reduces mobile inference latency by 22% relative to FaceLiVTv1, achieves speedups of up to 30.8% over GhostFaceNets on mobile devices, and delivers 20–41% latency improvements over EdgeFace and KANFace across platforms while maintaining higher recognition accuracy. These results demonstrate that FaceLiVTv2 offers a practical and deployable solution for real-time face recognition. Code is available at https://github.com/novendrastywn/FaceLiVT.
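The "global depthwise convolution for adaptive spatial aggregation" mentioned in the abstract can be illustrated with a minimal sketch: a depthwise filter whose kernel spans the entire spatial map, so each channel is collapsed to a single embedding value by a *learned* spatial weighting rather than a uniform average. The function name and the explicit `weights` argument below are illustrative assumptions, not the authors' embedding head.

```python
import numpy as np

def global_depthwise_conv(feat, weights):
    # Global depthwise convolution sketch: one depthwise filter per
    # channel whose kernel covers the whole (h, w) map, reducing each
    # channel to a single value via a learned spatial weighting.
    # `weights` stands in for a learned parameter tensor.
    c, h, w = feat.shape
    assert weights.shape == (c, h, w)
    return (feat * weights).sum(axis=(1, 2))   # (c,) embedding vector
```

With uniform weights of 1/(h*w), this reduces exactly to global average pooling; the learned weighting lets the network emphasize face-discriminative spatial regions at negligible extra cost.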