PriorNet: Prior-Guided Engagement Estimation from Face Video

arXiv cs.CV / 5/6/2026


Key Points

  • PriorNet addresses the difficulty of engagement estimation from face video by explicitly handling incomplete facial evidence and subjective/limited labels.
  • The framework injects task-relevant priors across three stages: preprocessing (using zero-frame placeholders when face detection fails), model adaptation, and objective design.
  • It adapts a frozen self-supervised video facial affect backbone (SVFAP) using Prior-guided Low-Rank Adaptation (Prior-LoRA) for parameter-efficient specialization.
  • PriorNet trains with a Dirichlet-evidential, uncertainty-weighted loss under hard-label supervision to better account for uncertainty.
  • Experiments on EngageNet, DAiSEE, DREAMS, and PAFE show consistent improvements over the strongest prior reference on each benchmark, and ablations suggest the gains come from complementary preprocessing, adaptation, and objective-level priors.
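The zero-frame placeholder idea from the preprocessing stage can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `detect_face` is a hypothetical detector that returns a cropped, resized face array or `None` on failure, and the key point is that failed detections are kept in the sequence as all-zero frames rather than being dropped.

```python
import numpy as np

def build_face_sequence(frames, detect_face, size=(112, 112)):
    """Build a fixed-shape face sequence, keeping missing-face events visible.

    `detect_face` is a hypothetical detector returning a cropped face as a
    (H, W, 3) uint8 array, or None when no face is found. Instead of
    dropping failed frames, an all-zero placeholder is inserted so the
    temporal position of each missing-face event is preserved.
    """
    sequence = []
    for frame in frames:
        face = detect_face(frame)
        if face is None:
            # Zero-frame placeholder: the model still "sees" that the
            # face was absent at this time step.
            sequence.append(np.zeros((*size, 3), dtype=np.uint8))
        else:
            sequence.append(face)  # assumed already cropped/resized
    return np.stack(sequence)  # shape (T, H, W, 3)
```

Because the placeholder occupies a real time step, a downstream video backbone can learn that stretches of absent faces (e.g. the subject looking away) are themselves informative for engagement.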

Abstract

Engagement estimation from face video remains challenging because facial evidence is often incomplete, labeled data are limited, and engagement annotations are subjective. We present PriorNet, a prior-guided framework that injects task-relevant priors at three stages of the pipeline: preprocessing, model adaptation, and objective design. PriorNet converts face-detection failures into explicit zero-frame placeholders so that missing-face events remain represented in the input sequence, adapts a frozen Self-supervised Video Facial Affect Perceiver (SVFAP) backbone through a Prior-guided Low-Rank Adaptation module (Prior-LoRA) for parameter-efficient specialization, and trains with a Dirichlet-evidential, uncertainty-weighted objective under hard-label supervision. We evaluate PriorNet on EngageNet, DAiSEE, DREAMS, and PAFE using each dataset's native evaluation protocol. Across these benchmarks, PriorNet improves over the strongest listed prior reference within each dataset's evaluation framing, while component ablations on EngageNet and DAiSEE indicate that the gains arise from complementary contributions of preprocessing, adaptation, and objective-level priors. These results support explicit prior injection as a useful design principle for face-video engagement estimation under the benchmark conditions studied in this work.
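The Dirichlet-evidential objective mentioned above can be made concrete with the standard evidential-classification formulation: map logits to non-negative evidence, set Dirichlet parameters alpha = evidence + 1, and minimize the expected cross-entropy under that Dirichlet. The sketch below shows only this evidential core; PriorNet's specific uncertainty weighting is not detailed here, so no weighting term is reproduced, and all function names are illustrative.

```python
import numpy as np
from scipy.special import digamma

def evidential_loss(logits, labels, num_classes):
    """Expected cross-entropy under a Dirichlet parameterized by evidence.

    Minimal sketch of a Dirichlet-evidential classification loss:
    evidence e = softplus(logits) >= 0, alpha = e + 1, and the loss is
    E[-log p_y] under Dirichlet(alpha), which has the closed form
    digamma(S) - digamma(alpha_y) with S = sum_k alpha_k.
    """
    evidence = np.logaddexp(0.0, logits)       # softplus -> non-negative
    alpha = evidence + 1.0                     # Dirichlet parameters
    S = alpha.sum(axis=-1, keepdims=True)      # Dirichlet strength
    y = np.eye(num_classes)[labels]            # one-hot hard labels
    # Closed-form expected negative log-likelihood under Dirichlet(alpha)
    per_sample = (y * (digamma(S) - digamma(alpha))).sum(axis=-1)
    return per_sample.mean()
```

A low Dirichlet strength S signals high predictive uncertainty, which is what makes this family of objectives a natural fit for subjective, hard-labeled engagement annotations: ambiguous clips can accrue little evidence rather than being forced into an overconfident class.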