Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback

arXiv cs.CV / 5/6/2026


Key Points

  • The paper focuses on estimating a person’s proficiency (how well they perform) rather than identifying the action itself, which is important for coaching, rehabilitation, and talent scouting.
  • It summarizes three advances for multi-view proficiency estimation on the Ego-Exo4D dataset: SkillFormer (parameter-efficient selective multi-view fusion), PATS (temporal sampling that keeps locally dense excerpts), and ProfVLM (turning proficiency estimation into conditional language generation); minimal sketches of the fusion and sampling ideas follow this list.
  • ProfVLM is designed to output both a proficiency score/label and expert-style feedback, moving the task from closed-set classification toward more interpretable and actionable outputs.
  • Across the reported experiments, the three methods together reach state-of-the-art accuracy on Ego-Exo4D while using up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines.
  • The overall trend these works highlight is a shift toward efficient multi-view systems that integrate selective fusion, proficiency-aware temporal sampling, and generative feedback that can guide users.
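
To make the fusion idea concrete, here is a minimal sketch of gated cross-view fusion in PyTorch. It illustrates the general mechanism the key points name (SkillFormer's selective fusion, and the gating idea also behind ProfVLM's cross-view projector), not the papers' released code; the class name, feature dimension, and view count are assumptions.

```python
import torch
import torch.nn as nn

class GatedViewFusion(nn.Module):
    """Minimal gated cross-view fusion (illustrative, not the papers' code).

    Assumes each camera view (one ego + several exo) has already been
    encoded by a frozen backbone into a d-dimensional clip feature.
    """

    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(dim, 1)    # scalar gate per view, from its own feature
        self.proj = nn.Linear(dim, dim)  # lightweight shared projector

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, dim)
        gates = torch.sigmoid(self.gate(views))         # (B, V, 1)
        gates = gates / gates.sum(dim=1, keepdim=True)  # normalize across views
        fused = (gates * views).sum(dim=1)              # (B, dim) weighted sum
        return self.proj(fused)

# Usage: fuse four 768-d view features for a batch of two clips.
fused = GatedViewFusion()(torch.randn(2, 4, 768))  # -> shape (2, 768)
```

Only the gate and projector are trainable here; with the per-view encoders frozen, the trainable-parameter count stays small, which is the kind of efficiency the key points describe.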

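The sampling contrast behind PATS can be sketched the same way: rather than spreading a fixed frame budget uniformly over a long clip, sample a few short excerpts densely, so brief fundamental movements keep their frame-to-frame detail. This is a hypothetical illustration of the idea only; the excerpt starts are evenly spaced here, whereas PATS chooses where the excerpts fall.

```python
import numpy as np

def uniform_sample(num_frames: int, clip_len: int) -> np.ndarray:
    """Baseline: spread the frame budget evenly over the whole clip."""
    return np.linspace(0, clip_len - 1, num_frames).astype(int)

def dense_excerpt_sample(num_frames: int, clip_len: int,
                         num_excerpts: int = 4, stride: int = 2) -> np.ndarray:
    """Locally dense sampling: a few short windows, dense frames inside each.

    Hypothetical illustration only: excerpt starts are evenly spaced here,
    whereas PATS places excerpts around the fundamental movements themselves.
    """
    per_excerpt = num_frames // num_excerpts
    window = per_excerpt * stride  # frames spanned by one excerpt
    starts = np.linspace(0, clip_len - window, num_excerpts).astype(int)
    return np.concatenate(
        [s + np.arange(per_excerpt) * stride for s in starts])

# A 30 s clip at 30 fps with a 16-frame budget:
print(uniform_sample(16, 900))        # ~60-frame gaps: fast events get one frame
print(dense_excerpt_sample(16, 900))  # 4 bursts of 4 near-consecutive frames
```
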
Abstract

Estimating how well a person performs an action, rather than which action is performed, is central to coaching, rehabilitation, and talent identification. This task is challenging because proficiency is encoded in subtle differences in timing, balance, body mechanics, and execution, often distributed across multiple views and short temporal events. We discuss three recent contributions to multi-view proficiency estimation on Ego-Exo4D. SkillFormer introduces a parameter-efficient discriminative architecture for selective multi-view fusion; PATS improves temporal sampling by preserving locally dense excerpts of fundamental movements; and ProfVLM reformulates proficiency estimation as conditional language generation, producing both a proficiency label and expert-style feedback through a gated cross-view projector and a compact language backbone. Together, these methods achieve state-of-the-art accuracy on Ego-Exo4D with up to 20x fewer trainable parameters and up to 3x fewer training epochs than video-transformer baselines, while moving from closed-set classification toward interpretable feedback generation. These results highlight a shift toward efficient, multi-view systems that combine selective fusion, proficiency-aware sampling, and actionable generative feedback.
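
The reformulation in ProfVLM, from closed-set classification to conditional language generation, amounts to conditioning a language model on fused visual features and training it to emit a proficiency label plus feedback as text. The toy model below shows only that training setup (visual prefix, next-token cross-entropy); every name and size is an assumption, and the real system uses a compact pretrained language backbone rather than a small encoder trained from scratch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyProfVLM(nn.Module):
    """Toy conditional generator (all names and sizes are assumptions).

    A fused visual feature is projected into token-embedding space and
    prepended as a prefix; the model is trained with next-token cross-entropy
    to emit text such as "Proficiency: intermediate. Feedback: ...".
    No positional encoding, to keep the sketch minimal.
    """

    def __init__(self, vocab: int = 1000, dim: int = 256, vis_dim: int = 768):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.visual_proj = nn.Linear(vis_dim, dim)  # cross-view projector stand-in
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, visual_feat: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, vis_dim) fused view feature; tokens: (B, T) target text ids
        prefix = self.visual_proj(visual_feat).unsqueeze(1)         # (B, 1, dim)
        x = torch.cat([prefix, self.embed(tokens[:, :-1])], dim=1)  # (B, T, dim)
        causal = torch.triu(
            torch.full((x.size(1),) * 2, float("-inf")), diagonal=1)
        logits = self.head(self.lm(x, mask=causal))                 # (B, T, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens.reshape(-1))

# Usage: one training step on random data.
loss = TinyProfVLM()(torch.randn(2, 768), torch.randint(0, 1000, (2, 12)))
```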