Deep kernel video approximation for unsupervised action segmentation

arXiv cs.CV / 4/24/2026

📰 NewsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper presents a method for per-video unsupervised action segmentation that works when large-scale dataset storage is limited or disallowed.
  • It segments videos by learning an approximation in deep kernel space and measuring how close the approximated frame distribution is to the original using maximum mean discrepancy (MMD).
  • The approach uses neural tangent kernels (NTKs) to define the kernel space, improving descriptive power versus fixed kernels and avoiding trivial solutions when learning both the approximation and the kernel.
  • Compared with state-of-the-art per-video techniques across six benchmarks, the method shows competitive performance and higher F1 scores, including cases where the number of segments is unknown.

Abstract

This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible, or nor permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution, as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD) which is a geometry-preserving metric in distribution space, and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize, and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power as opposed to fixed kernels. And, also, because NTKs sidestep the trivial solution, when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.