A Near-Raw Talking-Head Video Dataset for Various Computer Vision Tasks

arXiv cs.CV · March 31, 2026


Key Points

  • The paper introduces and open-sources a near-raw talking-head video dataset comprising 847 recordings (about 212 minutes) collected from 805 participants using 446 consumer webcam devices in natural environments.
  • All videos are stored with the FFV1 lossless codec and include MOS-based perceptual quality annotations plus ten quality tokens that explain 64.4% of MOS variance.
  • The authors provide a stratified benchmarking subset of 120 clips covering three content conditions: original, background blur, and background replacement.
  • Codec-efficiency experiments across H.264/H.265/H.266/AV1 show up to 71.3% VMAF BD-rate savings versus H.264, with significant encoder × dataset and encoder × content-condition interactions, indicating that both encoder choice and background processing/content type affect compression performance.
  • The dataset is positioned as a significantly larger and higher-fidelity alternative to prior talking-head webcam datasets, intended for training and benchmarking video compression and enhancement models for real-time communication.
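The BD-rate figures cited above can be reproduced in spirit with the standard Bjøntegaard delta-rate procedure: fit log-bitrate as a cubic polynomial of the quality score for each codec, then average the log-rate gap over the overlapping quality range. This is a generic sketch with hypothetical rate/VMAF points, not the paper's evaluation pipeline:

```python
import numpy as np

def bd_rate(rates_ref, scores_ref, rates_test, scores_test):
    """Bjontegaard delta-rate: average bitrate change (%) of the test
    codec vs. the reference at equal quality. Negative = bitrate savings."""
    lr_ref = np.log(np.asarray(rates_ref, dtype=float))
    lr_test = np.log(np.asarray(rates_test, dtype=float))
    # Cubic fits of log-rate as a function of quality (e.g. VMAF)
    p_ref = np.polyfit(scores_ref, lr_ref, 3)
    p_test = np.polyfit(scores_test, lr_test, 3)
    # Integrate both fits over the overlapping quality interval
    lo = max(min(scores_ref), min(scores_test))
    hi = min(max(scores_ref), max(scores_test))
    int_ref, int_test = np.polyint(p_ref), np.polyint(p_test)
    avg_ref = (np.polyval(int_ref, hi) - np.polyval(int_ref, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)
    return (np.exp(avg_test - avg_ref) - 1) * 100

# Hypothetical rate-quality points (kbps, VMAF) for illustration only
h264_rates, h264_vmaf = [1000, 2000, 4000, 8000], [70, 80, 88, 94]
h266_rates, h266_vmaf = [400, 800, 1600, 3200], [72, 82, 89, 95]
delta = bd_rate(h264_rates, h264_vmaf, h266_rates, h266_vmaf)
print(f"BD-rate (H.266 vs. H.264): {delta:.1f}%")
```

A result around -70% would match the paper's reported best case for H.266, though the numbers above are invented for the sketch.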

Abstract

Talking-head videos constitute a predominant content type in real-time communication, yet publicly available datasets for video processing research in this domain remain scarce and limited in signal fidelity. In this paper, we open-source a near-raw dataset of 847 talking-head recordings (approximately 212 minutes), each 15 s in duration, captured from 805 participants using 446 unique consumer webcam devices in their natural environments. All recordings are stored using the FFV1 lossless codec, preserving the camera-native signal -- uncompressed (24.4%) or MJPEG-encoded (75.6%) -- without additional lossy processing. Each recording is annotated with a Mean Opinion Score (MOS) and ten perceptual quality tokens that jointly explain 64.4% of the MOS variance. From this corpus, we curate a stratified benchmarking subset of 120 clips in three content conditions: original, background blur, and background replacement. Codec efficiency evaluation across four datasets and four codecs, namely H.264, H.265, H.266, and AV1, yields VMAF BD-rate savings up to -71.3% (H.266) relative to H.264, with significant encoder × dataset (η_p² = .112) and encoder × content condition (η_p² = .149) interactions, demonstrating that both content type and background processing affect compression efficiency. The dataset offers 5× the scale of the largest prior talking-head webcam dataset (847 vs. 160 clips) with lossless signal fidelity, establishing a resource for training and benchmarking video compression and enhancement models in real-time communication.
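The claim that ten quality tokens "explain 64.4% of the MOS variance" is an R² statement: regress MOS on the token annotations and report the coefficient of determination. A minimal sketch with synthetic stand-in data (the token encoding, weights, and noise level here are all assumptions, not the paper's annotations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 10 binary quality tokens per clip and a MOS label
n_clips, n_tokens = 847, 10
X = rng.integers(0, 2, size=(n_clips, n_tokens)).astype(float)
true_w = rng.normal(0, 0.4, n_tokens)          # invented token effects
mos = 3.5 + X @ true_w + rng.normal(0, 0.5, n_clips)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(n_clips), X])
coef, *_ = np.linalg.lstsq(A, mos, rcond=None)
pred = A @ coef

# R^2: fraction of MOS variance explained by the tokens
r2 = 1 - np.sum((mos - pred) ** 2) / np.sum((mos - mos.mean()) ** 2)
print(f"R^2 = {r2:.3f}")
```

With real annotations, the same regression (or an equivalent ANOVA) would yield the 64.4% figure reported for this dataset.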