Subjective Portrait Region Cropping in Landscape Videos with Temporal Annotation Smoothing

arXiv cs.CV / 4/29/2026


Key Points

  • The paper addresses the challenge of adapting landscape videos to the varied aspect ratios and orientations of mobile devices, arguing that static cropping/padding can degrade visual quality while warping can distort a video's intended meaning.
  • It proposes temporally coordinated cropping that focuses on important regions while minimizing distortion and preserving essential content across frames.
  • To enable research on subjective portrait-region cropping, the authors introduce the LIVE-YT VC dataset (1,800 videos annotated by 90 human subjects), sourced from YouTube-UGC and LSVQ, described as the largest publicly available subjective database for this task.
  • They also release a post-processed dataset variant (LIVE-YT VC++) using a new intra-frame temporal filter to smooth subjective annotations, and validate usefulness via SmartVidCrop and fine-tuned state-of-the-art video grounding models.
  • The work includes an analysis comparing their labels to video saliency annotations/predictions and plans to open-source the project for benchmarking future research.
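To make the core task concrete, here is a minimal sketch (not the paper's method) of the geometry behind region-aware aspect-ratio cropping: given a landscape frame and the x-coordinate of an important region, compute a portrait crop window of the target aspect ratio, centered on that region and clamped to the frame. The function name and parameters are illustrative assumptions.

```python
def crop_window(frame_w, frame_h, target_ar, cx):
    """Compute a portrait crop window of aspect ratio target_ar (w/h),
    centered horizontally on the salient x-coordinate cx.

    For landscape-to-portrait conversion, the full frame height is kept
    and only the width is cropped.
    """
    crop_h = frame_h
    crop_w = min(round(crop_h * target_ar), frame_w)
    # Center the window on the region of interest, clamped to frame bounds.
    x0 = min(max(cx - crop_w // 2, 0), frame_w - crop_w)
    return x0, 0, crop_w, crop_h

# e.g. a 1920x1080 frame cropped to 9:16 around a subject at x = 1500
print(crop_window(1920, 1080, 9 / 16, 1500))  # → (1196, 0, 608, 1080)
```

Applying this per frame with a fixed center reduces to static cropping; the paper's point is that the window must instead track important regions over time without jittering, which is where temporally coordinated cropping comes in.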

Abstract

With the rise of mobile video consumption on handheld devices with diverse display resolutions and orientation modes, altering videos to different aspect ratios poses challenges. Static cropping and border padding often compromise visual quality, while warping may distort a video's intended meaning. Here we advocate for a more effective approach: cropping significant regions within video frames in a temporal manner, while minimizing distortion and preserving essential content. One barrier to solving this problem is the lack of a sufficiently large-scale database devoted to informing these tasks. Towards filling this gap, we introduce the LIVE-YouTube Video Cropping (LIVE-YT VC) database, featuring 1,800 videos annotated by 90 human subjects. Using videos sourced from the YouTube-UGC and LSVQ databases, this new resource is the largest publicly available subjective video portrait region cropping database. We also introduce a post-processed version of the database, called LIVE-YT VC++, whereby a novel intra-frame temporal filter was deployed to smooth subjective annotations within each video. We demonstrate the usefulness of this new data resource using the SmartVidCrop algorithm and state-of-the-art video grounding models, in hopes of establishing our subjective dataset as a benchmark for future research. Our contributions offer a resource for advancing video aspect ratio transformation models towards ensuring that reshaped mobile-friendly video content retains its quality and meaning. Since our labels bear resemblances to video saliency annotations, we also conducted an additional analysis to explore the similarity between our labels and video saliency predictions. Finally, we repurposed state-of-the-art video grounding models for aspect ratio change tasks, and fine-tuned them on our dataset. As a service to the research community, we plan to open source the project.
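The paper's smoothing filter is not described in this summary, so as a hedged illustration only, here is one standard way annotation smoothing of this kind is often done: a sliding-window median over per-frame crop-center coordinates, which suppresses isolated annotator jitter while preserving genuine camera or subject shifts. All names and the window size are assumptions, not the authors' design.

```python
def smooth_annotations(centers, window=5):
    """Median-filter a sequence of per-frame crop-center x-coordinates.

    A median is used rather than a mean so that a single outlier
    annotation (e.g. a momentary mis-click) is discarded instead of
    being averaged into neighboring frames.
    """
    half = window // 2
    smoothed = []
    for i in range(len(centers)):
        # Truncate the window at sequence boundaries.
        lo, hi = max(0, i - half), min(len(centers), i + half + 1)
        neighborhood = sorted(centers[lo:hi])
        smoothed.append(neighborhood[len(neighborhood) // 2])
    return smoothed

# Jittery per-frame annotations with one outlier at frame 3:
print(smooth_annotations([100, 102, 101, 250, 103, 104, 102]))
# → [101, 102, 102, 103, 103, 104, 103]
```

Note the outlier value 250 is removed entirely rather than merely dampened, which is the behavior one wants when cleaning subjective per-frame labels before training or benchmarking cropping models.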