AesRM: Improving Video Aesthetics with Expert-Level Feedback
arXiv cs.CV / 5/1/2026
📰 News · Models & Research
Key Points
- The paper introduces AesRM and an evaluation framework that treat video aesthetics as a hierarchical rubric covering three dimensions (Visual Aesthetics, VA; Visual Fidelity, VF; Visual Plausibility, VP), broken down into 15 fine-grained criteria such as shot composition (a structural sketch follows this list).
- It creates large-scale expert-annotated preference data and a benchmark, AesVideo-Bench, using about 2,500 video pairs labeled by experts across VA, VF, and VP.
- Two reward-model variants are proposed: AesRM-Base predicts pairwise preferences for efficient reward signals, while AesRM-CoT additionally produces criterion-aligned chain-of-thought rationales to improve interpretability (see the pairwise-loss sketch after this list).
- Training uses a three-stage progressive scheme (Atomic Aesthetic Capability Learning, Cold-Start, and GRPO), plus self-consistency-based CoT synthesis and CoT-based process rewards to strengthen CoT quality and evaluation accuracy (a GRPO advantage sketch also follows this list).
- Experiments report that AesRM outperforms existing baselines in accuracy and robustness, and that aligning Wan2.2 with AesRM yields larger aesthetic gains than aligning with other aesthetic reward models.
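
Below is a minimal sketch, not taken from the paper, of how the hierarchical rubric described in the first point could be represented: three dimensions, each mapping to a list of fine-grained criteria. Only the dimension names and the "shot composition" criterion appear in this summary; the remaining entries are left as placeholders rather than invented.

```python
# Hypothetical representation of the AesRM hierarchical rubric.
# Only the three dimension names and "shot composition" come from the
# summary above; "..." marks criteria not named there (15 in total).
AESTHETIC_RUBRIC: dict[str, list[str]] = {
    "Visual Aesthetics (VA)": ["shot composition", "..."],
    "Visual Fidelity (VF)": ["..."],
    "Visual Plausibility (VP)": ["..."],
}
```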
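As a rough illustration of the pairwise-preference setup behind AesRM-Base, the sketch below shows a standard Bradley-Terry-style loss over scalar scores for a preferred and a rejected video. The function and variable names are hypothetical, and the paper's actual formulation may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss: -log sigmoid(r_pref - r_rej).

    Minimizing this pushes the reward model to score the expert-preferred
    video in each pair above the rejected one.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Hypothetical usage with scalar scores produced by a reward model:
# s_a = reward_model(video_a)   # shape [batch]
# s_b = reward_model(video_b)   # shape [batch]
# loss = pairwise_preference_loss(s_a, s_b)
```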
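The third training stage is GRPO; the snippet below sketches the group-relative advantage computation that characterizes GRPO-style reinforcement learning, where each sampled response's reward is normalized against its own group instead of a learned value baseline. This is a generic illustration rather than the paper's implementation; the CoT-based process rewards would feed in as the `group_rewards` values.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each sampled response's reward
    by the mean and std of its group, removing the need for a critic."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

# Hypothetical usage: rewards for G sampled CoT evaluations per prompt.
# rewards = torch.tensor([[0.8, 0.2, 0.6, 0.4]])  # shape [batch, G]
# advantages = grpo_advantages(rewards)           # same shape, zero-mean per group
```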