Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
arXiv cs.CV / 3/19/2026
📰 News · Models & Research
Key Points
- Astrolabe is an efficient online reinforcement-learning framework tailored to distilled autoregressive video models, improving alignment with human visual preferences without expensive re-distillation or solver-coupled reverse-process optimization.
- It introduces a forward-process RL formulation called negative-aware fine-tuning, which guides policy improvement through direct positive/negative sample contrasts at inference endpoints rather than reverse-process unrolling (a loss sketch follows this list).
- It enables scalable long-video alignment via a streaming training scheme with a rolling KV-cache, applying RL updates only within local clip windows while conditioning on prior context to maintain long-range coherence (see the second sketch below).
- To counter reward hacking, it combines a multi-reward objective with uncertainty-aware selective regularization and dynamic reference updates (see the third sketch below). Experiments show improved generation quality across multiple distilled AR video models.
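
The summary only names negative-aware fine-tuning, so here is a minimal sketch of what a forward-process positive/negative contrast at inference endpoints could look like. The function name `negative_aware_loss`, the reward-threshold split, and the `neg_weight` parameter are assumptions for illustration, not the paper's actual objective.

```python
import torch

def negative_aware_loss(logp, rewards, threshold=0.0, neg_weight=1.0):
    """Hypothetical forward-process contrastive objective.

    logp:    (B,) log-likelihoods of sampled clips under the policy,
             computed in a single forward pass (no reverse-process unrolling).
    rewards: (B,) scalar preference scores taken at the inference endpoint.
    """
    pos = rewards > threshold          # positive samples: raise likelihood
    neg = ~pos                         # negative samples: lower likelihood
    loss = torch.zeros((), device=logp.device)
    if pos.any():
        loss = loss - logp[pos].mean()                   # pull toward preferred clips
    if neg.any():
        loss = loss + neg_weight * logp[neg].mean()      # push away from dispreferred clips
    return loss
```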
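In the same hedged spirit, one plausible wiring of the streaming scheme with a rolling KV-cache is sketched below: gradients flow only through the current clip window, while earlier frames condition generation through detached cached keys/values. The `model(clip, kv_cache=...)` interface, `detach_cache`, and the REINFORCE-style surrogate are illustrative assumptions; the summary does not specify the actual update rule.

```python
import torch

def detach_cache(cache):
    # Detach cached keys/values so earlier windows stay out of the autograd graph.
    return [(k.detach(), v.detach()) for k, v in cache]

def streaming_rl_step(model, video_frames, reward_fn, opt, window=16):
    """Hypothetical streaming RL over a long video of shape (B, T, ...):
    update only within each local clip window, conditioning on prior
    context via a rolling, gradient-free KV-cache."""
    cache = None
    for start in range(0, video_frames.size(1), window):
        clip = video_frames[:, start:start + window]
        logp, new_cache = model(clip, kv_cache=cache)   # grads only on this clip
        reward = reward_fn(clip)                        # endpoint preference score
        loss = -(reward.detach() * logp).mean()         # REINFORCE-style surrogate
        opt.zero_grad()
        loss.backward()
        opt.step()
        cache = detach_cache(new_cache)                 # roll forward without grads
```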
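Finally, a sketch of how a multi-reward objective with uncertainty-aware selective regularization and a dynamic reference might fit together. The equal reward weighting, std-based disagreement measure, gating threshold `tau`, penalty weight `beta`, and the EMA reference update are all assumptions, not the paper's method.

```python
import torch

def multi_reward(scores):
    """scores: dict name -> (B,) per-sample rewards from several reward models.
    Returns (combined, uncertainty); equal weighting and std-based
    disagreement are assumptions."""
    stacked = torch.stack(list(scores.values()), dim=0)   # (K, B)
    return stacked.mean(dim=0), stacked.std(dim=0)

def regularized_objective(logp_policy, logp_ref, reward, uncertainty,
                          beta=0.1, tau=0.5):
    """Illustrative selective regularization: penalize divergence from the
    reference only where the reward ensemble disagrees, so confidently
    scored samples move freely while noisy ones are damped."""
    log_ratio = logp_policy - logp_ref            # per-sample log-ratio to reference
    gate = (uncertainty > tau).float()            # regularize only uncertain samples
    return -(reward * logp_policy).mean() + beta * (gate * log_ratio.abs()).mean()

def update_reference(ref_model, policy, momentum=0.99):
    # Dynamic reference via an EMA of policy weights (an assumption).
    with torch.no_grad():
        for p_ref, p in zip(ref_model.parameters(), policy.parameters()):
            p_ref.mul_(momentum).add_(p, alpha=1 - momentum)
```

Gating the penalty on reward-model disagreement is one plausible reading of "selective": it restrains the policy exactly where the reward signal is least trustworthy, which is where reward hacking would otherwise be cheapest.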