Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

arXiv cs.CV / 4/7/2026

Key Points

The paper studies multi-reference, multi-shot video generation and pinpoints “reference confusion” as a core failure mode when reference images have very similar appearances.
It argues that semantic retrieval alone is insufficient: when references are visually close, their semantically similar tokens can cause the model to retrieve the wrong context.
To mitigate this, the authors propose PoCo (Position Embedding as a Context Controller), which uses positional encoding as extra token-level context control to enable more precise matching.
The resulting multi-reference, multi-shot video generation model built on PoCo is designed to reliably control characters with extremely similar visual traits.
According to the authors, experiments show that PoCo improves cross-shot consistency and reference fidelity compared with several baselines.

Abstract

Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model's ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.
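
The abstract does not spell out how PoCo injects position information, so the following is only a toy illustration of the underlying idea, not the authors' method. It assumes two reference tokens with near-identical semantic features and shows that dot-product attention on semantics alone cannot tell them apart, while appending a positional code shared by the query and its intended reference breaks the tie. The sinusoidal encoding, the concatenation scheme, and all names (`sinusoid`, `tag`, `POS_A`, `POS_B`, `GAIN`) are hypothetical choices made for this sketch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sinusoid(pos, dim=4, base=10000.0):
    """Sinusoidal code for an integer position (an assumed encoding)."""
    out = []
    for i in range(dim // 2):
        freq = 1.0 / (base ** (2 * i / dim))
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over the keys."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, k) / scale for k in keys])


# Two reference tokens whose semantic features are almost identical.
ref_a = [1.0, 0.9, 0.1, 0.0]
ref_b = [1.0, 0.9, 0.0, 0.1]
# A query that is equally similar to both references.
query_sem = [1.0, 0.9, 0.05, 0.05]

# Semantic retrieval alone: the weights split evenly (reference confusion).
semantic_only = attention_weights(query_sem, [ref_a, ref_b])

# Hypothetical position tags: each reference gets its own index, and the
# query carries the index of the reference it is supposed to follow (B).
POS_A, POS_B = 0, 8
GAIN = 4.0  # assumed weight of the positional part relative to semantics


def tag(vec, pos):
    """Append a scaled positional code to a semantic vector."""
    return vec + [GAIN * p for p in sinusoid(pos)]


with_position = attention_weights(
    tag(query_sem, POS_B),
    [tag(ref_a, POS_A), tag(ref_b, POS_B)],
)

print("semantic only :", semantic_only)
print("with position :", with_position)
```

In this toy setup the semantic scores are identical for both references, so the position tag is the only signal that separates them. A real model would apply such control inside its attention layers rather than by concatenating vectors by hand.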
