Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation
arXiv cs.CV / 4/7/2026
Key Points
- The paper studies multi-reference, multi-shot video generation and pinpoints “reference confusion” as a core failure mode when reference images have very similar appearances.
- It argues that semantic retrieval alone is insufficient: when reference images are visually close, their tokens become semantically similar, so the model can retrieve context from the wrong reference.
- To mitigate this, the authors propose PoCo (Position Embedding as a Context Controller), which uses positional encoding as extra token-level context control to enable more precise matching.
- The resulting multi-reference, multi-shot video generation model built on PoCo is designed to reliably control characters with extremely similar visual traits.
- Experiments show PoCo improves cross-shot consistency and reference fidelity versus multiple baseline approaches.
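The mechanism behind these points can be illustrated with a toy NumPy sketch. This is an illustrative assumption, not the paper's actual architecture: the function names, dimensions, and the random "position embeddings" are all made up for the example. It shows why semantic attention alone confuses two near-identical references, and how tagging each reference's tokens with a distinct position embedding restores an unambiguous matching signal.

```python
import numpy as np

def make_reference_position_embeddings(num_refs, dim, seed=0):
    # Hypothetical stand-in for learned per-reference position embeddings:
    # one distinct vector per reference image.
    rng = np.random.default_rng(seed)
    return rng.normal(size=(num_refs, dim))

def attention_weights(query, keys):
    # Standard scaled dot-product attention over the reference tokens.
    logits = keys @ query / np.sqrt(query.shape[0])
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

dim = 128
rng = np.random.default_rng(1)
shared = rng.normal(size=dim)
# Two visually near-identical references -> almost identical token embeddings.
ref_a = shared + 0.01 * rng.normal(size=dim)
ref_b = shared + 0.01 * rng.normal(size=dim)
keys = np.stack([ref_a, ref_b])

# Semantics-only matching: a query aimed at reference A attends almost
# uniformly, because B's tokens look just as similar ("reference confusion").
ambiguous = attention_weights(ref_a, keys)

# Tag each reference's tokens (and the query) with its position embedding;
# matching now has a positional channel that cleanly separates A from B.
pos = make_reference_position_embeddings(num_refs=2, dim=dim)
resolved = attention_weights(ref_a + pos[0], keys + pos)
```

In this sketch `ambiguous` stays close to a 50/50 split while `resolved` concentrates on the intended reference, which is the failure mode and fix the paper describes at the level of token-wise context control.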