TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
arXiv cs.CV · April 22, 2026
Key Points
- The paper addresses the unsolved challenge of generating coherent videos from complex temporal descriptions involving multiple sequential actions.
- It identifies two main causes of failure in existing approaches: temporal misalignment between the generated video and the prompt, and conflicting attention coupling between motion-related visual elements and their corresponding text conditions.
- The authors propose TS-Attn, a training-free attention mechanism that dynamically rearranges attention to improve both temporal awareness and global coherence for multi-event scenarios.
- TS-Attn can be added to various pre-trained text-to-video models, improving StoryEval-Bench scores by 33.5% (Wan2.1-T2V-14B) and 16.4% (Wan2.2-T2V-A14B) with only about a 2% increase in inference time.
- The method is designed for plug-and-play use, including multi-event image-to-video generation, and the project code is released on GitHub.
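The idea of separating attention along the temporal axis can be illustrated with a minimal sketch. The code below is not the paper's implementation; it is a hedged toy example showing one plausible reading of the mechanism: cross-attention is masked so that frames assigned to event *i* attend only to the text tokens of event *i*, plus shared "global" tokens (marked `-1`). The function name, the event-assignment arrays, and the masking scheme are all illustrative assumptions.

```python
import numpy as np

def separable_cross_attention(q, k, v, frame_events, token_events):
    """Toy sketch of temporal-wise separable cross-attention.

    q: (frames, d) frame queries; k, v: (tokens, d), (tokens, dv) text keys/values.
    frame_events[i] is the event index of frame i; token_events[j] is the
    event index of text token j, or -1 for tokens shared across all events.
    Frames may only attend to same-event or shared tokens (an assumption,
    not the paper's exact rule).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (frames, tokens)
    fe = np.asarray(frame_events)[:, None]        # (frames, 1)
    te = np.asarray(token_events)[None, :]        # (1, tokens)
    allowed = (fe == te) | (te == -1)             # same event, or global token
    scores = np.where(allowed, scores, -1e9)      # mask out cross-event pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ v

# Two frames (event 0 and event 1), three text tokens (event 0, event 1, shared).
q = np.ones((2, 4))
k = np.ones((3, 4))
v = np.eye(3)  # one-hot values make the attention weights visible in the output
out = separable_cross_attention(q, k, v, frame_events=[0, 1], token_events=[0, 1, -1])
# Frame 0 puts no weight on event 1's token; its mass splits between
# token 0 and the shared token.
```

Because masking happens before the softmax, it needs no retraining: the same trick can be patched into a pre-trained model's attention layers at inference time, which is consistent with the paper's reported ~2% inference-time overhead for a training-free module.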
