AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

arXiv cs.CV / 5/6/2026


Key Points

  • AniMatrix (arXiv:2605.03652v1) is a new anime-focused video generation model designed to prioritize artistic correctness over physical realism.
  • It uses a dual-channel conditioning approach: a Production Knowledge System that encodes anime production variables (style, motion, camera, VFX) plus an AniCaption module that infers those directives from pixels.
  • A structured injection mechanism combines cross-attention for fine control with AdaLN modulation for global enforcement so categorical anime directives are not overridden by free-form text.
  • Training is guided by a style–motion–deformation curriculum and uses deformation-aware preference optimization with a domain-specific reward model to distinguish intentional art from failure.
  • In human evaluations by professional animators across five production dimensions, AniMatrix ranks first in four of five, with its largest gains over Seedance-Pro 1.0 in Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent).
  • The team plans to publicly release the model weights and inference code.
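
To make the dual-channel conditioning and dual-path injection described in the key points more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. The `Directives` class, the toy `embed_tag` function, the example tag values, and the choice to feed only tag embeddings into the AdaLN path are all assumptions made for illustration; the source states only that cross-attention provides fine-grained control and AdaLN modulation provides global enforcement.

```python
import math
from dataclasses import dataclass, asdict

DIM = 4  # toy embedding width (assumption)

@dataclass(frozen=True)
class Directives:
    # The four fields follow the taxonomy named in the source;
    # the values used further down are hypothetical examples.
    style: str
    motion: str
    camera: str
    vfx: str

def embed_tag(field, value):
    # Stand-in for a trainable tag encoder: one token per field-value pair,
    # so the field-value structure is kept rather than flattened into prose.
    seed = sum(ord(c) for c in f"{field}={value}")
    return [math.sin(seed * (i + 1)) for i in range(DIM)]

def tag_tokens(directives):
    return [embed_tag(f, v) for f, v in asdict(directives).items()]

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln(x, cond):
    # AdaLN-style modulation: normalize, then scale and shift per channel.
    # Assumption: fixed multipliers replace the learned projections of a real model.
    scale = [0.1 * c for c in cond]
    shift = [0.05 * c for c in cond]
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

def cross_attention(query, keys, values):
    # Single-head scaled dot-product attention of one query over a token set.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum((w / total) * v[i] for w, v in zip(weights, values)) for i in range(d)]

def inject(latent, tags, text):
    # Path 1 (fine-grained): the latent attends over tag and text tokens.
    context = tags + text
    attended = cross_attention(latent, context, context)
    mixed = [a + b for a, b in zip(latent, attended)]  # residual connection
    # Path 2 (global): pooled tag embedding modulates the result via AdaLN.
    # Assumption: free-form text does not enter this path.
    pooled = [sum(t[i] for t in tags) / len(tags) for i in range(len(latent))]
    return adaln(mixed, pooled)

directives = Directives(style="cel_shaded", motion="smear", camera="pan", vfx="impact_frame")
latent = [0.2, -0.1, 0.4, 0.3]
text = [[0.1, 0.0, -0.2, 0.3]]  # pretend output of a frozen text encoder
out = inject(latent, tag_tokens(directives), text)
```

In this sketch the categorical directives reach the latent twice: once as tokens competing with text inside attention, and once as a global scale and shift that the text cannot touch. That is one plausible reading of how such a design keeps tags from being diluted.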

Abstract

Video generation models internalize physical realism as their prior. Anime deliberately violates physics (smears, impact frames, chibi shifts), and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artistry that defines the medium or collapse under its stylistic variance. We present AniMatrix, a video generation model that targets artistic rather than physical correctness through a dual-channel conditioning mechanism and a three-step transition: redefine correctness, override the physics prior, and distinguish art from failure. First, a Production Knowledge System encodes anime as a structured taxonomy of controllable production variables (Style, Motion, Camera, VFX), and AniCaption infers these variables from pixels as directorial directives. A trainable tag encoder preserves the field-value structure of this taxonomy while a frozen T5 encoder handles free-form narrative; dual-path injection (cross-attention for fine-grained control, AdaLN modulation for global enforcement) ensures categorical directives are never diluted by open-ended text. Second, a style-motion-deformation curriculum transitions the model from near-physical motion to full anime expressiveness. Third, deformation-aware preference optimization with a domain-specific reward model separates intentional artistry from pathological collapse. On an anime-specific human evaluation with five production dimensions scored by professional animators, AniMatrix ranks first on four of five, with the largest gains over Seedance-Pro 1.0 on Prompt Understanding (+0.70, +22.4 percent) and Artistic Motion (+0.55, +16.9 percent). We will publicly release the AniMatrix model weights and inference code.
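
The abstract reports each gain both as an absolute difference and as a percentage, which allows a rough back-calculation of the underlying scores. The sketch below assumes the percentage is measured relative to the Seedance-Pro 1.0 score; the abstract does not say so explicitly, and the reported figures are rounded. The results are therefore approximate implied values, not numbers reported by the authors. The function name `implied_scores` is hypothetical.

```python
def implied_scores(abs_gain, rel_gain_percent):
    """Return (baseline, improved) scores implied by an absolute and a relative gain.

    Assumes rel_gain_percent = 100 * abs_gain / baseline.
    """
    baseline = abs_gain / (rel_gain_percent / 100.0)
    return baseline, baseline + abs_gain

# Gains over Seedance-Pro 1.0 as stated in the abstract.
reported = {
    "Prompt Understanding": (0.70, 22.4),
    "Artistic Motion": (0.55, 16.9),
}

for name, (gain, pct) in reported.items():
    base, improved = implied_scores(gain, pct)
    print(f"{name}: implied baseline {base:.3f}, implied AniMatrix score {improved:.3f}")
```

Under that assumption both implied baselines fall a little above 3 and both implied AniMatrix scores fall a little below 4. The rating scale and the actual per-dimension scores are not given in this summary.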