AVControl: Efficient Framework for Training Audio-Visual Controls

arXiv cs.CV / 3/27/2026


Key Points

  • AVControl introduces a lightweight, extendable framework for training audio-visual control of video generation using LTX-2, where each modality is learned as a separate LoRA without requiring architectural changes.
  • The method uses a “parallel canvas” that injects the reference signal as additional tokens in the attention layers; this enables structural control where naive extensions of image in-context methods to video fail.
  • Experiments on the VACE Benchmark show AVControl outperforming all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, with competitive results on camera control and audio-visual benchmarks.
  • The framework supports many independently trained control modalities—spatial controls (depth/pose/edges), camera trajectories with intrinsics, sparse motion, video editing—and presents modular audio-visual controls for joint generation, reportedly among the first in this direction.
  • The paper emphasizes efficiency, reporting that each modality can be trained with small datasets and converges in a few hundred to a few thousand steps, and it includes public code and trained LoRA checkpoints.

Abstract

Controlling video and audio generation requires diverse modalities, from depth and pose to camera trajectories and audio transformations, yet existing approaches either train a single monolithic model for a fixed set of controls or introduce costly architectural changes for each new modality. We introduce AVControl, a lightweight, extendable framework built on LTX-2, a joint audio-visual foundation model, where each control modality is trained as a separate LoRA on a parallel canvas that provides the reference signal as additional tokens in the attention layers, requiring no architectural changes beyond the LoRA adapters themselves. We show that simply extending image-based in-context methods to video fails for structural control, and that our parallel canvas approach resolves this. On the VACE Benchmark, we outperform all evaluated baselines on depth- and pose-guided generation, inpainting, and outpainting, and show competitive results on camera control and audio-visual benchmarks. Our framework supports a diverse set of independently trained modalities: spatially-aligned controls such as depth, pose, and edges, camera trajectory with intrinsics, sparse motion control, video editing, and, to our knowledge, the first modular audio-visual controls for a joint generation model. Our method is both compute- and data-efficient: each modality requires only a small dataset and converges within a few hundred to a few thousand training steps, a fraction of the budget of monolithic alternatives. We publicly release our code and trained LoRA checkpoints.
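To make the mechanism concrete, here is a minimal sketch of the two ingredients the abstract describes: a per-modality LoRA adapter (frozen base weight plus a trainable low-rank update) and a "parallel canvas" attention step in which control-signal tokens are concatenated into the key/value sequence so the video tokens can attend to them. This is an illustrative reconstruction, not the authors' implementation: the class and function names are hypothetical, and the choice to expose the canvas only through keys/values (keeping queries and the output length tied to the video tokens) is an assumption about one plausible design.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class LoRALinear:
    """Frozen base projection W plus a trainable low-rank update A @ B.

    With B zero-initialized, the adapter starts as an exact no-op,
    so each control modality can be trained without touching W.
    (Hypothetical helper; not from the paper's code.)
    """
    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(d_in, d_out))  # frozen base weight
        self.A = rng.normal(scale=0.02, size=(d_in, rank))   # trainable down-projection
        self.B = np.zeros((rank, d_out))                     # trainable up-projection, zero-init

    def __call__(self, x):
        return x @ self.W + x @ self.A @ self.B

def parallel_canvas_attention(video_tokens, control_tokens, proj_q, proj_k, proj_v):
    """Single-head attention where the control signal rides on a 'parallel canvas':
    control tokens are appended to the key/value sequence only, so video tokens
    can attend to them while the output keeps the video sequence length.
    (Assumed formulation of the paper's token-injection idea.)
    """
    kv_input = np.concatenate([video_tokens, control_tokens], axis=0)
    q = proj_q(video_tokens)          # queries: video tokens only
    k = proj_k(kv_input)              # keys: video + control tokens
    v = proj_v(kv_input)              # values: video + control tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# Toy usage: 8 video tokens, 5 control tokens (e.g. a depth map), dim 16.
d = 16
video = np.random.default_rng(1).normal(size=(8, d))
control = np.random.default_rng(2).normal(size=(5, d))
pq, pk, pv = (LoRALinear(d, d, seed=s) for s in (10, 11, 12))
out = parallel_canvas_attention(video, control, pq, pk, pv)
print(out.shape)  # (8, 16): output length matches the video tokens, not video+control
```

Because `B` is zero at initialization, each `LoRALinear` initially reproduces the frozen base projection exactly, which is consistent with the paper's claim that no architectural changes are needed beyond the adapters: removing a modality's LoRA recovers the base model's behavior.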