Visual Implicit Autoregressive Modeling

arXiv cs.CV / 5/5/2026


Key Points

  • The paper proposes Visual Implicit Autoregressive Modeling (VIAR), which improves upon Visual Autoregressive Modeling (VAR) by inserting an implicit equilibrium layer to avoid fixed computation depth and excessive memory use at high resolutions.
  • VIAR trains the implicit layer using Jacobian-Free Backpropagation, enabling constant training memory, while inference provides a per-scale iteration “knob” to control compute dynamically.
  • On ImageNet 256×256, VIAR reports strong generative performance with FID 2.16 and sFID 8.07, using only 38.4% of VAR’s parameters while matching or outperforming strong autoregressive baselines.
  • The compute knob allows VIAR to reduce peak memory from 19.24 GB to 8.53 GB and increase throughput from 15.16 to 32.08 images/s on a single RTX 4090 without retraining.
  • Ablations indicate that the fixed-point iterations converge in few steps, and show that VIAR dominates VAR across quality/efficiency operating points, including sharper results in zero-shot in-painting and class-conditional editing.
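The "equilibrium layer with an iteration knob" idea can be sketched as a fixed-point solve whose step count is chosen at call time. This is a toy NumPy illustration of the general technique, not the paper's architecture; the map, dimensions, and scaling below are assumptions chosen so the iteration provably contracts.

```python
import numpy as np

def implicit_layer(x, W, n_iters=10):
    """Solve z = tanh(W @ z + x) by fixed-point iteration.

    n_iters is the per-call "compute knob": fewer iterations trade
    equilibrium accuracy for speed; more iterations refine it. No
    intermediate iterates need to be stored for this forward pass.
    """
    z = np.zeros_like(x)
    for _ in range(n_iters):
        z = np.tanh(W @ z + x)
    return z

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)  # spectral norm 0.9 < 1, so the
                                 # tanh map is a contraction

x = rng.standard_normal(d)
z_cheap = implicit_layer(x, W, n_iters=4)    # low-compute setting
z_fine  = implicit_layer(x, W, n_iters=100)  # near-converged equilibrium

# Residual ||z - f(z)|| measures how close each setting is to the
# true equilibrium; it shrinks geometrically as n_iters grows.
res_cheap = np.linalg.norm(z_cheap - np.tanh(W @ z_cheap + x))
res_fine  = np.linalg.norm(z_fine  - np.tanh(W @ z_fine  + x))
print(res_cheap, res_fine)
```

Because the same weights define every iteration, the step count can be changed at inference time without retraining, which is the mechanism behind the memory/throughput tradeoff reported above.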

Abstract

Visual Autoregressive Modeling (VAR) based on next-scale prediction achieves strong generation quality, but its explicit deep stacks fix the amount of computation per scale and inflate memory at high resolutions. We introduce Visual Implicit Autoregressive Modeling (VIAR), a next-scale autoregressive generator that embeds an implicit equilibrium layer between shallow pre/post blocks. The implicit layer is trained with Jacobian-Free Backpropagation, yielding constant training memory, while inference exposes a per-scale iteration knob that enables compute control. On the ImageNet 256×256 benchmark, VIAR attains FID 2.16 and sFID 8.07 with only 38.4% of VAR's parameters, matching or surpassing strong AR baselines and remaining competitive with large diffusion models. By adjusting the per-scale knob, VIAR can reduce peak memory from 19.24 GB to 8.53 GB and double throughput from 15.16 to 32.08 images/s on a single RTX 4090, without retraining. Ablations show that few steps suffice for the fixed-point iterations to converge and that VIAR consistently dominates VAR across quality/efficiency operating points. In zero-shot in-painting and class-conditional editing, VIAR produces sharper details and smoother boundaries while preserving global structure, validating the benefits of implicit equilibria and per-scale compute control for practical, deployable visual generation.
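Jacobian-Free Backpropagation, the training scheme the abstract credits for constant memory, can be illustrated in a scalar toy case: solve the fixed point without tracking gradients, then differentiate only a single final application of the layer, treating the incoming iterate as a constant. This avoids both storing the solver's intermediate steps and inverting the implicit-function Jacobian. The function and numbers below are illustrative assumptions, not from the paper.

```python
import numpy as np

def fixed_point(w, x, n_iters=100):
    """Solve z = tanh(w*z + x) by iteration (scalar toy layer)."""
    z = 0.0
    for _ in range(n_iters):
        z = np.tanh(w * z + x)
    return z

def loss(w, x, target):
    return 0.5 * (fixed_point(w, x) - target) ** 2

w, x, target = 0.3, 0.5, 1.0

# JFB: solve to equilibrium "without gradients", then backprop
# through just one step f(z*; w), ignoring dz*/dw's Jacobian term.
z_star = fixed_point(w, x)
sech2 = 1.0 - np.tanh(w * z_star + x) ** 2   # d tanh(u)/du at z*
jfb_grad = (z_star - target) * sech2 * z_star

# Finite-difference reference for the exact implicit gradient.
eps = 1e-6
fd_grad = (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2 * eps)

print(jfb_grad, fd_grad)  # same sign, similar magnitude
```

The JFB estimate drops the exact gradient's correction factor 1/(1 - ∂f/∂z), so it is biased but cheap: its memory cost is one layer application regardless of how many solver iterations the forward pass used, which is what makes the per-scale iteration knob free at training time.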