VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generation

arXiv cs.LG / 4/29/2026


Key Points

  • The paper proposes an efficient, resolution-agnostic autoregressive image synthesis method that can generate images across arbitrary resolutions and aspect ratios.
  • It introduces VibeToken, a 1D Transformer-based image tokenizer that represents an image as a dynamic, user-controllable sequence of 32–256 tokens, aiming for a strong efficiency–quality trade-off.
  • Building on that, VibeToken-Gen is a class-conditioned autoregressive generator that supports arbitrary resolutions while using substantially fewer compute resources than diffusion baselines.
  • The authors report that VibeToken-Gen can synthesize 1024×1024 images using only 64 tokens, achieving 3.94 gFID and outperforming a state-of-the-art diffusion baseline that uses 1,024 tokens and reaches 5.87 gFID.
  • Unlike fixed-resolution autoregressive models, whose inference compute grows quadratically with resolution, VibeToken-Gen keeps compute constant at 179G FLOPs (a 63.4× efficiency gain) regardless of resolution, potentially easing deployment in production.
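To make the scaling argument in the points above concrete, here is a minimal Python sketch (not the paper's code) contrasting a conventional 2D patch tokenizer, whose token count grows quadratically with resolution, against a fixed-budget 1D tokenizer in the spirit of VibeToken. The 32-pixel patch size is an assumption chosen so the 2D baseline matches the 1,024-token figure quoted above at 1024×1024.

```python
def patch_tokens(height, width, patch=32):
    """2D patch tokenizers emit one token per patch, so the token
    count grows quadratically with image side length.
    (patch=32 is an illustrative assumption, not from the paper.)"""
    return (height // patch) * (width // patch)

def fixed_1d_tokens(budget=64):
    """A 1D tokenizer with a user-chosen budget (32-256 per the
    summary) emits the same number of tokens at any resolution."""
    return budget

for side in (256, 512, 1024):
    print(f"{side}x{side}: 2D patch tokens = {patch_tokens(side, side):4d}, "
          f"1D tokens = {fixed_1d_tokens()}")
```

Because the autoregressive generator's cost is driven by sequence length, a constant 64-token sequence is what lets inference compute stay flat (the quoted 179G FLOPs) while the patch-based count, and hence a fixed-resolution AR model's compute, quadruples with every doubling of resolution.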

Abstract

We introduce an efficient, resolution-agnostic autoregressive (AR) image synthesis approach that generalizes to arbitrary resolutions and aspect ratios, narrowing the gap to diffusion models at scale. At its core is VibeToken, a novel resolution-agnostic 1D Transformer-based image tokenizer that encodes images into a dynamic, user-controllable sequence of 32–256 tokens, achieving a state-of-the-art efficiency–performance trade-off. Building on VibeToken, we present VibeToken-Gen, a class-conditioned AR generator with out-of-the-box support for arbitrary resolutions while requiring significantly fewer compute resources. Notably, VibeToken-Gen synthesizes 1024×1024 images using only 64 tokens and achieves 3.94 gFID; by comparison, a diffusion-based state-of-the-art alternative requires 1,024 tokens and attains 5.87 gFID. In contrast to fixed-resolution AR models such as LlamaGen, whose inference FLOPs grow quadratically with resolution (11T FLOPs at 1024×1024), VibeToken-Gen maintains a constant 179G FLOPs (63.4× more efficient) independent of resolution. We hope VibeToken can help unlock the wide adoption of AR visual generative models in production use cases.