ELT: Elastic Looped Transformers for Visual Generation

arXiv cs.CV / 4/13/2026


Key Points

  • The paper introduces Elastic Looped Transformers (ELT), a parameter-efficient visual generative model that reuses weight-shared recurrent transformer blocks instead of stacking many unique layers.
  • To train ELT effectively for image and video generation, the authors propose Intra-Loop Self Distillation (ILSD), distilling intermediate “student” loop configurations from a “teacher” configuration within a single training step.
  • A key capability of ELT is generating a whole family of “elastic” models from one training run, enabling any-time inference with controllable compute–quality trade-offs without changing the parameter count.
  • The reported efficiency improvements include a 4× parameter reduction under iso-inference-compute conditions while achieving FID 2.0 on ImageNet 256×256 (class-conditional) and FVD 72.8 on UCF-101 (class-conditional).
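The looped architecture described above can be sketched in a few lines. The snippet below is a toy illustration, not the authors' implementation: a single weight-shared block (here a simple residual MLP stands in for a transformer block) is applied a variable number of times, so compute scales with the loop count while the parameter count stays fixed. All names (`SharedBlock`, `looped_forward`, the dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedBlock:
    """A stand-in for the single weight-shared block that a looped
    transformer reuses; here a toy residual MLP instead of attention."""
    def __init__(self, dim):
        self.w1 = rng.normal(0, 0.02, (dim, dim))
        self.w2 = rng.normal(0, 0.02, (dim, dim))

    def __call__(self, x):
        # residual update, mirroring a transformer block's skip connection
        return x + np.tanh(x @ self.w1) @ self.w2

def looped_forward(block, x, n_loops):
    """Apply the same block n_loops times: inference compute grows with
    n_loops, but the parameter count does not change."""
    for _ in range(n_loops):
        x = block(x)
    return x

block = SharedBlock(dim=8)
x = rng.normal(size=(2, 8))                  # a toy batch of 2 tokens

# "Elastic" inference: one set of weights, several compute budgets.
fast = looped_forward(block, x, n_loops=2)   # cheap student configuration
full = looped_forward(block, x, n_loops=8)   # maximum-loop teacher configuration
```

Varying `n_loops` at inference time is what gives the any-time compute-quality trade-off: the same trained weights serve every loop budget.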

Abstract

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference with dynamic trade-offs between computational cost and generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis: with a 4× reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.
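The ILSD objective described in the abstract can be sketched as follows. This is a minimal, hypothetical reading of the idea, not the paper's loss: within one training step, the activation after the maximum loop count serves as the teacher, and activations after each intermediate loop count are students matched to it (with the teacher detached from the gradient, approximated here by a copy). The function name and the mean-squared-error choice are assumptions for illustration.

```python
import numpy as np

def ilsd_losses(loop_outputs, detach):
    """Intra-Loop Self Distillation, sketched: the output after the
    maximum number of loops acts as teacher; outputs after intermediate
    loop counts are students distilled toward it in the same step.

    loop_outputs: activations after loop 1, 2, ..., L (list of arrays).
    detach: stop-gradient stand-in applied to the teacher target.
    Returns one distillation term per student configuration."""
    teacher = detach(loop_outputs[-1])
    return [float(np.mean((student - teacher) ** 2))
            for student in loop_outputs[:-1]]

# Toy usage with random stand-in activations for 4 loop iterations
rng = np.random.default_rng(1)
outs = [rng.normal(size=(4,)) for _ in range(4)]
losses = ilsd_losses(outs, detach=np.copy)   # np.copy ≈ detach here
```

Training every intermediate configuration against the deepest one in a single step is what lets one run produce the whole elastic family, rather than training a separate model per compute budget.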