ELT: Elastic Looped Transformers for Visual Generation

arXiv cs.CV / 4/13/2026


Key Points

  • The paper introduces Elastic Looped Transformers (ELT), a parameter-efficient visual generative model that reuses weight-shared recurrent transformer blocks instead of stacking many unique layers.
  • To train ELT effectively for image and video generation, the authors propose Intra-Loop Self Distillation (ILSD), distilling intermediate “student” loop configurations from a “teacher” configuration within a single training step.
  • A key capability of ELT is generating a whole family of “elastic” models from one training run, enabling any-time inference with controllable compute–quality trade-offs without changing the parameter count.
  • The reported efficiency improvements include a 4× parameter reduction under iso-inference-compute conditions while achieving FID 2.0 on ImageNet 256×256 (class-conditional) and FVD 72.8 on UCF-101 (class-conditional).
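The looped architecture described above can be sketched in a few lines. The snippet below is a toy illustration, not the authors' implementation: a single weight-shared block (here a simple residual MLP stands in for a transformer block) is applied a variable number of times, so compute scales with the loop count while the parameter count stays fixed. All names (`SharedBlock`, `looped_forward`, the dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class SharedBlock:
    """A stand-in for the single weight-shared block that a looped
    transformer reuses; here a toy residual MLP instead of attention."""
    def __init__(self, dim):
        self.w1 = rng.normal(0, 0.02, (dim, dim))
        self.w2 = rng.normal(0, 0.02, (dim, dim))

    def __call__(self, x):
        # residual update, mirroring a transformer block's skip connection
        return x + np.tanh(x @ self.w1) @ self.w2

def looped_forward(block, x, n_loops):
    """Apply the same block n_loops times: inference compute grows with
    n_loops, but the parameter count does not change."""
    for _ in range(n_loops):
        x = block(x)
    return x

block = SharedBlock(dim=8)
x = rng.normal(size=(2, 8))                  # a toy batch of 2 tokens

# "Elastic" inference: one set of weights, several compute budgets.
fast = looped_forward(block, x, n_loops=2)   # cheap student configuration
full = looped_forward(block, x, n_loops=8)   # maximum-loop teacher configuration
```

Varying `n_loops` at inference time is what gives the any-time compute-quality trade-off: the same trained weights serve every loop budget.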

Abstract

We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To train these models effectively for image and video generation, we propose Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth within a single training step. Our framework yields a family of elastic models from a single training run, enabling any-time inference with dynamic trade-offs between computational cost and generation quality at a fixed parameter count. ELT significantly shifts the efficiency frontier for visual synthesis: with a 4× reduction in parameter count under iso-inference-compute settings, ELT achieves a competitive FID of 2.0 on class-conditional ImageNet 256×256 and an FVD of 72.8 on class-conditional UCF-101.
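The ILSD objective described in the abstract can be sketched as follows. This is a minimal, hypothetical reading of the idea, not the paper's loss: within one training step, the activation after the maximum loop count serves as the teacher, and activations after each intermediate loop count are students matched to it (with the teacher detached from the gradient, approximated here by a copy). The function name and the mean-squared-error choice are assumptions for illustration.

```python
import numpy as np

def ilsd_losses(loop_outputs, detach):
    """Intra-Loop Self Distillation, sketched: the output after the
    maximum number of loops acts as teacher; outputs after intermediate
    loop counts are students distilled toward it in the same step.

    loop_outputs: activations after loop 1, 2, ..., L (list of arrays).
    detach: stop-gradient stand-in applied to the teacher target.
    Returns one distillation term per student configuration."""
    teacher = detach(loop_outputs[-1])
    return [float(np.mean((student - teacher) ** 2))
            for student in loop_outputs[:-1]]

# Toy usage with random stand-in activations for 4 loop iterations
rng = np.random.default_rng(1)
outs = [rng.normal(size=(4,)) for _ in range(4)]
losses = ilsd_losses(outs, detach=np.copy)   # np.copy ≈ detach here
```

Training every intermediate configuration against the deepest one in a single step is what lets one run produce the whole elastic family, rather than training a separate model per compute budget.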