Efficient Autoregressive Inference for Transformer Probabilistic Models

arXiv stat.ML / 4/22/2026


Key Points

  • Set-based transformer probabilistic models can do single-pass marginal predictions well, but producing joint distributions typically requires costly re-encoding of the entire context at every autoregressive step.
  • The paper proposes a causal autoregressive buffer that caches the context once and incrementally buffers previously generated targets, letting new predictions attend to both cached context and buffered targets.
  • This design supports efficient batched autoregressive sampling and joint predictive density evaluation without the quadratic-like overhead of repeated full re-encoding.
  • The training approach blends set-based and autoregressive modes via masked attention with minimal added overhead.
  • Experiments on synthetic functions, EEG time series, Bayesian model comparison, and tabular regression show up to 20× faster joint sampling/density evaluation and up to 7× lower memory usage while matching full re-encoding performance.
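The buffering scheme in the key points can be sketched in a few lines. The code below is a toy illustration, not the paper's implementation: `encode`, `attend`, and the random queries are stand-ins for the real transformer components. The essential pattern is that the context is encoded once and cached, while each new prediction attends over the cached context plus a growing buffer of previously generated targets, with no re-encoding inside the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # toy feature dimension

def encode(x):
    """Stand-in for the set-based transformer context encoder."""
    return np.tanh(x)

def attend(q, kv):
    """Toy single-head dot-product attention of one query over rows of kv."""
    scores = kv @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ kv

def sample_joint(context, n_targets):
    cached = encode(context)           # context is encoded ONCE and cached
    buffer = np.empty((0, D))          # causal buffer of generated targets
    samples = []
    for _ in range(n_targets):
        q = rng.normal(size=D)         # toy query for the next target
        memory = np.vstack([cached, buffer])   # cached context + buffered targets
        y = attend(q, memory)          # new prediction attends to both
        samples.append(y)
        buffer = np.vstack([buffer, y[None]])  # append; no re-encoding
    return np.stack(samples)

context = rng.normal(size=(8, D))
out = sample_joint(context, n_targets=3)
print(out.shape)
```

Because the loop only appends one row to `buffer` per step, the per-step cost grows with the number of generated targets rather than with the full context size, which is where the claimed speed and memory savings over full re-encoding come from.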

Abstract

Set-based transformer models for amortized probabilistic inference and meta-learning, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many applications require joint distributions over multiple predictions. Purely autoregressive architectures generate these efficiently but sacrifice flexible set-conditioning. Obtaining joint distributions from set-based models requires re-encoding the entire context at each autoregressive step, which scales poorly. We introduce a causal autoregressive buffer that combines the strengths of both paradigms. The model encodes the context once and caches it; a lightweight causal buffer captures dependencies among generated targets, with each new prediction attending to both the cached context and all previously predicted targets added to the buffer. This enables efficient batched autoregressive sampling and joint predictive density evaluation. Training integrates set-based and autoregressive modes through masked attention at minimal overhead. Across synthetic functions, EEG time series, a Bayesian model comparison task, and tabular regression, our method closely matches the performance of full context re-encoding while delivering up to 20× faster joint sampling and density evaluation, and up to 7× lower memory usage.
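The abstract's point about integrating set-based and autoregressive modes "through masked attention" can be made concrete with a block attention mask. The sketch below assumes a token layout of context tokens followed by target tokens in generation order; the function name and layout are illustrative assumptions, not the paper's API. Context tokens attend to each other as an unordered set, targets attend to the full context, and targets attend causally among themselves.

```python
import numpy as np

def joint_training_mask(n_ctx, n_tgt):
    """Boolean attention mask (True = attention allowed), assuming the
    sequence is [context tokens..., target tokens in generation order]."""
    n = n_ctx + n_tgt
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_ctx, :n_ctx] = True   # context block: full set attention
    mask[n_ctx:, :n_ctx] = True   # every target sees the whole context
    # targets see only earlier (and their own) target positions: causal block
    mask[n_ctx:, n_ctx:] = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))
    return mask

m = joint_training_mask(2, 3)
print(m.astype(int))
```

With a mask of this shape, a single forward pass trains both behaviors at once: the context block behaves like the usual set-based encoder, while the causal target block supervises the autoregressive dependencies, which matches the abstract's claim of minimal added training overhead.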
