Efficient Autoregressive Inference for Transformer Probabilistic Models

arXiv stat.ML / 4/22/2026


Key Points

  • Set-based transformer probabilistic models can do single-pass marginal predictions well, but producing joint distributions typically requires costly re-encoding of the entire context at every autoregressive step.
  • The paper proposes a causal autoregressive buffer that caches the context once and incrementally buffers previously generated targets, letting new predictions attend to both cached context and buffered targets.
  • This design supports efficient batched autoregressive sampling and joint predictive density evaluation without the quadratic-like overhead of repeated full re-encoding.
  • The training approach blends set-based and autoregressive modes via masked attention with minimal added overhead.
  • Experiments on synthetic functions, EEG time series, Bayesian model comparison, and tabular regression show up to 20× faster joint sampling/density evaluation and up to 7× lower memory usage while matching full re-encoding performance.
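The buffering scheme in the key points can be sketched in a few lines. The code below is a toy illustration, not the paper's implementation: `encode`, `attend`, and the random queries are stand-ins for the real transformer components. The essential pattern is that the context is encoded once and cached, while each new prediction attends over the cached context plus a growing buffer of previously generated targets, with no re-encoding inside the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # toy feature dimension

def encode(x):
    """Stand-in for the set-based transformer context encoder."""
    return np.tanh(x)

def attend(q, kv):
    """Toy single-head dot-product attention of one query over rows of kv."""
    scores = kv @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ kv

def sample_joint(context, n_targets):
    cached = encode(context)           # context is encoded ONCE and cached
    buffer = np.empty((0, D))          # causal buffer of generated targets
    samples = []
    for _ in range(n_targets):
        q = rng.normal(size=D)         # toy query for the next target
        memory = np.vstack([cached, buffer])   # cached context + buffered targets
        y = attend(q, memory)          # new prediction attends to both
        samples.append(y)
        buffer = np.vstack([buffer, y[None]])  # append; no re-encoding
    return np.stack(samples)

context = rng.normal(size=(8, D))
out = sample_joint(context, n_targets=3)
print(out.shape)
```

Because the loop only appends one row to `buffer` per step, the per-step cost grows with the number of generated targets rather than with the full context size, which is where the claimed speed and memory savings over full re-encoding come from.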

Abstract

Set-based transformer models for amortized probabilistic inference and meta-learning, such as neural processes, prior-fitted networks, and tabular foundation models, excel at single-pass marginal prediction. However, many applications require joint distributions over multiple predictions. Purely autoregressive architectures generate these efficiently but sacrifice flexible set-conditioning. Obtaining joint distributions from set-based models requires re-encoding the entire context at each autoregressive step, which scales poorly. We introduce a causal autoregressive buffer that combines the strengths of both paradigms. The model encodes the context once and caches it; a lightweight causal buffer captures dependencies among generated targets, with each new prediction attending to both the cached context and all previously predicted targets added to the buffer. This enables efficient batched autoregressive sampling and joint predictive density evaluation. Training integrates set-based and autoregressive modes through masked attention at minimal overhead. Across synthetic functions, EEG time series, a Bayesian model comparison task, and tabular regression, our method closely matches the performance of full context re-encoding while delivering up to 20× faster joint sampling and density evaluation, and up to 7× lower memory usage.
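The abstract's point about integrating set-based and autoregressive modes "through masked attention" can be made concrete with a block attention mask. The sketch below assumes a token layout of context tokens followed by target tokens in generation order; the function name and layout are illustrative assumptions, not the paper's API. Context tokens attend to each other as an unordered set, targets attend to the full context, and targets attend causally among themselves.

```python
import numpy as np

def joint_training_mask(n_ctx, n_tgt):
    """Boolean attention mask (True = attention allowed), assuming the
    sequence is [context tokens..., target tokens in generation order]."""
    n = n_ctx + n_tgt
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_ctx, :n_ctx] = True   # context block: full set attention
    mask[n_ctx:, :n_ctx] = True   # every target sees the whole context
    # targets see only earlier (and their own) target positions: causal block
    mask[n_ctx:, n_ctx:] = np.tril(np.ones((n_tgt, n_tgt), dtype=bool))
    return mask

m = joint_training_mask(2, 3)
print(m.astype(int))
```

With a mask of this shape, a single forward pass trains both behaviors at once: the context block behaves like the usual set-based encoder, while the causal target block supervises the autoregressive dependencies, which matches the abstract's claim of minimal added training overhead.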
