MARS: Enabling Autoregressive Models Multi-Token Generation

arXiv cs.CL / 4/9/2026

Key Points

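MARS (Mask AutoRegreSsion) applies lightweight additional training to an existing instruction-tuned autoregressive (AR) language model so that it can predict multiple tokens in a single forward pass.
MARS requires no architectural changes and no additional parameters; the resulting model is called exactly like the original AR model and supports multi-token generation with no performance degradation.
When generating one token per step, MARS matches or exceeds the AR baseline on six standard benchmarks; when generating multiple tokens per step, it maintains baseline accuracy while achieving 1.5-1.7x throughput.
A block-level KV caching strategy yields up to 1.71x wall-clock speedup in batch inference, and a confidence threshold allows real-time speed adjustment (for example, raising throughput under high load).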

Abstract

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
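
The abstract does not spell out how confidence thresholding decides which tokens to keep, so the following is only a minimal sketch under stated assumptions: the model returns one probability distribution per position in a predicted block, tokens are accepted left to right while the top probability stays at or above the threshold, and the first token is always accepted so that decoding falls back to ordinary one-token AR generation. The function name `accept_prefix` and its parameters are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def accept_prefix(block_probs: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Pick the tokens to keep from one multi-token forward pass.

    block_probs[i] is an assumed probability distribution over the vocabulary
    for the i-th position of the predicted block. This is an illustrative
    acceptance rule, not the rule confirmed by the paper.
    """
    accepted: List[int] = []
    for position, probs in enumerate(block_probs):
        # Greedy choice: the most likely token id at this position.
        token_id = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[token_id]
        # Always keep the first token (plain AR behaviour); stop at the
        # first later position whose confidence falls below the threshold.
        if position > 0 and confidence < threshold:
            break
        accepted.append(token_id)
    return accepted


# Toy block of four positions over a two-token vocabulary.
example_block = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.05, 0.95]]
strict = accept_prefix(example_block, threshold=0.7)   # fewer tokens per step
relaxed = accept_prefix(example_block, threshold=0.4)  # more tokens per step
```

Under these assumptions, lowering the threshold accepts more tokens per forward pass at the cost of keeping lower-confidence predictions, which is one way a serving system could trade quality for throughput without swapping models.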
