AI Navigate

Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

arXiv cs.CL / 3/16/2026


Key Points

  • MEAP is a training paradigm that integrates Masked Language Modeling into Next-Token Prediction by masking a small fraction of input tokens and performing autoregressive decoding with a decoder-only Transformer, removing the need for bidirectional attention or encoder-decoder MLM.
  • It imposes no additional computational overhead during pre-training or inference and significantly improves in-context retrieval and long-context reasoning, outperforming standard NTP on key information retrieval tasks and maintaining or improving performance on commonsense reasoning.
  • In supervised fine-tuning, MEAP shows substantial advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points.
  • The authors attribute MEAP’s effectiveness to more distinguishable attention scores arising from focusing on a reduced set of non-masked tokens, which helps the model attend to task-relevant signals.
  • These findings position MEAP as a promising training paradigm for large language models with potential wide-ranging impact on model training and deployment.
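The recipe described above is simple: mask a small fraction of the input tokens, leave the targets untouched, and train with the ordinary causal next-token objective. A minimal sketch of that data-preparation step is below; the function name, the `mask_id` placeholder, and the 15% default mask ratio are illustrative assumptions, not the paper's exact implementation.

```python
import random

def meap_training_pair(tokens, mask_id, mask_ratio=0.15, seed=0):
    """Build one MEAP-style (inputs, targets) pair from a token sequence.

    Per the paradigm summarized above: a small fraction of the *input*
    tokens is replaced by a mask token, while the next-token targets are
    left unmasked. Training then uses the standard autoregressive loss
    on a decoder-only model -- no bidirectional attention required.
    (mask_ratio=0.15 is an assumed default for illustration.)
    """
    rng = random.Random(seed)
    inputs = list(tokens[:-1])      # standard shifted inputs
    targets = list(tokens[1:])      # targets stay unmasked
    n_mask = max(1, int(len(inputs) * mask_ratio))
    for pos in rng.sample(range(len(inputs)), n_mask):
        inputs[pos] = mask_id       # mask the input position only
    return inputs, targets

# Example: mask one position in a 12-token sequence
inputs, targets = meap_training_pair(list(range(12)), mask_id=-1, seed=1)
```

Because only the inputs are perturbed and the loss is the usual next-token cross-entropy, this adds no extra forward passes or parameters, which matches the paper's claim of zero additional pre-training or inference overhead.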

Abstract

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models.