Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
arXiv cs.CL / 3/16/2026
Key Points
- MEAP is a training paradigm that folds Masked Language Modeling into Next-Token Prediction: a small fraction of input tokens is masked, and a decoder-only Transformer then performs standard autoregressive decoding, removing the need for bidirectional attention or an encoder-decoder MLM setup.
- It imposes no additional computational overhead during pre-training or inference and significantly improves in-context retrieval and long-context reasoning, outperforming standard NTP on key information retrieval tasks and maintaining or improving performance on commonsense reasoning.
- In supervised fine-tuning, MEAP shows substantial advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points.
- The authors attribute MEAP's effectiveness to more distinguishable attention scores: by concentrating attention on a reduced set of non-masked tokens, the model attends more reliably to task-relevant signals.
- These findings position MEAP as a promising training paradigm for large language models with potential wide-ranging impact on model training and deployment.
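The masking scheme described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the mask token id, mask ratio, and function name are placeholders, and the key idea shown is that only the *inputs* are masked while the next-token-prediction targets remain the original, unmasked sequence, so decoding stays purely causal.

```python
import random

def meap_example(token_ids, mask_token_id, mask_ratio=0.15, seed=0):
    """Build one MEAP-style training example (illustrative sketch).

    A small fraction of input tokens is replaced with a mask token,
    but the next-token-prediction labels are the ORIGINAL tokens,
    so no bidirectional attention or encoder is needed.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    n_mask = max(1, int(len(inputs) * mask_ratio))
    for i in rng.sample(range(len(inputs)), n_mask):
        inputs[i] = mask_token_id
    # Standard causal-LM shift: predict token t+1 from the masked prefix.
    model_inputs = inputs[:-1]
    targets = list(token_ids[1:])  # unmasked originals as labels
    return model_inputs, targets
```

Because the objective is still plain next-token prediction over the full sequence, this adds no extra forward passes or attention structure, consistent with the paper's claim of zero additional pre-training or inference overhead.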