WRAP++: Web discoveRy Amplified Pretraining
arXiv cs.CL / 4/9/2026
Key Points
- WRAP++ (Web discoveRy Amplified Pretraining) targets a limitation of synthetic-data rephrasing by moving from single-document rewriting to cross-document knowledge synthesis using web hyperlinks.
- The method discovers high-confidence cross-document relational motifs (e.g., dual-links and co-mentions) in the web hyperlink graph and generates joint QA that forces reasoning across pairs of documents; see the motif-discovery sketch after this list.
- By adding relational context that is not present in either source document alone, WRAP++ aims to create new entry points to the same facts and improve how LLMs learn associations.
- The discovery-driven process also grows the dataset combinatorially: the paper reports expanding ~8.4B tokens of Wikipedia text into ~80B tokens of cross-document QA (a rough pair-to-prompt sketch follows the motif example below).
- Experiments on SimpleQA using OLMo-based models (7B and 32B) show substantial and sustained gains over single-document approaches, indicating benefits from cross-document knowledge amplification.
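The hyperlink motifs can be pictured with a toy link graph. The sketch below is a minimal illustration, assuming "dual-links" means mutually linking article pairs and "co-mentions" means pairs linked from a common source article; the function names, graph representation, and example titles are hypothetical, not the paper's API.

```python
from itertools import combinations

# Toy hyperlink graph: article title -> set of outgoing link targets.
links = {
    "Marie_Curie": {"Pierre_Curie", "Radium"},
    "Pierre_Curie": {"Marie_Curie", "Radium"},
    "Radium": {"Marie_Curie"},
}

def find_dual_links(links):
    """Pairs (a, b) where a links to b and b links back to a."""
    pairs = set()
    for a, targets in links.items():
        for b in targets:
            # a < b keeps each unordered pair once
            if a < b and a in links.get(b, set()):
                pairs.add((a, b))
    return pairs

def find_co_mentions(links):
    """Pairs of articles that appear together in some source's outlinks."""
    pairs = set()
    for targets in links.values():
        pairs.update(combinations(sorted(targets), 2))
    return pairs

print(find_dual_links(links))
# {('Marie_Curie', 'Pierre_Curie'), ('Marie_Curie', 'Radium')}
print(find_co_mentions(links))
# {('Pierre_Curie', 'Radium'), ('Marie_Curie', 'Radium')}
```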
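Once pairs are discovered, each one can be turned into a joint-QA generation request. This is a rough sketch under the assumption that an external generator LLM writes the QA; the prompt wording and the commented-out `generate` hook are inventions for illustration, and the expansion ratio simply restates the paper's reported 8.4B-to-80B token numbers.

```python
def joint_qa_prompt(title_a, doc_a, title_b, doc_b):
    """Build a prompt asking a generator LLM for QA pairs that require
    combining facts from both documents (neither alone suffices)."""
    return (
        "Below are two related documents. Write question-answer pairs that "
        "can only be answered by combining facts from BOTH documents.\n\n"
        f"Document A ({title_a}):\n{doc_a}\n\n"
        f"Document B ({title_b}):\n{doc_b}\n"
    )

corpus = {
    "Marie_Curie": "Marie Curie shared the 1903 Nobel Prize in Physics...",
    "Pierre_Curie": "Pierre Curie co-discovered radium with his wife...",
}

# One prompt per discovered high-confidence pair; generating QA for every
# pair is what drives the roughly 10x expansion (~8.4B -> ~80B tokens).
for a, b in [("Marie_Curie", "Pierre_Curie")]:
    prompt = joint_qa_prompt(a, corpus[a], b, corpus[b])
    # qa_text = generator.generate(prompt)  # hypothetical LLM call
```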
Related Articles
- Meta Superintelligence Lab Releases Muse Spark: A Multimodal Reasoning Model With Thought Compression and Parallel Agents (MarkTechPost)
- Chatbots are great at manipulating people to buy stuff, Princeton boffins find (The Register)
- I tested and ranked every AI companion app I tried and here's my honest breakdown (Reddit r/artificial)
- Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption. (Dev.to)