Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization
arXiv cs.CL · 3/17/2026
Key Points
- The method uses a multi-agent workflow based on Grounded Theory to dynamically produce reusable evaluation criteria.
- MRPO enables self-reflection in models by using these dynamic criteria to guide iterative improvement without extra training.
- The training combines supervised fine-tuning with reinforcement learning to turn criteria into reward signals for end-to-end optimization.
- Experiments show writer models trained with MRPO outperform baselines on several creative writing tasks and even surpass some 100B+ parameter open-source models.
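The core idea in the third point, turning dynamic evaluation criteria into a scalar reward signal, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the criterion names, weights, and the toy judge function are all hypothetical stand-ins (a real system would query an LLM judge per criterion).

```python
# Hypothetical sketch: converting dynamically generated evaluation
# criteria into a scalar reward for RL-style optimization.
# All names and weights below are illustrative, not from the paper.

def criteria_reward(text, criteria, judge):
    """Weighted average of per-criterion judge scores, each in [0, 1]."""
    total_weight = sum(w for _, w in criteria)
    return sum(w * judge(text, name) for name, w in criteria) / total_weight

def toy_judge(text, criterion):
    """Stand-in judge; a real pipeline would call an LLM grader here."""
    if criterion == "coherence":
        # Crude proxy: reward complete sentences.
        return 1.0 if "." in text else 0.5
    if criterion == "vividness":
        # Crude proxy: lexical variety, capped at 1.0.
        return min(len(set(text.split())) / 20.0, 1.0)
    return 0.0

# Criteria produced "dynamically" upstream (here hard-coded for the demo).
criteria = [("coherence", 0.6), ("vividness", 0.4)]
draft = "The storm rolled over the harbor. Lanterns flickered against the dark."
reward = criteria_reward(draft, criteria, toy_judge)
```

The scalar `reward` would then feed a policy-gradient update (e.g. PPO-style) on the writer model; the same criteria scores can also drive training-free self-reflection by prompting the model with its lowest-scoring criterion.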




