Most AI agents get one shot. They take a question, run a search or plan, give an answer, and move on. If the answer is wrong, that failure is lost. The agent starts fresh next time with no memory of what went wrong.
Humans do not work this way. We fail, think about why, and try again with a better plan. Between December 2025 and March 2026, three independent research teams at AI2, EPFL, and Tsinghua University arrived at the same idea. Give the agent multiple tries. Make it reflect on each failure. Feed that reflection into the next attempt. All three call the pattern Meta-Reinforcement Learning with Self-Reflection.
Why single-shot agents fall short
Standard RL-trained agents treat each attempt as independent, so they cannot carry lessons from one try to the next. Three problems compound here.
Sparse rewards make learning hard. The agent only gets a signal at the end (right or wrong), so it cannot tell which intermediate steps helped and which hurt. Independent tries mean the agent repeats the same mistakes. And as RL training continues, the agent converges to a fixed behavior and stops exploring new strategies. LaMer showed this with a trajectory diversity analysis: after RL training, agents had much lower entropy in their action patterns than the base model.
Meta-RL with Self-Reflection solves all three. The design is simple. Allow three attempts per problem. After each attempt, the agent writes what went wrong and what to try next. That reflection text goes into the context for the next attempt. During training, the system optimizes cross-episode rewards, so the model learns how to write useful reflections.
The key point is that at test time, there are no weight updates. The agent adapts by adding past episodes and reflection text to its context window. LaMer calls this in-context policy adaptation. It means you do not need online learning after deployment.
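The test-time loop shared by all three frameworks can be sketched in a few lines. This is an illustrative skeleton, not code from any of the papers: `attempt_fn`, `reflect_fn`, and `check_fn` are hypothetical stand-ins for LLM calls and the task's success check.

```python
def solve_with_reflection(problem, attempt_fn, reflect_fn, check_fn,
                          max_attempts=3):
    """Multi-attempt loop with in-context reflection (illustrative).

    attempt_fn(problem, reflections) -> answer   (LLM call stand-in)
    reflect_fn(problem, answer)      -> str      (LLM call stand-in)
    check_fn(answer)                 -> bool     (task success signal)
    """
    reflections = []  # in-context memory carried across attempts; no weight updates
    answer = None
    for _ in range(max_attempts):
        answer = attempt_fn(problem, reflections)
        if check_fn(answer):
            return answer, reflections
        # Failed: ask the model what went wrong and what to try next,
        # then feed that text into the next attempt's context.
        reflections.append(reflect_fn(problem, answer))
    return answer, reflections
```

Because adaptation lives entirely in `reflections`, raising `max_attempts` at test time (as MR-Search does with 5 or 7 attempts) changes nothing about the model itself.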
What three teams found
Three teams tested this pattern in different task domains. Their results show it works across search, games, web tasks, and multi-agent environments.
AI2’s MR-Search targets search QA. Using Qwen2.5-7B, it improved average accuracy across QA benchmarks by 9.3% relative. With a smaller 3B model, the gain reached 19.3%. MR-Search uses turn-level advantage estimation to assign credit to each intermediate step, not just the final answer. It also scales beyond its training setup. Although the model was trained with 3 attempts per problem, performance keeps improving with 5 or 7 attempts at test time. (arXiv:2603.11327)
EPFL’s LaMer works on games and web tasks. Using Qwen3-4B, it improved pass@3 success rates by 11.8 points on Sokoban, 19.3 points on MineSweeper, and 13.9 points on Webshop versus the best RL baseline. One finding stands out. Keeping only reflection text in memory works better than the default setting of keeping both trajectory and reflection. On MineSweeper, reflection-only scored 80.5% versus 74.4% for full history. Reflections are shorter and carry more useful information per token. (arXiv:2512.16848, ICLR 2026)
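The two memory settings LaMer compares amount to a choice about what goes into the next attempt's prompt. A minimal sketch, with an illustrative function name and prompt layout (not taken from the paper's code):

```python
def build_context(task, episodes, mode="reflection_only"):
    """Assemble the prompt for the next attempt (illustrative).

    episodes: list of (trajectory_text, reflection_text) pairs from
    earlier attempts. "full_history" keeps both; "reflection_only"
    keeps just the reflections, which LaMer found works better.
    """
    parts = [task]
    for trajectory, reflection in episodes:
        if mode == "full_history":
            parts.append("Previous attempt:\n" + trajectory)
        parts.append("Reflection:\n" + reflection)
    return "\n\n".join(parts)
```

The design trade-off is token budget versus information: raw trajectories are long and noisy, while reflections compress the lesson into a few sentences.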
Tsinghua’s MAGE extends this to multi-agent settings. It focuses on strategic exploitation, finding and using an opponent’s weaknesses. MAGE reached 100% success rate on Webshop (versus 75.2% for GiGPO) and 67.2% on Tic-Tac-Toe against MCTS-100 (versus 60.2% for LaMer). Against MCTS-1000, a near-perfect opponent in Tic-Tac-Toe, MAGE achieved a 100% draw rate through zero-shot adaptation. (arXiv:2603.03680)
The three frameworks differ in some design choices. MR-Search uses no discount between episodes (gamma=1.0), while LaMer and MAGE use 0.6. MAGE uses differential return, which rewards improvement over the previous episode rather than total score. MAGE’s ablation study showed differential return produces more stable learning than cumulative return. The three papers also use different metrics (Exact Match vs. pass@k), so direct number comparisons between them are not valid.
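The two reward designs above can be made concrete with a small sketch. This is a simplified illustration of the ideas, not any paper's exact objective: `rewards` holds each episode's final score, `gamma=1.0` mirrors MR-Search's no-discount setting, `gamma=0.6` mirrors LaMer and MAGE, and the differential variant rewards improvement over the previous episode as in MAGE.

```python
def cross_episode_return(rewards, gamma=0.6):
    """Discounted sum over episodes: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def differential_returns(rewards):
    """Per-episode signal that rewards improving on the previous episode.

    The first episode is judged on its own score; later episodes on
    their delta, so the model is pushed to make each retry better.
    """
    diffs = [rewards[0]]
    diffs += [r - prev for prev, r in zip(rewards, rewards[1:])]
    return diffs
```

With a discounted cumulative return, a strong first episode can dominate the objective; the differential form keeps pressure on later episodes, which is one plausible reading of why MAGE's ablation found it more stable.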
Caveats
All results come from the authors’ own experiments. Large-scale independent reproduction is still limited. LaMer has been peer-reviewed at ICLR 2026. MR-Search and MAGE are preprints. MR-Search code is expected on March 21, 2026. LaMer and MAGE code is already public.
Base models are small, 4B to 7B parameters. No one has tested this on 70B+ models yet. Training takes about twice as long as standard RL because episodes must be generated one after another, not in parallel. LaMer reported this cost.
Reflection quality is a risk. LLM hallucinations can creep into reflections. A wrong reflection may hurt performance more than no reflection at all. None of the three papers propose a direct fix for this. Context length is another limit. Episodes and reflections pile up fast, and long tasks will lose information.
Conclusion
The shift is from “get it right the first time” to “fail, reflect, and improve.” Three independent teams converged on this pattern at the same time. That convergence itself is a signal. For agent builders, the takeaway is practical. Question the assumption that your agent must finish in one try. Build in room for exploration and reflection. The mechanism is lightweight. No weight updates at test time. Just context.






