DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use
arXiv cs.AI / 3/13/2026
Key Points
- The paper argues that insufficient diversity in synthesized agentic tasks leads to brittle generalization in tool-using LLMs after post-training.
- DIVE inverts the synthesis process by executing diverse, real-world tools first and deriving tasks only from the resulting traces, providing grounding by construction.
- It scales diversity along two axes—tool-pool coverage and per-task toolset variety—and uses an evidence-collection loop to derive richer multi-step tool-use patterns across 373 tools in five domains.
- Empirically, training Qwen3-8B on DIVE data yields a +22-point average gain across 9 out-of-domain benchmarks and a +68-point gain over the strongest 8B baseline; scaling diversity outperforms merely scaling quantity even with 4x less data.
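The inverted pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tool pool, tool names, and task-derivation heuristic here are all hypothetical stand-ins (DIVE uses 373 real-world tools across five domains and an evidence-collection loop).

```python
import random

# Hypothetical tool pool standing in for DIVE's 373 real-world tools.
TOOLS = {
    "weather.lookup": lambda city: f"weather({city})=sunny",
    "unit.convert": lambda x: f"convert({x})=1.6km",
    "calendar.query": lambda day: f"events({day})=[]",
}

def execute_toolset(tool_names, args):
    """Execute a sampled subset of tools FIRST, recording the trace."""
    trace = []
    for name, arg in zip(tool_names, args):
        trace.append({"tool": name, "arg": arg, "result": TOOLS[name](arg)})
    return trace

def derive_task(trace):
    """Invert the usual order: the task is derived FROM the trace,
    so every step is grounded in a real execution by construction."""
    steps = " then ".join(step["tool"] for step in trace)
    return {"instruction": f"Solve a task that requires: {steps}",
            "grounded_trace": trace}

# Diversity scales along two axes: which tools the pool covers
# (tool-pool coverage) and how many distinct tools each task
# combines (per-task toolset variety, controlled by k below).
random.seed(0)
names = random.sample(list(TOOLS), k=2)
task = derive_task(execute_toolset(names, ["Paris", "Mon"]))
print(task["instruction"])
```

The key design point the sketch captures is ordering: because execution precedes task writing, a derived task can never reference a tool call that did not actually succeed.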