WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models

arXiv cs.CL / 4/30/2026

Key Points

  • The paper introduces WebAggregator, a data synthesis pipeline aimed at shifting Deep Research agents from retrieval-heavy, reasoning-light behavior to compositional information aggregation.
  • WebAggregator works in two stages: a Proactive Explorer collects interconnected knowledge, and a Compositional Logic Proposer weaves that knowledge into complex questions using over 12 composition guidelines.
  • The authors build 10K verifiable QA pairs grounded on 50K websites and curate a high-quality SFT dataset from them via rejection sampling.
  • After fine-tuning, the WebAggregator-32B model is reported to surpass GPT-4.1 and match Claude-3.7-Sonnet on GAIA, WebWalkerQA, and XBench; the new WebAggregatorQA testbed suggests that reasoning, not retrieval, is the primary performance bottleneck.
  • The study also addresses a benchmark gap with an evaluation setup that jointly stresses retrieval and reasoning, finding that top-tier models still underperform even with perfect retrieval.

Abstract

The hallmark of Deep Research agents lies in compositional reasoning, the capacity to aggregate distributed, heterogeneous information into coherent logical insights. However, current agentic systems are often retrieval-heavy but reasoning-light, where success is predominantly determined by simple entity-seeking rather than the multi-step aggregation of scattered evidence. To address this, we propose WebAggregator, a data synthesis pipeline designed to shift the agentic paradigm from retrieval-centric to compositional aggregation. Our approach first employs Proactive Explorer to collect interconnected knowledge, then Compositional Logic Proposer to weave knowledge into complex questions using over 12 composition guidelines derived from a rigorous deconstruction of the Deep Research problem setting. By leveraging 10K verifiable QA pairs grounded on 50K websites, we curate a high-quality SFT dataset via rejection sampling. Fine-tuning on this corpus fundamentally transforms agent behavior, fostering deliberate compositional reasoning and reduced tool redundancy. The resulting WebAggregator-32B surpasses GPT-4.1 and matches Claude-3.7-Sonnet on GAIA, WebWalkerQA, and XBench. To address the lack of benchmarks that emphasize both reasoning and retrieval, we introduce the WebAggregatorQA testbed, which reveals that even with perfect retrieval, top-tier models still underperform. These results demonstrate that compositional reasoning, not retrieval, is the true performance ceiling for next-generation research agents.
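
The abstract mentions curating the SFT dataset via rejection sampling but gives no implementation details. The Python sketch below illustrates the general idea only: sample several agent trajectories per question and keep those whose final answer matches the verified reference answer. The `Trajectory` structure, the `normalize` helper, the exact-match check, and the preference for shorter trajectories are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One sampled agent attempt at a question (hypothetical structure)."""
    question: str
    steps: list[str]      # tool calls and reasoning steps, in order
    final_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def rejection_sample(
    candidates: list[Trajectory],
    reference_answer: str,
    max_keep: int = 1,
) -> list[Trajectory]:
    """Return up to max_keep trajectories whose final answer matches the reference.

    Assumption: among correct trajectories, shorter ones are preferred.
    This tie-break is an illustrative choice, not something the paper states.
    """
    target = normalize(reference_answer)
    correct = [t for t in candidates if normalize(t.final_answer) == target]
    correct.sort(key=lambda t: len(t.steps))  # stable sort, fewest steps first
    return correct[:max_keep]
```

In a real pipeline the exact-match check would likely be replaced by a more tolerant answer verifier; it is used here only to keep the sketch self-contained.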