Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling

arXiv cs.AI / 5/5/2026


Key Points

  • The paper analyzes several test-time inference scaling strategies for LLMs—self-consistency, self-refinement, multi-agent debate, and mixture-of-agents—focusing on compute-cost tradeoffs rather than only accuracy.
  • Experiments across two reasoning benchmarks (MMLU-Pro and BBH) and 34 configurations evaluate how changing parallel samples, number of agents, and debate rounds affects performance under different model sizes.
  • Using Pareto-optimal analysis, the authors identify methods that deliver the best accuracy for the lowest computational budget, showing that scaling can improve accuracy by up to +7.1 percentage points over chain-of-thought at the highest tested budgets (20× compute).
  • Under equal compute budgets, multi-agent debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points, respectively, and the benefits of multi-agent approaches persist longer on harder tasks.
  • The study proposes a practical design guideline: mixture-of-agents tends to be most efficient when the number of parallel generations is larger than the number of sequential aggregations.
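The Pareto-optimal selection described above can be sketched in a few lines: sort configurations by compute cost and keep only those that strictly improve accuracy over every cheaper option. The accuracy/compute numbers below are illustrative placeholders, not results from the paper:

```python
def pareto_front(configs):
    """Return configurations not dominated on (compute cost, accuracy).

    A config is kept only if no cheaper (or equally cheap) config
    achieves accuracy at least as high.
    """
    # Sort by ascending compute, breaking ties by descending accuracy.
    ordered = sorted(configs, key=lambda c: (c["compute"], -c["accuracy"]))
    front, best_acc = [], float("-inf")
    for cfg in ordered:
        if cfg["accuracy"] > best_acc:  # strictly beats everything cheaper
            front.append(cfg)
            best_acc = cfg["accuracy"]
    return front

# Hypothetical (compute multiplier, accuracy) points for illustration only.
configs = [
    {"name": "CoT",               "compute": 1,  "accuracy": 0.62},
    {"name": "self-consistency",  "compute": 5,  "accuracy": 0.66},
    {"name": "debate",            "compute": 10, "accuracy": 0.67},
    {"name": "self-refine",       "compute": 10, "accuracy": 0.65},
    {"name": "mixture-of-agents", "compute": 20, "accuracy": 0.69},
]
print([c["name"] for c in pareto_front(configs)])
# → ['CoT', 'self-consistency', 'debate', 'mixture-of-agents']
```

Here `self-refine` is dropped because `debate` reaches higher accuracy at the same compute budget, which is exactly the kind of dominated configuration the paper's analysis filters out.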

Abstract

Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage, yet computational efficiency is key for real-world applications with resource constraints. We provide a systematic analysis of four inference scaling strategies (self-consistency, self-refinement, multi-agent debate, and mixture-of-agents) to study their compute-performance tradeoffs. We evaluate these methods on two reasoning benchmarks (MMLU-Pro, BBH) with extensive parameter configurations (e.g., scaling the number of parallel predictions, agents, and debate rounds) across different model sizes. Across 34 configurations and over 100 evaluations, we compute the Pareto-optimal front to select methods that achieve the best accuracy at the lowest computational budget. Notably, inference scaling improves accuracy by up to 7.1 percentage points over chain-of-thought at the highest evaluated budgets (20× the CoT compute budget) on MMLU-Pro. Under an equal compute budget, debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points, respectively. While self-consistency saturates earlier, multi-agent gains persist, particularly on harder tasks. We identify a simple multi-agent design guideline: mixture-of-agents is most efficient when the number of parallel generations exceeds the number of sequential aggregations.
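The parallel-versus-sequential tradeoff behind the design guideline can be made concrete with a minimal mixture-of-agents sketch. This is a generic formulation of the pattern, not necessarily the paper's exact setup; `generate` stands in for any LLM call:

```python
def moa_answer(question, generate, n_parallel=4, n_layers=2):
    """Mixture-of-agents sketch: n_parallel proposers per layer,
    n_layers sequential rounds, plus one final aggregator call.

    Total LLM calls = n_parallel * n_layers + 1, so compute grows
    with both knobs; at a fixed budget the guideline favors making
    n_parallel larger than the number of sequential aggregations.
    """
    # Layer 1: independent parallel proposals.
    answers = [generate(question) for _ in range(n_parallel)]
    for _ in range(n_layers - 1):
        # Each later-layer agent sees the question plus all prior answers.
        context = question + "\nPrevious answers: " + "; ".join(answers)
        answers = [generate(context) for _ in range(n_parallel)]
    # Final aggregator synthesizes a single answer.
    return generate(question + "\nSynthesize a final answer: " + "; ".join(answers))
```

With `n_parallel=3` and `n_layers=2`, this issues 3 × 2 + 1 = 7 model calls, which is the kind of compute accounting the paper's equal-budget comparisons rely on.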