Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

arXiv cs.CL / 4/9/2026


Key Points

  • The study compares six inference-time reasoning paradigms for LLM agents (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) across four frontier models and ten benchmarks, finding that some paradigms improve performance on certain tasks while others significantly degrade it.
  • Results show no universally best reasoning paradigm (e.g., ReAct improves GAIA by 44pp over Direct, while CoT drops HumanEval by 15pp), highlighting strong task-dependent complementarity.
  • An “oracle per-task selection” approach achieves an average improvement of 17.1pp over the best single fixed paradigm, indicating that choosing the right paradigm per task is crucial.
  • The paper proposes “select-then-solve,” where a lightweight embedding-based router selects the best paradigm for each task; across four models it raises average accuracy from 47.6% to 53.1% and recovers up to 37% of the oracle gap, outperforming the best fixed paradigm by 2.8pp.
  • The authors find that zero-shot self-routing (asking the model to pick its own paradigm) is unreliable: it works only for GPT-5 (67.1%) and fails for weaker models, and even then trails the learned router, strengthening the case for learned per-task paradigm selection.

Abstract

When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.
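The paper does not include an implementation, but the router it describes (embed the task, then select a paradigm based on similarity to tasks where each paradigm worked best) can be sketched as a nearest-neighbour classifier. Everything below is illustrative: the hashed bag-of-words `embed` is a dependency-free stand-in for a real sentence-embedding model, and the class/function names are my own, not the authors'.

```python
import hashlib
import math

# The six inference-time paradigms compared in the paper.
PARADIGMS = ["Direct", "CoT", "ReAct", "Plan-Execute", "Reflection", "ReCode"]

def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedding: L2-normalized hashed bag-of-words.
    A real router would use a sentence-embedding model instead."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class ParadigmRouter:
    """1-nearest-neighbour router: route each new task to the paradigm that
    performed best on the most similar labeled training task (a simplifying
    assumption about the paper's router, which is only described as a
    lightweight embedding-based selector)."""

    def __init__(self, labeled_tasks: list[tuple[str, str]]):
        # labeled_tasks: (task text, best-performing paradigm) pairs,
        # e.g. derived from per-task benchmark results.
        self.examples = [(embed(text), paradigm) for text, paradigm in labeled_tasks]

    def route(self, task: str) -> str:
        query = embed(task)
        _, best_paradigm = max(self.examples, key=lambda ex: cosine(query, ex[0]))
        return best_paradigm
```

A usage sketch with hypothetical training labels, reflecting the paper's finding that tool-use tasks favor ReAct while straightforward coding favors Direct:

```python
router = ParadigmRouter([
    ("browse the web to find the release date of a paper", "ReAct"),
    ("implement a python function that reverses a linked list", "Direct"),
    ("solve this multi-step math word problem step by step", "CoT"),
])
paradigm = router.route("search the web for the author of this blog post")
```

The select-then-solve step then simply wraps the chosen paradigm's prompt scaffold around the task before calling the model.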