Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

arXiv cs.CL / 4/9/2026


Key Points

  • The study compares six inference-time reasoning paradigms for LLM agents (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) across four frontier models and ten benchmarks, finding that some paradigms improve performance on certain tasks while others significantly degrade it.
  • Results show no universally best reasoning paradigm (e.g., ReAct improves GAIA by 44pp over Direct, while CoT drops HumanEval by 15pp), highlighting strong task-dependent complementarity.
  • An “oracle per-task selection” approach achieves an average improvement of 17.1pp over the best single fixed paradigm, indicating that choosing the right paradigm per task is crucial.
  • The paper proposes “select-then-solve,” where a lightweight embedding-based router selects the best paradigm for each task; across four models it raises average accuracy from 47.6% to 53.1% and recovers up to 37% of the oracle gap, outperforming the best fixed paradigm by 2.8pp.
  • The authors find that zero-shot self-routing (asking the model to pick its own paradigm) is unreliable: it works only for GPT-5 (67.1%) and fails for weaker models, and even then trails the learned router, strengthening the case for learned per-task paradigm selection.

Abstract

When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.
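The paper does not include an implementation, but the router it describes (embed the task, then select a paradigm based on similarity to tasks where each paradigm worked best) can be sketched as a nearest-neighbour classifier. Everything below is illustrative: the hashed bag-of-words `embed` is a dependency-free stand-in for a real sentence-embedding model, and the class/function names are my own, not the authors'.

```python
import hashlib
import math

# The six inference-time paradigms compared in the paper.
PARADIGMS = ["Direct", "CoT", "ReAct", "Plan-Execute", "Reflection", "ReCode"]

def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in embedding: L2-normalized hashed bag-of-words.
    A real router would use a sentence-embedding model instead."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class ParadigmRouter:
    """1-nearest-neighbour router: route each new task to the paradigm that
    performed best on the most similar labeled training task (a simplifying
    assumption about the paper's router, which is only described as a
    lightweight embedding-based selector)."""

    def __init__(self, labeled_tasks: list[tuple[str, str]]):
        # labeled_tasks: (task text, best-performing paradigm) pairs,
        # e.g. derived from per-task benchmark results.
        self.examples = [(embed(text), paradigm) for text, paradigm in labeled_tasks]

    def route(self, task: str) -> str:
        query = embed(task)
        _, best_paradigm = max(self.examples, key=lambda ex: cosine(query, ex[0]))
        return best_paradigm
```

A usage sketch with hypothetical training labels, reflecting the paper's finding that tool-use tasks favor ReAct while straightforward coding favors Direct:

```python
router = ParadigmRouter([
    ("browse the web to find the release date of a paper", "ReAct"),
    ("implement a python function that reverses a linked list", "Direct"),
    ("solve this multi-step math word problem step by step", "CoT"),
])
paradigm = router.route("search the web for the author of this blog post")
```

The select-then-solve step then simply wraps the chosen paradigm's prompt scaffold around the task before calling the model.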