Reasoning Topology Matters: Network-of-Thought for Complex Reasoning Tasks

arXiv cs.CL / 3/24/2026


Key Points

  • The paper argues that existing LLM prompting structures (Chain-of-Thought and Tree-of-Thought) are limited for complex reasoning that requires merging, revisiting, and integrating evidence, and proposes a new framework called Network-of-Thought (NoT).
  • NoT represents reasoning as a directed graph with typed nodes and edges, using a heuristic-based controller policy to guide graph-based search and intermediate reuse.
  • Experiments across GSM8K, Game of 24, HotpotQA, and ProofWriter on three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct) show NoT can outperform ToT on multi-hop reasoning (e.g., HotpotQA) and sometimes achieves the best overall accuracy depending on the model.
  • The study finds that LLM-generated controller heuristics can outperform fixed or random strategies, and that NoT’s performance depends on the computation–accuracy tradeoff.
  • It also reports that evaluation methodology affects rankings substantially: string-match metrics underestimate methods (especially NoT) on open-ended QA, with reported gaps of about 14–18 percentage points on HotpotQA.
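The graph-with-controller idea in the bullets above can be illustrated with a minimal sketch. This is not the authors' implementation; the `Node`, `ReasoningGraph`, and `controller_pick` names, the node/edge type strings, and the single-weight scoring are all hypothetical, standing in for NoT's typed nodes and edges and its heuristic controller policy.

```python
# Illustrative sketch of a NoT-style reasoning graph: typed nodes and edges,
# with a heuristic controller choosing which node to expand next.
# All names and fields here are assumptions, not the paper's actual API.
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    kind: str            # hypothetical node type, e.g. "hypothesis", "evidence", "merge"
    content: str
    score: float = 0.0   # heuristic value the controller assigns (e.g. uncertainty)

@dataclass
class ReasoningGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (src_id, dst_id, edge_type)

    def add_node(self, node: Node) -> None:
        self.nodes[node.id] = node

    def add_edge(self, src: int, dst: int, edge_type: str) -> None:
        # Typed edges let a "merge" node integrate several intermediate results.
        self.edges.append((src, dst, edge_type))

def controller_pick(graph: ReasoningGraph, w_uncertainty: float = 1.0) -> Node:
    # Heuristic controller policy: expand the highest-weighted frontier node.
    # An "uncertainty-only" weighting, as in the paper's ProofWriter result,
    # would correspond to using a single uncertainty term like this one.
    frontier = [n for n in graph.nodes.values() if n.kind != "merge"]
    return max(frontier, key=lambda n: w_uncertainty * n.score)
```

Unlike a tree, nothing here forbids two nodes feeding one successor, which is the structural property the paper argues multi-hop tasks like HotpotQA need.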

Abstract

Existing prompting paradigms structure LLM reasoning in limited topologies: Chain-of-Thought (CoT) produces linear traces, while Tree-of-Thought (ToT) performs branching search. Yet complex reasoning often requires merging intermediate results, revisiting hypotheses, and integrating evidence from multiple sources. We propose Network-of-Thought (NoT), a framework that models reasoning as a directed graph with typed nodes and edges, guided by a heuristic-based controller policy. Across four benchmarks (GSM8K, Game of 24, HotpotQA, ProofWriter) and three models (GPT-4o-mini, Llama-3.3-70B-Instruct, Qwen2.5-72B-Instruct), we investigate when network topology outperforms chain or tree structures, whether LLM-generated heuristics can guide graph-based reasoning search, and the computation–accuracy tradeoff across topologies, evaluating each method on accuracy, topology simplicity, and token efficiency. Our results show that CoT remains effective for sequential tasks with GPT-4o-mini (89.5% on GSM8K), while NoT surpasses ToT on multi-hop reasoning (91.0% vs. 88.0% on HotpotQA with LLM-as-Judge). With 72B open-source models, NoT achieves the highest accuracy on GSM8K (91.5%), and Qwen2.5-72B achieves the best multi-hop QA result overall (91.7% on HotpotQA). Self-generated controller heuristics outperform fixed and random strategies on logical reasoning, with uncertainty-only weighting achieving 57.0% on ProofWriter. We also find that evaluation methodology significantly impacts method rankings: string-match underestimates all methods on open-ended QA, with the largest gap for NoT, a pattern consistent across all three models (14–18 percentage point gap on HotpotQA).
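The evaluation-methodology finding is easy to see concretely: a normalized exact-match metric rejects answers that differ from the gold string in any substantive token, even when semantically correct, which is why an LLM-as-Judge score can run 14–18 points higher. A minimal sketch of such a string-match metric (the `normalize`/`exact_match` names and the exact normalization steps are assumptions, in the spirit of common QA metrics, not the paper's evaluation code):

```python
# Sketch of a normalized string-match metric of the kind the paper says
# underestimates open-ended QA performance. Normalization choices here
# (lowercasing, punctuation stripping, whitespace collapsing) are assumed.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    # Replace punctuation with spaces, then collapse whitespace.
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)
```

Under this metric, "Eiffel Tower" matches the gold answer "eiffel tower!", but a correct yet fuller answer like "Paris, France" fails against the gold "Paris"; a judge model would typically accept both, producing the kind of metric-dependent ranking gap the paper reports.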