Don't Make the LLM Read the Graph: Make the Graph Think

arXiv cs.AI / April 28, 2026

📰 News · Models & Research

Key Points

  • The study finds that whether explicit belief graphs help LLMs in cooperative multi-agent reasoning depends strongly on the integration architecture and model strength.
  • In Hanabi with controlled trials across four LLM families, belief graphs are mostly decorative for strong models when used as prompt context, but they become crucial when they gate action selection via ranked shortlists.
  • The research identifies a failure mode called “Planner Defiance,” where some model families override correct planner recommendations at partial competence: Gemini models show near-zero defiance, while Llama 70B overrides 90% of the time.
  • Full-game experiments show that inter-agent conventions plus properly combined belief-graph components outperform single-agent interventions, and preliminary scaling results suggest shallow graphs may offer the best cost-benefit while deeper ToM graphs can degrade performance at larger player counts.
  • Overall, the paper argues for shifting from “making the LLM read the graph” to “making the graph think,” by using graph structure to drive decision-making rather than just providing information.
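To make the contrast between the two integration architectures concrete, here is a minimal sketch of graph-gated action selection via a ranked shortlist. All names (`BeliefGraph`, `rank_actions`, `gated_choice`) and the toy card encoding are illustrative assumptions, not the paper's actual implementation; the point is only that the graph constrains which actions the LLM may pick, rather than being appended as prompt context.

```python
# Hypothetical sketch of "gating" integration: a belief graph ranks actions,
# and the LLM may only choose from the top-k shortlist. Names and the toy
# Hanabi-like card encoding are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass
class BeliefGraph:
    # card slot -> set of identities still consistent with observed hints
    plausible: dict = field(default_factory=dict)

    def is_safe_play(self, slot, playable):
        # A play is provably safe only if every plausible identity is playable.
        ids = self.plausible.get(slot, set())
        return bool(ids) and ids <= playable

def rank_actions(graph, slots, playable):
    """Planner: rank candidate plays, provably safe ones first."""
    return sorted(slots, key=lambda s: not graph.is_safe_play(s, playable))

def gated_choice(graph, slots, playable, llm_pick, k=1):
    """Gating: the LLM's pick is accepted only if it is on the shortlist;
    out-of-shortlist picks fall back to the planner's top recommendation."""
    shortlist = rank_actions(graph, slots, playable)[:k]
    pick = llm_pick(shortlist)
    return pick if pick in shortlist else shortlist[0]

# Toy usage: slot 1 is the only provably safe play ("G1" is playable),
# slot 0 might be the unplayable "B3".
g = BeliefGraph(plausible={0: {"R1", "B3"}, 1: {"G1"}})
playable = {"G1", "R1"}
assert rank_actions(g, [0, 1], playable)[0] == 1
# A defiant pick outside the shortlist is overridden by the planner:
assert gated_choice(g, [0, 1], playable, llm_pick=lambda s: 0) == 1
```

The same components used as prompt context would simply serialize `graph.plausible` into the prompt and leave the final choice unconstrained, which is the mode the paper finds to be decorative for strong models.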

Abstract

We investigate whether explicit belief graphs improve LLM performance in cooperative multi-agent reasoning. Through 3,000+ controlled trials across four LLM families in the cooperative card game Hanabi, we establish four findings. First, integration architecture determines whether belief graphs provide value: as prompt context, graphs are decorative for strong models and beneficial only for weak models on 2nd-order Theory of Mind (80% vs 10%, p<0.0001, OR=36.0); when graphs gate action selection through ranked shortlists, they become structurally essential even for strong models (100% vs 20% on 2nd-order ToM, p<0.001). Second, we identify "Planner Defiance," a model-family-specific failure where LLMs override correct planner recommendations at partial competence (90% override, replicated N=20); Gemini models show near-zero defiance while Llama 70B shows 90%, and models distinguish factual context (deferred to) from advisory recommendations (overridden). Third, full-game evidence confirms inter-agent conventions (+128% over baseline, p=0.003) outperform all single-agent interventions, and individual belief-graph components must be combined to produce gains. Fourth, preliminary scaling analysis (N=10/cell, exploratory) suggests graph depth has diminishing returns: shallow graphs provide the best cost-benefit ratio, while deeper ToM graphs appear harmful at larger player counts (-1.5 pts at 5-player, p=0.029).