ItinBench: Benchmarking Planning Across Multiple Cognitive Dimensions with Large Language Models
arXiv cs.AI / 3/23/2026
Key Points
- ItinBench introduces a benchmark that combines spatial reasoning tasks, specifically route optimization, with trip itinerary planning to evaluate LLMs across multiple cognitive dimensions.
- It evaluates several LLMs, including Llama 3.1 8B, Mistral Large, Gemini 1.5 Pro, and the GPT family, and finds that models struggle to maintain high, consistent performance when handling concurrent cognitive tasks.
- By incorporating tasks from distinct human-level cognitive domains, ItinBench provides new insights into building more comprehensive reasoning testbeds that better reflect real-world challenges.
- The project offers code and dataset at https://ethanwtl.github.io/IBweb/ to support reproducibility and further research.
Related Articles
How political censorship actually works inside Qwen, DeepSeek, GLM, and Yi: Ablation and behavioral results across 9 models
Reddit r/LocalLLaMA
Prompt Engineering: Why the Way You Ask Changes Everything (An Introductory Guide)
Dev.to
The Obligor
Dev.to
The Markup
Dev.to
The Complete 2026 Guide to Monetizing an AI Blog: From Your First Post to $1,000 in Monthly Income
Dev.to