Generalization in LLM Problem Solving: The Case of the Shortest Path

arXiv cs.AI / 4/17/2026


Key Points

  • The paper introduces a controlled synthetic benchmark using shortest-path planning to study whether LLMs can systematically generalize.
  • It separates multiple confounding factors—training data, training paradigms, and inference-time strategies—and evaluates two generalization axes: spatial transfer to unseen maps and length scaling to longer horizons.
  • Results show strong spatial transfer to new maps, but persistent failures when problem lengths increase, attributed to recursive instability.
  • The authors analyze the learning pipeline and find that data coverage limits overall capability, reinforcement learning mainly improves training stability without extending capability, and inference-time scaling boosts performance but cannot fix length-scaling failures.
  • The study suggests that some generalization failures are structural (e.g., instability under recursion) rather than simply improvable by better inference-time tactics.

Abstract

Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
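To make the setup concrete, a controlled shortest-path environment of the kind described can be generated programmatically: sample a map, compute the ground-truth path with breadth-first search, and vary either the map (spatial transfer) or the path length (length scaling) while holding everything else fixed. The sketch below is a minimal illustration under assumed details (4-connected grids, random walls, BFS), not the paper's actual benchmark code; `random_grid` and `shortest_path` are hypothetical helper names.

```python
from collections import deque
import random


def random_grid(width, height, wall_prob, seed):
    """Sample a width x height grid; True marks an impassable wall cell."""
    rng = random.Random(seed)
    return [[rng.random() < wall_prob for _ in range(width)]
            for _ in range(height)]


def shortest_path(grid, start, goal):
    """BFS on a 4-connected grid; returns the cell sequence from start
    to goal (inclusive), or None if the goal is unreachable."""
    h, w = len(grid), len(grid[0])
    prev = {start: None}          # visited set doubling as parent pointers
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:          # reconstruct path by walking parents back
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and not grid[nr][nc] \
                    and (nr, nc) not in prev:
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None                   # goal not reachable from start
```

With such a generator, the two axes the paper studies fall out naturally: holding path lengths in the training range while sampling fresh seeds probes transfer to unseen maps, whereas evaluating on instances whose shortest path exceeds any seen during training probes length scaling.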