LLM-Flax : Generalizable Robotic Task Planning via Neuro-Symbolic Approaches with Large Language Models

arXiv cs.RO / April 30, 2026


Key Points

  • The paper introduces LLM-Flax, a three-stage neuro-symbolic robotic task planning framework that removes manual rule authoring and training-data requirements by using a locally hosted LLM with only a PDDL domain file.
  • Stage 1 uses structured prompting with format validation and self-correction to automatically generate relaxation and complementary rules.
  • Stage 2 adds LLM-guided failure recovery under a feasibility-gated budget policy that reserves API latency cost before each call to avoid starving downstream fallback mechanisms.
  • Stage 3 replaces a domain-trained GNN object scorer with zero-shot LLM object importance scoring, eliminating the need for any training data.
  • Across MazeNamo benchmarks (10x10 to 15x15), LLM-Flax achieves a higher average success rate (SR 0.945 vs 0.828 for the manual baseline) and handles cases where the manual planner fails entirely, though Stage 3's scalability is limited by context-window constraints.
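
The Stage 2 budget policy above can be sketched as a simple gate: before each LLM call, the planner checks whether making the call would still leave enough time for the downstream relaxation fallback. This is a minimal illustrative sketch, not the paper's implementation; `RELAXATION_COST`, `llm_recover`, and `relaxation_fallback` are hypothetical names.

```python
import time

# Assumed cost (in seconds) that the downstream relaxation fallback needs.
# This constant and all function names are illustrative, not from the paper.
RELAXATION_COST = 2.0

def plan_with_recovery(failure, budget_s, est_llm_latency_s,
                       llm_recover, relaxation_fallback):
    """Try LLM-guided recovery only if its estimated latency can be
    reserved without starving the relaxation fallback's budget."""
    # Feasibility gate: reserve the LLM call's latency cost up front.
    if budget_s - est_llm_latency_s >= RELAXATION_COST:
        start = time.monotonic()
        plan = llm_recover(failure)
        budget_s -= time.monotonic() - start
        if plan is not None:
            return plan
    # The fallback always has its reserved budget available.
    return relaxation_fallback(failure)
```

With a generous budget the gate admits the LLM call; with a tight one it routes straight to the relaxation fallback, which is the "no starvation" property the key point describes.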

Abstract

Deploying a neuro-symbolic task planner on a new domain today requires significant manual effort: a domain expert must author relaxation and complementary rules, and hundreds of training problems must be solved to supervise a Graph Neural Network (GNN) object scorer. We propose LLM-Flax, a three-stage framework that eliminates all three sources of manual effort using a locally hosted LLM given only a PDDL domain file. Stage 1 automatically generates relaxation and complementary rules via structured prompting with format validation and self-correction. Stage 2 introduces LLM-guided failure recovery with a feasibility-gated budget policy that explicitly reserves API latency cost before each LLM call, preventing the downstream relaxation fallback from being starved. Stage 3 replaces the domain-trained GNN entirely with zero-shot LLM object importance scoring, requiring no training data. We evaluate all three stages on the MazeNamo benchmark across 10x10, 12x12, and 15x15 grids (8 benchmarks total). LLM-Flax achieves average SR 0.945 versus the manual baseline's 0.828 (+0.117), matching or outperforming manual rules on every one of the eight benchmarks. On 12x12 Expert, LLM-Flax attains SR 0.733 where the manual planner fails entirely (SR 0.000); on 15x15 Hard, it achieves SR 1.000 versus Manual's 0.900. Stage 3 demonstrates feasibility (SR 0.720 on 12x12 Hard with no training data) but faces a context-window bottleneck at scale, pointing to the primary open challenge for future work.
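
Stage 1's "format validation and self-correction" loop can be pictured as follows. This is a hedged sketch under assumed names (`ask_llm`, `RULE_PATTERN`, the prompt wording); the paper's actual validator and prompts are not specified here, and a real check would parse the PDDL rather than pattern-match it.

```python
import re

# Crude stand-in for a PDDL format validator: require at least one
# well-formed (:action ...) header. Illustrative only.
RULE_PATTERN = re.compile(r"^\(:action\s+\w+", re.MULTILINE)

def generate_rules(ask_llm, domain_pddl, max_retries=3):
    """Prompt the LLM for relaxation/complementary rules; on a format
    violation, feed the offending output back and ask for a correction."""
    prompt = f"Given this PDDL domain, emit relaxation rules:\n{domain_pddl}"
    for _ in range(max_retries):
        reply = ask_llm(prompt)
        if RULE_PATTERN.search(reply):   # format validation passed
            return reply
        # Self-correction: append the failed output so the model can fix it.
        prompt += f"\nYour previous output was not valid PDDL:\n{reply}\nFix it."
    raise ValueError("LLM output failed format validation after retries")
```

The loop retries only on validation failure, so a well-formed first response incurs a single LLM call.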