Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue

arXiv cs.CL / 4/28/2026

Key Points

  • The paper addresses cross-domain task-oriented dialogue where an agent must reason about implicit/explicit feasibility constraints while planning long-horizon, multi-turn actions.
  • It argues that simply combining LLMs with reinforcement learning (RL) is brittle because unverified LLM outputs can corrupt state representations and mislead policy learning.
  • To fix this, it proposes VLK-RL, a hybrid framework that elicits candidate constraints with an LLM and then verifies them using a dual-role cross-examination procedure to reduce hallucinations and inconsistencies.
  • Verified constraints are converted into ontology-aligned slot-value representations, enabling RL to optimize with a structured, constraint-aware state.
  • Experiments on multiple benchmarks show VLK-RL improves generalization and robustness and outperforms strong single-model baselines on long-horizon tasks.

Abstract

Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
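The elicit-verify-structure pipeline described in the abstract can be sketched as follows. This is a minimal illustration under assumptions: the slot names, the keyword-matching stand-ins for the LLM "proposer" and "examiner" roles, and all function names are hypothetical, not the paper's actual API or prompting scheme.

```python
# Hypothetical sketch of the VLK-RL constraint pipeline: an LLM proposes
# candidate constraints, a dual-role cross-examination step filters them,
# and survivors become an ontology-aligned slot-value state for the RL
# policy. All slot names and functions here are illustrative assumptions.

ONTOLOGY = {"hotel-area", "hotel-price", "train-day"}  # assumed slot names

def propose_constraints(dialogue):
    """Stand-in for the LLM 'proposer' role: extract candidate
    slot-value constraints from raw dialogue text."""
    candidates = []
    if "cheap" in dialogue:
        candidates.append(("hotel-price", "cheap"))
    if "north" in dialogue:
        candidates.append(("hotel-area", "north"))
    # Simulate a hallucinated constraint the user never stated:
    candidates.append(("train-day", "sunday"))
    return candidates

def cross_examine(dialogue, slot, value):
    """Stand-in for the LLM 'examiner' role: keep a candidate only if
    it names a known ontology slot and is grounded in the dialogue."""
    return slot in ONTOLOGY and value in dialogue

def build_state(dialogue):
    """Verify candidates and map survivors into a slot-value dict that
    the RL policy consumes as a structured, constraint-aware state."""
    return {
        slot: value
        for slot, value in propose_constraints(dialogue)
        if cross_examine(dialogue, slot, value)
    }

state = build_state("i need a cheap hotel in the north")
print(state)  # the hallucinated 'train-day' constraint is filtered out
```

The point of the sketch is the separation of concerns: free-form LLM output never reaches the policy directly; only constraints that survive cross-examination and align with the ontology enter the RL state.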