PolicyLLM: Towards Excellent Comprehension of Public Policy for Large Language Models

arXiv cs.CL / April 15, 2026


Key Points

  • The paper introduces PolicyBench, a large-scale cross-system US–China benchmark (21K cases) designed to evaluate how well large language models comprehend and reason about public-policy content.
  • It assesses three policy-related capabilities—memorization, understanding, and application—grounded in Bloom’s taxonomy to cover both knowledge recall and real-world scenario reasoning.
  • The work proposes PolicyMoE, a domain-specialized Mixture-of-Experts model with expert modules aligned to the different cognitive levels tested by the benchmark.
  • Results show LLMs perform better on application-oriented policy tasks than on pure memorization or conceptual understanding, with the strongest accuracy on structured reasoning tasks.
  • The authors identify current limitations in policy understanding and outline directions for building more reliable, policy-focused LLM systems.

Abstract

Large Language Models (LLMs) are increasingly integrated into real-world decision-making, including in the domain of public policy. Yet, their ability to comprehend and reason about policy-related content remains underexplored. To fill this gap, we present PolicyBench, the first large-scale cross-system benchmark (US–China) evaluating policy comprehension, comprising 21K cases across a broad spectrum of policy areas and capturing the diversity and complexity of real-world governance. Following Bloom's taxonomy, the benchmark assesses three core capabilities: (1) Memorization: factual recall of policy knowledge, (2) Understanding: conceptual and contextual reasoning, and (3) Application: problem-solving in real-life policy scenarios. Building on this benchmark, we further propose PolicyMoE, a domain-specialized Mixture-of-Experts (MoE) model with expert modules aligned to each cognitive level. The proposed model demonstrates stronger performance on application-oriented policy tasks than on memorization or conceptual understanding, and yields the highest accuracy on structured reasoning tasks. Our results reveal key limitations of current LLMs in policy understanding and suggest paths toward more reliable, policy-focused models.
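The core idea behind a Mixture-of-Experts model like PolicyMoE is a learned router that gates inputs to specialized expert modules and blends their outputs. The toy sketch below illustrates that gating mechanism only; the expert names (mirroring Bloom's three levels) and the stand-in expert functions are hypothetical, not the paper's implementation.

```python
import math

# Stand-in experts, one per cognitive level tested by PolicyBench.
# In a real MoE these would be neural sub-networks, not simple lambdas.
EXPERTS = {
    "memorization":  lambda x: [2.0 * v for v in x],
    "understanding": lambda x: [v + 1.0 for v in x],
    "application":   lambda x: [v * v for v in x],
}

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_logits):
    """Soft MoE forward pass: weight each expert's output by the
    router's softmax probability and sum the results."""
    weights = softmax(router_logits)
    expert_outputs = [fn(x) for fn in EXPERTS.values()]
    return [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(len(x))
    ]
```

For example, a router strongly favoring the first expert makes the mixture collapse to that expert's output: `moe_forward([1.0, 2.0], [100.0, 0.0, 0.0])` is approximately `[2.0, 4.0]`. Production MoE layers typically use sparse top-k routing rather than this dense soft mixture, activating only a subset of experts per token.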