CresOWLve: Benchmarking Creative Problem-Solving Over Real-World Knowledge

arXiv cs.CL / 4/7/2026


Key Points

  • The paper introduces CresOWLve, a new benchmark designed to measure creative problem-solving by using puzzles grounded in real-world knowledge rather than contrived brainteasers.
  • CresOWLve aims to better reflect real creative workflows by requiring multiple cognitive strategies, cross-domain knowledge retrieval, and the creative recombination of facts.
  • Experiments on several frontier “thinking” and “non-thinking” LLMs show that the benchmark remains highly challenging overall.
  • Results indicate a consistent performance gap: models answer factual questions substantially better than creative ones, with accuracy dropping by up to about 17% on the creative subset.
  • The analysis suggests that while models can often retrieve relevant information, they struggle to make the non-obvious connections needed to integrate knowledge and produce correct creative solutions.

Abstract

Creative problem-solving requires combining multiple cognitive abilities, including logical reasoning, lateral thinking, analogy-making, and commonsense knowledge, to discover insights that connect seemingly unrelated pieces of information. However, most existing benchmarks for large language models (LLMs) evaluate only specific components of this process. Moreover, many creativity-oriented benchmarks rely on artificially constructed brainteasers or contrived scenarios that do not reflect how creative problem-solving occurs in real-world settings. To address this gap, we introduce CresOWLve, a benchmark for evaluating creative problem-solving using puzzles grounded in real-world knowledge. Problems in CresOWLve require employing multiple creative thinking strategies, retrieving facts from diverse domains, and creatively combining them to arrive at a solution. Evaluating several frontier non-thinking and thinking LLMs, we show that CresOWLve remains highly challenging. Our analysis reveals a consistent performance gap: models perform substantially better on factual questions than on creative ones (a drop of up to 17%). While models can often retrieve the relevant knowledge, they struggle to form the non-obvious creative connections required to integrate this information and arrive at the correct answer.
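
To make the reported factual-vs-creative gap concrete, the sketch below shows one way such a per-model accuracy gap could be tallied. It is a minimal, hypothetical illustration: the record fields (`model`, `question_type`, `correct`) and the sample data are assumptions for demonstration, not the paper's actual evaluation harness or results.

```python
from collections import defaultdict

# Hypothetical evaluation records; field names and values are illustrative,
# not taken from the CresOWLve release.
records = [
    {"model": "model-a", "question_type": "factual", "correct": True},
    {"model": "model-a", "question_type": "creative", "correct": False},
    {"model": "model-b", "question_type": "factual", "correct": True},
    {"model": "model-b", "question_type": "creative", "correct": True},
]

# Tally correct / total answers per (model, question type).
tallies = defaultdict(lambda: [0, 0])  # (model, qtype) -> [correct, total]
for r in records:
    key = (r["model"], r["question_type"])
    tallies[key][0] += int(r["correct"])
    tallies[key][1] += 1

# Accuracy per subset, then the factual-minus-creative gap in percentage points.
for model in sorted({m for m, _ in tallies}):
    acc = {}
    for qtype in ("factual", "creative"):
        correct, total = tallies[(model, qtype)]
        acc[qtype] = 100.0 * correct / total if total else 0.0
    gap = acc["factual"] - acc["creative"]
    print(f"{model}: factual {acc['factual']:.1f}%, "
          f"creative {acc['creative']:.1f}%, gap {gap:+.1f} pts")
```

Under this reading, the paper's reported "up to 17%" figure would correspond to the largest per-model gap between the two question subsets; whether it is measured in percentage points or relative terms is not specified in the abstract.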