Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models

arXiv cs.CL · April 21, 2026


Key Points

  • The paper introduces SemanticQA, a new evaluation suite for testing language models on semantic phrase processing tasks.
  • SemanticQA consolidates and reorganizes existing multiword expression (MWE) resources into a unified benchmark spanning lexical collocations plus three fine-grained categories: idioms, noun compounds, and verbal constructions.
  • The benchmark evaluates multiple model architectures and sizes across extraction, classification, interpretation, and sequentially composed tasks to examine end-to-end semantic reasoning.
  • Results show significant performance disparities, especially on semantic reasoning tasks, indicating that models differ in their reasoning ability and semantic understanding of complex phrases.
  • The authors provide the evaluation harness and dataset publicly via GitHub to support further research into stronger comprehension for non-trivial semantic phrases.
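The sequentially composed tasks chain the stages so that each one consumes the previous stage's output, meaning errors compound end to end. The toy sketch below illustrates that scoring scheme; the dataset, the lookup-table "model", and all function names are invented for illustration and are not part of the released SemanticQA harness.

```python
# Hypothetical sketch of a sequentially composed evaluation in the spirit of
# SemanticQA: extraction -> classification -> interpretation. All data and the
# toy "model" below are illustrative assumptions, not the actual harness.

from dataclasses import dataclass

@dataclass
class Example:
    sentence: str
    span: str        # gold multiword expression
    category: str    # e.g. "idiom", "noun_compound", "verbal_construction"
    meaning: str     # gold paraphrase

# Tiny invented dataset.
DATA = [
    Example("He kicked the bucket last year.", "kicked the bucket",
            "idiom", "died"),
    Example("She bought a coffee table.", "coffee table",
            "noun_compound", "a low table for serving drinks"),
]

# Stand-in for an LM call: a lookup table keyed on (task, input).
TOY_MODEL = {
    ("extract", "He kicked the bucket last year."): "kicked the bucket",
    ("classify", "kicked the bucket"): "idiom",
    ("interpret", "kicked the bucket"): "died",
    ("extract", "She bought a coffee table."): "coffee table",
    ("classify", "coffee table"): "noun_compound",
    ("interpret", "coffee table"): "a table",  # deliberately wrong
}

def run_composed(example: Example) -> bool:
    """Credit a composed task only if every stage in the chain is correct."""
    span = TOY_MODEL.get(("extract", example.sentence), "")
    if span != example.span:
        return False
    if TOY_MODEL.get(("classify", span), "") != example.category:
        return False
    return TOY_MODEL.get(("interpret", span), "") == example.meaning

accuracy = sum(run_composed(ex) for ex in DATA) / len(DATA)
print(f"composed-task accuracy: {accuracy:.2f}")  # errors propagate downstream
```

Because a single stage failure zeroes out the whole chain, composed-task accuracy is typically far below the accuracy of any individual stage, which is one way such benchmarks expose gaps in end-to-end semantic reasoning.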

Abstract

We present SemanticQA, an evaluation suite designed to assess language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MWE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, as well as sequential task compositions. We reveal substantial performance variation, particularly on tasks requiring semantic reasoning, highlighting differences in the reasoning efficacy and semantic understanding of LMs and providing insights toward building LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.