Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models

arXiv cs.AI / 4/29/2026


Key Points

  • The study argues that LLM-based natural-language analytics systems fail for a shared root cause: the model must infer business semantics that the database schema does not encode, producing both wrong answers and confident hallucinations.
  • It benchmarks three frontier models (Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.4) on 100 questions using ClickHouse over the Cleaned Contoso Retail Dataset, comparing schema-only prompting versus schema plus a 4 KB hand-authored “semantic layer” markdown document.
  • Adding the semantic-layer document boosts accuracy by about +17 to +23 percentage points across all models, curbing hallucination-prone interpretation by grounding it in explicit definitions.
  • After adding the document, all three models perform similarly (67.7–68.7%); without it they are also similar (45.5–50.5%), and all cross-cluster comparisons are significant at p < 0.01.
  • The authors conclude that the explicit business-semantics input itself is the key driver: it changes the task the model is asked to perform, suppressing the dominant text-to-SQL error mode, while model choice within a tier matters comparatively little.
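
The paired protocol described above can be sketched as follows. All names here (`build_prompt`, `SCHEMA_DDL`, `SEMANTIC_LAYER_MD`) are hypothetical illustrations; the paper's actual harness, schema, and prompt wording are not reproduced in this summary.

```python
# Hypothetical sketch of the paired single-shot protocol: each question is
# posed twice, once with only the warehouse schema, once with the schema
# plus the hand-authored semantic-layer markdown document.

SCHEMA_DDL = "CREATE TABLE sales (order_date Date, net_amount Decimal(18, 2))"

SEMANTIC_LAYER_MD = """\
## Measures
- **Revenue**: SUM(net_amount), i.e. after discounts, excluding tax.
## Conventions
- "Last year" means the most recent complete calendar year in the data.
"""

def build_prompt(question: str, with_semantic_layer: bool) -> str:
    """Build one of the two paired prompt conditions for a single question."""
    parts = [
        "You answer questions by writing ClickHouse SQL.",
        "Schema:\n" + SCHEMA_DDL,
    ]
    if with_semantic_layer:
        parts.append("Business semantics:\n" + SEMANTIC_LAYER_MD)
    parts.append("Question: " + question)
    return "\n\n".join(parts)

question = "What was total revenue last year?"
baseline = build_prompt(question, with_semantic_layer=False)
treatment = build_prompt(question, with_semantic_layer=True)
```

The two prompts differ only in the presence of the semantic-layer document, so any accuracy difference on the same question set can be attributed to that input rather than to prompt phrasing.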

Abstract

LLMs deployed for natural-language querying of analytical databases suffer from two intertwined failure modes, incorrect answers and confident hallucinations, both rooted in the same cause: the model is forced to infer business semantics that the schema does not encode. We test whether supplying those semantics as context closes the gap. We benchmark three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse, using a paired single-shot protocol. Each model is evaluated twice: once given only the warehouse schema, and once given the schema plus a 4 KB hand-authored markdown document describing the dataset's measures, conventions, and disambiguation rules. Adding the document improves accuracy by +17 to +23 percentage points across all three models. With it, the three models are statistically indistinguishable (67.7-68.7%); without it, they are also indistinguishable (45.5-50.5%). Every cross-cluster comparison is significant at p < 0.01. The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not. We interpret this as a structural result: explicit business semantics suppress the dominant class of text-to-SQL errors not by making the model more capable, but by changing what the model is being asked to do.
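
To make the "significant at p < 0.01" claim concrete, a cross-cluster comparison of this kind can be checked with a standard two-proportion z-test. The counts below are purely illustrative (the abstract reports percentage ranges, not raw per-model counts); the paper's exact test statistic is not specified in this summary.

```python
from math import sqrt, erfc

def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int):
    """Two-sided two-proportion z-test on accuracy counts.

    Returns (z, p_value) using the pooled-variance normal approximation.
    """
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return z, p_value

# Illustrative only: roughly 68/100 correct with the semantic layer
# versus 46/100 without, matching the reported accuracy ranges.
z, p = two_proportion_z(68, 100, 46, 100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With a gap of this size on 100 paired questions, the test comfortably clears the p < 0.01 threshold, which is consistent with every with-layer vs. without-layer comparison being significant while within-cluster model differences are not.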