Evergreen: Efficient Claim Verification for Semantic Aggregates

arXiv cs.AI / April 30, 2026


Key Points

  • The paper introduces Evergreen, a system for efficiently verifying claims inside LLM-generated semantic aggregates that may not be grounded in the source data.
  • Evergreen converts each claim into a declarative semantic verification query executed on the same semantic query engine that produced the aggregate, using optimizations such as early stopping, relevance sorting, confidence-sequence-based estimation, operator fusion, similarity filtering, and prompt caching (a minimal sketch of the verification loop follows this list).
  • It outputs verdicts with citations that identify a minimal set of supporting tuples, with provenance semantics based on semiring provenance for first-order logic.
  • Experiments on production-inspired restaurant review benchmarks show Evergreen reaches perfect verification quality (F1=1.00) with a strong LLM, cutting verification cost by 3.2× and latency by 4.0× versus unoptimized approaches.
  • With a much weaker LLM, Evergreen still outperforms a strong LLM-as-a-judge baseline in F1 at 48× lower cost and 2.3× lower latency, and matches the F1 of a retrieval-augmented agent at 63× lower cost and 4.2× lower latency.

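The verification-aware optimizations can be pictured with a short sketch. The snippet below is illustrative only, not Evergreen's actual interface: it checks a majority-style claim (e.g., "most reviews praise the desserts") with one hypothetical LLM predicate call per row (`judge`), a cheap relevance score (`relevance`) used to order the rows, and early stopping once the verdict can no longer change.

```python
# Hedged sketch of verification with relevance sorting and early stopping.
# `judge` and `relevance` are hypothetical stand-ins, not Evergreen's API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Verdict:
    supported: bool
    citations: List[int]               # ids of the tuples that justify the verdict

def verify_majority(rows: Dict[int, str],
                    judge: Callable[[str], bool],       # LLM predicate on one row
                    relevance: Callable[[str], float],  # cheap similarity to the claim
                    threshold: float = 0.5) -> Verdict:
    """Verify 'more than `threshold` of the rows satisfy the predicate'."""
    n = len(rows)
    need = int(threshold * n) + 1      # supporting rows required for the claim
    hits, misses, citations = 0, 0, []
    # Relevance sorting: examine the most promising rows first so early
    # stopping tends to trigger sooner.
    for row_id, text in sorted(rows.items(), key=lambda kv: -relevance(kv[1])):
        if judge(text):                # one LLM call per inspected row
            hits += 1
            citations.append(row_id)
        else:
            misses += 1
        # Early stopping: the remaining rows can no longer change the verdict.
        if hits >= need:
            return Verdict(True, citations)
        if misses > n - need:
            return Verdict(False, citations)
    return Verdict(hits >= need, citations)
```

Because rows are visited in relevance order, a true claim is usually confirmed after judging only its strongest witnesses, which is where most of the savings in LLM calls come from; the confidence-sequence estimation mentioned above plays a similar role for claims about proportions rather than exact counts.
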
Abstract

With recent semantic query processing engines, semantic aggregation has become a primitive operator, enabling the reduction of a relation into a natural language aggregate using an LLM. However, the resulting semantic aggregate may contain claims that are not grounded in the underlying relation. Verifying such claims is challenging: they often involve quantifiers, groupings, and comparisons over relations that far exceed LLM context windows and require a costly combination of semantic and symbolic processing. We present Evergreen, a system that recasts claim verification as a semantic query processing task with tailored optimizations and provenance capture. Evergreen compiles each claim into a declarative semantic verification query and executes it on the same engine that produced the aggregate. To reduce cost and latency, Evergreen avoids unnecessary LLM calls through verification-aware optimizations (early stopping, relevance sorting, and estimation with confidence sequences) and general-purpose optimizations for semantic queries (operator fusion, similarity filtering, and prompt caching). Each verdict is accompanied by citations that identify a minimal set of tuples justifying the result, with semantics based on semiring provenance for first-order logic. On a benchmark of real-world restaurant review datasets reflecting production-inspired workloads, Evergreen achieves excellent verification quality (F1 = 1.00) with a strong LLM while reducing cost by 3.2x and latency by 4.0x compared to unoptimized verification. Even with a significantly weaker LLM, Evergreen outperforms a strong LLM-as-a-judge baseline in F1 at 48x lower cost and 2.3x lower latency. Relative to a retrieval-augmented agent, Evergreen compares favorably in F1 and latency with similar cost when both use a strong LLM; yet, with a much weaker LLM, it achieves the same F1 at 63x lower cost and 4.2x lower latency.
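
The citation semantics can likewise be sketched in miniature. In a why-provenance view (assumed here as a simplification of the paper's semiring provenance for first-order logic), each tuple carries its identifier, an existential quantifier collects alternative witness sets, a conjunction combines one alternative from each side, and the smallest surviving set becomes the citation. The functions and the example claim below are hypothetical illustrations, not Evergreen's interface.

```python
# Hedged sketch of why-provenance-style citations; not Evergreen's actual API.
from itertools import product

def exists_provenance(per_row_witnesses):
    """Alternative witness sets for 'there exists a row such that ...'.
    Each element is a frozenset of tuple ids (typically a single id)."""
    return set(per_row_witnesses)

def and_provenance(alts_a, alts_b):
    """Witness sets for a conjunction: union one alternative from each side."""
    return {a | b for a, b in product(alts_a, alts_b)}

def minimal_citation(alternatives):
    """Report a smallest witness set as the citation."""
    return min(alternatives, key=len) if alternatives else frozenset()

# Example claim: "some review praises the desserts AND some review mentions wait times"
praise = exists_provenance([frozenset({3}), frozenset({7})])
waits = exists_provenance([frozenset({7}), frozenset({12})])
print(minimal_citation(and_provenance(praise, waits)))  # frozenset({7})
```

If review 7 both praises the desserts and mentions wait times, citing it alone justifies the whole conjunction, which matches the abstract's goal of identifying a minimal set of supporting tuples.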