STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

arXiv cs.AI / 4/28/2026


Key Points

  • The paper introduces STELLAR-E, a fully automated system that generates high-quality synthetic evaluation datasets tailored to specific domains and languages without relying on existing data sources.
  • STELLAR-E operates in two stages: it modifies the TGRT Self-Instruct framework to produce controllable synthetic datasets, then runs an evaluation pipeline using both statistical and LLM-based metrics (a minimal sketch of this flow follows this list).
  • Models score on average +5.7% higher in LLM-as-a-judge terms on the synthetic datasets than on existing language-specific benchmarks, indicating comparable quality for evaluating both large and smaller LLMs.
  • The authors note that real datasets remain somewhat harder for LLMs, particularly smaller models, but the approach provides a scalable and domain-adaptable benchmarking framework for faster, fairer evaluation workflows.
  • By sidestepping privacy and regulatory barriers and the time cost of manual dataset creation, STELLAR-E aims to enable high-efficiency automated quality-assurance cycles for LLM application evaluation.
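
To make the two-stage flow in the bullets above concrete, here is a minimal Python sketch of a Self-Instruct-style generation loop. The paper's modified TGRT Self-Instruct engine is not reproduced here; `call_llm`, the seed tasks, the prompt template, and the deduplication threshold are all illustrative assumptions.

```python
# Minimal sketch of a Self-Instruct-style generation loop (illustrative only;
# the paper's modified TGRT Self-Instruct engine is not public).
import difflib
import itertools

# Hypothetical seed tasks; STELLAR-E takes minimal human inputs like these.
SEED_TASKS = [
    "Summarize a customer complaint about a delayed shipment.",
    "Draft a polite refund-request email in German.",
]

# Demo-only stub: cycles through canned tasks so the example runs offline.
# A real LLM call returns novel text for each prompt.
_DEMO_TASKS = itertools.cycle([
    "Classify the sentiment of a support ticket.",
    "Translate a product FAQ entry into French.",
    "Extract the order number from a chat transcript.",
    "Write a one-line apology for a billing error.",
])

def call_llm(prompt: str) -> str:
    """Stub for a chat-completion call; replace with a real model endpoint."""
    return next(_DEMO_TASKS)

def is_near_duplicate(candidate: str, pool: list[str], threshold: float = 0.8) -> bool:
    # Cheap lexical dedup; the paper's actual filtering may differ.
    return any(
        difflib.SequenceMatcher(None, candidate, seen).ratio() > threshold
        for seen in pool
    )

def generate_dataset(domain: str, language: str, target_size: int) -> list[str]:
    """Grow a task pool from the seeds until it reaches the requested size."""
    pool = list(SEED_TASKS)
    while len(pool) < target_size:
        prompt = (
            f"Write one new {domain} task instruction in {language}, "
            "different from these examples:\n- " + "\n- ".join(pool[-4:])
        )
        candidate = call_llm(prompt).strip()
        if candidate and not is_near_duplicate(candidate, pool):
            pool.append(candidate)
    return pool[:target_size]

print(generate_dataset("customer-support", "English", 5))
```

The key property the sketch tries to capture is controllability: the domain, language, and target size are explicit parameters rather than properties inherited from a pre-existing corpus.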

Abstract

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, single-domain focus, and a lack of multilingual support. We present STELLAR-E, a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets. The system is structured in two stages: (1) a synthetic data engine, built by modifying the TGRT Self-Instruct framework, that enables controllable generation of custom synthetic datasets, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for LLM-based application evaluations. On LLM-as-a-judge scores, the synthetic datasets show an average difference of +5.7% against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of both large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality-assurance cycles.
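
For intuition about the +5.7% figure, the sketch below shows how such an LLM-as-a-judge comparison can be computed: score a model's answers on both the synthetic set and a real benchmark with a judge, then take the difference of the means. The `judge_score` stub, rubric, and 0-100 scale are assumptions, not the paper's pipeline, which also incorporates statistical metrics not shown here.

```python
# Sketch of the headline comparison: mean LLM-as-a-judge score on the synthetic
# set minus the mean score on a real language-specific benchmark.
from statistics import mean

def judge_score(question: str, answer: str) -> float:
    """Stub judge: in practice, prompt a strong LLM to rate `answer` on a
    fixed rubric (e.g., 0-100) and parse the numeric rating it returns."""
    return float(len(answer) % 101)  # dummy stand-in, not a real metric

def avg_judge_score(qa_pairs: list[tuple[str, str]]) -> float:
    """Average judge score over (question, model_answer) pairs."""
    return mean(judge_score(q, a) for q, a in qa_pairs)

# Hypothetical answers from the model under evaluation on each dataset.
synthetic_answers = [("Q1?", "a concise model answer"), ("Q2?", "another answer")]
real_answers = [("Q1?", "a reference-style answer"), ("Q2?", "one more answer")]

delta = avg_judge_score(synthetic_answers) - avg_judge_score(real_answers)
print(f"Judge-score difference (synthetic - real): {delta:+.1f} points")
# A positive delta, like the paper's average +5.7%, means models score slightly
# higher on the synthetic sets, i.e. the real benchmarks are a bit harder.
```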