To Write or to Automate Linguistic Prompts, That Is the Question

arXiv cs.CL / March 27, 2026


Key Points

  • The study presents the first systematic comparison of hand-crafted expert zero-shot prompts versus automatic prompt optimization using DSPy signatures, including GEPA-optimized variants, across translation, terminology insertion, and language quality assessment (LQA).
  • Results are highly task-dependent: terminology insertion shows mostly no statistically meaningful quality difference between optimized and manual prompts, while translation and LQA exhibit different winners depending on the model configuration.
  • For translation, different prompt approaches outperform on different models, suggesting no universal prompting strategy for all linguistic tasks.
  • In LQA, expert prompts tend to achieve stronger error detection, but GEPA optimization improves model characterization, indicating distinct strengths between manual expertise and automated search.
  • Overall, GEPA can elevate minimal DSPy signatures, yet most expert-versus-optimized comparisons show no statistically significant difference; the work also highlights an asymmetric setup in which GEPA relies on programmatic search over gold-standard splits, while expert prompts can be developed without labeled data through iterative refinement.

Abstract

LLM performance is highly sensitive to prompt design, yet whether automatic prompt optimization can replace expert prompt engineering in linguistic tasks remains unexplored. We present the first systematic comparison of hand-crafted zero-shot expert prompts, base DSPy signatures, and GEPA-optimized DSPy signatures across translation, terminology insertion, and language quality assessment, evaluating five model configurations. Results are task-dependent. In terminology insertion, optimized and manual prompts produce mostly statistically indistinguishable quality. In translation, each approach wins on different models. In LQA, expert prompts achieve stronger error detection while optimization improves characterization. Across all tasks, GEPA elevates minimal DSPy signatures, and the majority of expert-optimized comparisons show no statistically significant difference. We note that the comparison is asymmetric: GEPA optimization searches programmatically over gold-standard splits, whereas expert prompts in principle require no labeled data, relying instead on domain expertise and iterative refinement.
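To make the contrast concrete, the sketch below juxtaposes a minimal DSPy-style signature string with a hand-crafted expert zero-shot prompt for the translation task. All wording is hypothetical and illustrative, not the paper's actual prompts:

```python
# Illustrative contrast (hypothetical wording, not the paper's actual prompts).

# A minimal DSPy signature declares only the task's inputs and outputs;
# as a string it can be as terse as:
minimal_signature = "source_text, target_language -> translation"
# In DSPy, such a string is typically passed to dspy.Predict(minimal_signature),
# and an optimizer like GEPA can then search (over gold-standard splits) for
# detailed instructions to attach to it.

# A hand-crafted expert zero-shot prompt instead encodes domain guidance
# explicitly up front, with no labeled data required to write it:
expert_prompt = (
    "You are a professional translator. Translate the source text into the "
    "target language, preserving meaning, register, and approved terminology. "
    "Return only the translation."
)

# Parse the signature's declared I/O fields to show what little it specifies.
inputs, outputs = (side.strip() for side in minimal_signature.split("->"))
print(inputs)   # -> source_text, target_language
print(outputs)  # -> translation
```

This is the asymmetry the paper highlights: the minimal signature carries almost no linguistic knowledge and depends on programmatic optimization to acquire it, whereas the expert prompt front-loads that knowledge through human iteration alone.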