AI Navigate

From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality

arXiv cs.CL / 3/16/2026

Key Points

  • The study uses a factorial design to examine how forecasting model choice, XAI method, LLM selection, and prompting strategy affect the quality of natural language explanations (NLEs).
  • It spans four forecasting models (XGBoost, Random Forest, MLP, SARIMAX), three XAI conditions (SHAP, LIME, and no-XAI), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies. In total, 660 explanations for time-series forecasting are evaluated with G-Eval, an LLM-as-a-judge method, using dual LLM judges and four criteria.
  • Results suggest that XAI provides only small improvements over the no-XAI baseline, and only for expert audiences, while LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3.
  • The study reports an interpretability paradox: in its setting, SARIMAX yields lower NLE quality than the ML models despite higher prediction accuracy. Zero-shot prompting is competitive with self-consistency at 7 times lower cost, and chain-of-thought hurts rather than helps.

Abstract

Explainable AI (XAI) methods like SHAP and LIME produce numerical feature attributions that remain inaccessible to non-expert users. Prior work has shown that Large Language Models (LLMs) can transform these outputs into natural language explanations (NLEs), but it remains unclear which factors contribute to high-quality explanations. We present a systematic factorial study investigating how forecasting model choice, XAI method, LLM selection, and prompting strategy affect NLE quality. Our design spans four models (XGBoost (XGB), Random Forest (RF), Multilayer Perceptron (MLP), and SARIMAX; comparing black-box Machine Learning (ML) against classical time-series approaches), three XAI conditions (SHAP, LIME, and a no-XAI baseline), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies. Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting. Our results suggest that: (1) XAI provides only small improvements over no-XAI baselines, and only for expert audiences; (2) LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3; (3) we observe an interpretability paradox: in our setting, SARIMAX yielded lower NLE quality than ML models despite higher prediction accuracy; (4) zero-shot prompting is competitive with self-consistency at 7 times lower cost; and (5) chain-of-thought hurts rather than helps.
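
To make the factorial design concrete, the following is a minimal sketch of how the four factors named in the abstract could be crossed into a grid of experimental conditions. It is an illustration, not the authors' code: the function name `factorial_grid` and the dictionary keys are hypothetical, and because the abstract names only three of the eight prompting strategies, the remaining five appear as `strategy_4` to `strategy_8` placeholders.

```python
from itertools import product

# Factor levels as listed in the abstract.
FORECASTERS = ["XGBoost", "Random Forest", "MLP", "SARIMAX"]
XAI_CONDITIONS = ["SHAP", "LIME", "no-XAI"]
LLMS = ["GPT-4o", "Llama-3-8B", "DeepSeek-R1"]

# Assumption: only these three strategies are named in the abstract;
# the other five are placeholders, not the paper's actual strategies.
PROMPTS = ["zero-shot", "chain-of-thought", "self-consistency"] + [
    f"strategy_{i}" for i in range(4, 9)
]


def factorial_grid():
    """Return every combination of the four factors (a full crossing)."""
    return [
        {"forecaster": f, "xai": x, "llm": m, "prompt": p}
        for f, x, m, p in product(FORECASTERS, XAI_CONDITIONS, LLMS, PROMPTS)
    ]


grid = factorial_grid()
print(len(grid))  # 4 * 3 * 3 * 8 = 288 conditions
```

A full crossing of the listed levels gives 288 conditions, whereas the study evaluates 660 explanations. The abstract does not say how explanations are distributed across conditions, so this sketch enumerates the design space only and does not attempt to reproduce that count.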