From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
arXiv cs.CL / 3/16/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study uses factorial design to examine how forecasting model choice, XAI method, LLM selection, and prompting strategy affect the quality of natural language explanations.
- It spans four models (XGBoost, Random Forest, MLP, SARIMAX), three XAI conditions (SHAP, LIME, and no-XAI), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies, evaluating 660 explanations for time-series forecasting with G-Eval using dual LLM judges and four criteria.
- Results show that XAI provides only small improvements over no-XAI, and mainly for expert audiences, while LLM choice dominates all factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3.
- The study reveals an interpretability paradox: SARIMAX, despite higher forecasting accuracy, yields lower natural language explanation (NLE) quality than the ML models. On prompting, zero-shot is competitive with self-consistency at 7x lower cost, and chain-of-thought hurts rather than helps.