From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
arXiv cs.CL / 3/16/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study uses factorial design to examine how forecasting model choice, XAI method, LLM selection, and prompting strategy affect the quality of natural language explanations.
- It spans four models (XGBoost, Random Forest, MLP, SARIMAX), three XAI conditions (SHAP, LIME, and no-XAI), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies, evaluating 660 explanations for time-series forecasting with G-Eval using dual LLM judges and four criteria.
- Results show that XAI provides only small improvements over no-XAI, and mainly for expert audiences, while LLM choice dominates all factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3.
- The study reveals an interpretability paradox: SARIMAX, despite higher forecasting accuracy, yields lower natural language explanation (NLE) quality than the ML models. On prompting, zero-shot is competitive with self-consistency at 7x lower cost, and chain-of-thought hurts rather than helps.