From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
arXiv cs.CL / 3/16/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study uses a factorial design to examine how forecasting model choice, XAI method, LLM selection, and prompting strategy affect the quality of natural-language explanations (NLEs).
- It spans four forecasting models (XGBoost, Random Forest, MLP, SARIMAX), three XAI conditions (SHAP, LIME, and no-XAI), three LLMs (GPT-4o, Llama-3-8B, DeepSeek-R1), and eight prompting strategies, evaluating 660 explanations of time-series forecasts with G-Eval, using dual LLM judges and four criteria.
- Results show that XAI provides only small improvements over the no-XAI baseline, and mainly for expert audiences, while LLM choice dominates all other factors, with DeepSeek-R1 outperforming GPT-4o and Llama-3-8B.
- The study also reveals an interpretability paradox: SARIMAX yields lower NLE quality than the ML models despite higher forecast accuracy; zero-shot prompting is competitive with self-consistency at roughly one-seventh of the cost; and chain-of-thought prompting hurts rather than helps.
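The factorial structure above can be made concrete with a minimal sketch that enumerates the condition grid. The factor levels are taken from the summary, but most of the eight prompting-strategy names are placeholders (only zero-shot, chain-of-thought, and self-consistency are mentioned), and since a single full crossing gives 288 cells rather than 660 explanations, the study presumably used replicates or additional variation not captured here.

```python
from itertools import product

# Factor levels as reported in the summary above.
forecast_models = ["XGBoost", "RandomForest", "MLP", "SARIMAX"]
xai_conditions = ["SHAP", "LIME", "no-XAI"]
llms = ["GPT-4o", "Llama-3-8B", "DeepSeek-R1"]
prompt_strategies = [
    # Eight strategies; names beyond the first three are assumptions.
    "zero-shot", "chain-of-thought", "self-consistency", "few-shot",
    "role-play", "expert-audience", "lay-audience", "structured-output",
]

# Full factorial grid: every combination is one explanation condition.
conditions = list(product(forecast_models, xai_conditions, llms, prompt_strategies))
print(len(conditions))  # 4 * 3 * 3 * 8 = 288 cells in one full crossing
```

Enumerating the grid this way makes it easy to attach per-cell quality scores later and marginalize over any one factor, which is how main effects like "LLM choice dominates" are typically read off a factorial design.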