Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

arXiv cs.CL / 4/17/2026

📰 News · Models & Research

Key Points

  • The paper investigates whether natural-language, post-hoc explanations for LLM decisions are epistemically faithful—i.e., whether they reflect the internal evidence the model actually used—rather than merely looking convincing.
  • Using counterfactual evaluation, it finds that LLM-generated textual explanations are often unfaithful to the model’s true decision evidence.
  • It proposes “Faithfulness Serum,” a training-free approach that improves explanation faithfulness by applying attention-level interventions during explanation generation.
  • The method uses token-level heatmaps derived from a faithful attribution technique to guide the model toward explanations aligned with the relevant internal signals.
  • Experiments show significant gains in epistemic faithfulness across multiple LLMs, benchmarks, and prompting setups.
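The paper does not include an implementation here, but the attention-level intervention described above can be sketched roughly as follows: bias the pre-softmax attention scores toward tokens that a token-level attribution heatmap marks as decision-relevant, so the explanation attends to the model's actual evidence. The function names, the additive-bias formulation, and the `strength` parameter are all illustrative assumptions, not the paper's method.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(attn_logits, heatmap, strength=1.0):
    """Sketch of an attention-level intervention (assumed form).

    attn_logits: (queries, keys) pre-softmax attention scores.
    heatmap:     (keys,) token-level attribution scores from a
                 faithful attribution method (higher = more relevant).
    strength:    how strongly to steer attention toward relevant tokens.
    """
    # Additive bias: relevant tokens get a boost before normalization.
    biased = attn_logits + strength * heatmap[None, :]
    return softmax(biased, axis=-1)

# Toy illustration: uniform attention over 4 tokens, token 2 is
# flagged as the decision evidence by the heatmap.
logits = np.zeros((1, 4))
heat = np.array([0.0, 0.0, 1.0, 0.0])
weights = guided_attention(logits, heat, strength=2.0)
```

With `strength=0` this reduces to ordinary attention; larger values concentrate attention mass on the attribution-highlighted tokens during explanation generation, which is the kind of training-free steering the key points describe.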

Abstract

Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability means they are still treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, i.e., whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.
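The counterfactual assessment mentioned in the abstract can be illustrated with a toy check: if an explanation is epistemically faithful, masking the tokens it cites as evidence should change the model's decision. The helper below is a hypothetical sketch of that idea; the `predict` interface, the masking scheme, and the flip criterion are assumptions for illustration, not the paper's evaluation protocol.

```python
def counterfactual_faithfulness(predict, tokens, cited_idx, mask="[MASK]"):
    """Toy counterfactual test (assumed form).

    predict:   callable mapping a token list to a label.
    tokens:    the input as a list of tokens.
    cited_idx: indices of tokens the explanation cites as evidence.

    Returns True if masking the cited tokens flips the prediction,
    i.e. the explanation points at evidence the model actually used.
    """
    original = predict(tokens)
    perturbed = [mask if i in cited_idx else t for i, t in enumerate(tokens)]
    return predict(perturbed) != original

# Toy model: predicts "pos" iff the word "good" is present.
toy_predict = lambda toks: "pos" if "good" in toks else "neg"
tokens = ["the", "movie", "was", "good"]
```

Here an explanation citing token 3 ("good") passes the check, while one citing token 0 ("the") fails, mirroring the paper's finding that convincing-looking rationales can be unfaithful to the decision evidence.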