Towards Verified and Targeted Explanations through Formal Methods

arXiv cs.LG / April 17, 2026


Key Points

  • The paper argues that current explainable AI (XAI) techniques often provide feature attributions without formal guarantees about how decision boundaries behave under perturbations.
  • It highlights that safety-critical misclassifications have different real-world severities, motivating explanations that are targeted toward user-specified critical alternatives.
  • The authors introduce ViTaX (Verified and Targeted Explanations), a formal XAI framework that produces targeted semifactual explanations and verifies them using mathematical reachability analysis.
  • ViTaX identifies the smallest sensitive subset of features for the transition from an original class y to a user-chosen target class t, then guarantees that perturbing those features within epsilon cannot cause the prediction to flip to t.
  • Experiments on MNIST, GTSRB, EMNIST, and TaxiNet show more than 30% improvement in fidelity while using minimal explanation cardinality.

Abstract

As deep neural networks are deployed in safety-critical domains such as autonomous driving and medical diagnosis, stakeholders need explanations that are not only interpretable but also trustworthy, with formal guarantees. Existing XAI methods fall short: heuristic attribution techniques (e.g., LIME, Integrated Gradients) highlight influential features but offer no mathematical guarantees about decision boundaries, while formal methods verify robustness yet remain untargeted, analyzing the nearest boundary regardless of whether it represents a critical risk. In safety-critical systems, not all misclassifications carry equal consequences; confusing a "Stop" sign for a "60 kph" sign is far more dangerous than confusing it with a "No Passing" sign. We introduce ViTaX (Verified and Targeted Explanations), a formal XAI framework that generates targeted semifactual explanations with mathematical guarantees. For a given input (class y) and a user-specified critical alternative (class t), ViTaX: (1) identifies the minimal feature subset most sensitive to the y→t transition, and (2) applies formal reachability analysis to guarantee that perturbing these features by epsilon cannot flip the classification to t. We formalize this through Targeted epsilon-Robustness, certifying whether a feature subset remains robust under perturbation toward a specific target class. ViTaX is the first method to provide formally guaranteed explanations of a model's resilience against user-identified alternatives. Evaluations on MNIST, GTSRB, EMNIST, and TaxiNet demonstrate over 30% fidelity improvement with minimal explanation cardinality.
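To make the Targeted epsilon-Robustness property concrete, here is a minimal sketch of what the check means operationally. All names (`is_targeted_eps_robust`, the toy linear classifier) are hypothetical illustrations, not the paper's implementation: ViTaX certifies the property with formal reachability analysis, whereas this sampling-based sketch can only *falsify* robustness, never certify it.

```python
import numpy as np

def is_targeted_eps_robust(predict, x, S, eps, target, n_samples=1000, seed=0):
    """Empirical check of Targeted eps-Robustness: perturb only the
    features in subset S by at most eps and test whether the prediction
    ever flips to the user-chosen target class. A sound verifier would
    use reachability analysis over the whole perturbation set instead
    of random sampling."""
    rng = np.random.default_rng(seed)
    for _ in range(n_samples):
        x_pert = x.copy()
        x_pert[S] += rng.uniform(-eps, eps, size=len(S))
        if predict(x_pert) == target:
            return False  # found a perturbation that reaches class t
    return True  # no flip found within the sampling budget

# Toy 3-class linear classifier (a stand-in for a trained DNN)
W = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
predict = lambda x: int(np.argmax(W @ x))

x = np.array([1.0, 0.2])  # predicted class y = 0
print(is_targeted_eps_robust(predict, x, S=[1], eps=0.1, target=1))  # small eps: robust
print(is_targeted_eps_robust(predict, x, S=[1], eps=1.0, target=1))  # large eps: flip found
```

The key contrast with untargeted robustness is the `target` argument: a perturbation that flips the prediction to some *other* class is ignored; only transitions toward the user-identified critical alternative t count as failures.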