1. Counterfactual Explanation
Counterfactual explanation: given input data that a deep network classifies as some class, the task is to perturb a subset of the features in the input data such that the model is forced to predict a target class for the perturbed data.
The framework for counterfactual explanation is described in Fig. 1.

From the perturbed data, we can interpret the perturbed parts (regions) as what the pre-trained model regards as the discriminative features between the original and target classes, as in Fig. 2.

For this, the perturbed data for counterfactual explanation should satisfy two desirable properties.
- Explainability: a generated explanation should be naturally understood by humans.
- Minimality: only a few features should be perturbed.
2. Method
To generate counterfactual explanations, we propose a method based on gradual construction that takes the statistics learned from the training data into account. In particular, counterfactual explanations are generated by iterating over a masking step and a composition step.

2.1 Problem definition
- Input (original) data: $X \in \mathbb{R}^{N}$, with predicted class $c$ under a pre-trained model $f$
- Perturbed data: $\tilde{X}$, whose predicted class should become the target class
- Binary mask: $M \in \{0, 1\}^{N}$
- Composite: $C \in \mathbb{R}^{N}$
- Target class: $t$
- Desired classification score for the target class: $\hat{p}$

The mask $M$ indicates which features of $X$ are replaced, and the composite $C$ provides the values that replace them. To produce perturbed data, the input, mask, and composite are combined as

$$\tilde{X} = (1 - M) \odot X + M \odot C,$$

where $\odot$ denotes element-wise multiplication. The goal is to find $M$ and $C$ such that $f$ classifies $\tilde{X}$ as the target class $t$.
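This composition is straightforward to write down. Below is a minimal PyTorch sketch; the function name `compose` and the assumption that $X$, $M$, and $C$ are tensors of the same shape are illustrative, not taken from the paper's code.

```python
import torch

def compose(x, mask, composite):
    """Build perturbed data from the original input, a binary mask, and a composite.

    Keeps original features where mask == 0 and substitutes composite
    features where mask == 1, i.e., X~ = (1 - M) * X + M * C.
    """
    return (1 - mask) * x + mask * composite
```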
2.2 Masking step
The goal of the masking step is to select the most influential feature for producing the target class from the pre-trained network, as follows:

$$\min_{M}\ \mathcal{L}\big(f(\tilde{X}),\ t\big), \qquad \tilde{X} = (1 - M) \odot X + M \odot C,$$

where $\mathcal{L}$ denotes the classification (cross-entropy) loss for the target class $t$. Suppose we try to optimize this objective over the mask directly. Since the mask $M$ is a binary variable, the problem becomes a combinatorial search over feature subsets. This is computationally intractable for high-dimensional inputs. Thus, we choose a sub-optimal index greedily: Eq. (5) scores each not-yet-selected feature by the gradient of the loss with respect to its mask entry. In summary, each masking step selects an index in descending order of the score computed by Eq. (5) and changes the corresponding zero entry of the mask $M$ to one.
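As a rough illustration of this greedy selection, the sketch below ranks features by the magnitude of the gradient of the target-class loss with respect to the mask and returns the best not-yet-selected index. The gradient-magnitude criterion and all function names here are assumptions; the exact scoring rule is Eq. (5) in the paper.

```python
import torch
import torch.nn.functional as F

def select_next_index(model, x, mask, composite, target_class):
    """Greedy masking step (sketch): score each feature by the gradient of the
    target-class loss w.r.t. its mask entry and pick the highest-scoring
    feature that has not been selected yet."""
    mask = mask.clone().requires_grad_(True)
    x_tilde = (1 - mask) * x + mask * composite       # current perturbed data
    logits = model(x_tilde.unsqueeze(0))              # add a batch dimension
    loss = F.cross_entropy(logits, torch.tensor([target_class]))
    loss.backward()
    score = mask.grad.abs()                           # influence of each feature
    score[mask.detach() > 0] = float('-inf')          # exclude already-selected features
    return int(torch.argmax(score))                   # index into the flattened mask
```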
2.3 Composite step
After selecting the input feature to be modified, the composition step optimizes the feature value so that the deep network classifies the perturbed data $\tilde{X}$ as the target class. A naive objective is

$$\min_{C}\ \mathcal{L}\big(f(\tilde{X}),\ t\big),$$

where the mask $M$ is now fixed and only the composite $C$ is updated.
However, this objective function causes an adversarial attack, such as the failure images in Fig. 3. We then compared the logit scores (before the softmax layer) of each failure case with those of training images that are classified as the target class, and the two turned out to lie in clearly different regions of the logit space.

Thus, we regard the failure cases as the result of an inappropriate objective function that maps the perturbed data onto a different logit space from the training data. To solve this problem, we instead force the logit of $\tilde{X}$ to follow the logit statistics of training data that are classified as the target class (Eq. (6)):

$$\min_{C}\ \Big\| f^{\text{logit}}(\tilde{X}) - \frac{\hat{p}}{K}\sum_{k=1}^{K} f^{\text{logit}}\big(X^{t}_{k}\big) \Big\|_{2}^{2},$$

where $f^{\text{logit}}(\cdot)$ denotes the pre-softmax output of the network, $X^{t}_{1}, \dots, X^{t}_{K}$ are $K$ training samples classified as the target class $t$, and $\hat{p}$ scales the target logit according to the desired classification score. As a result, Eq. (6) makes the composite $C$ produce perturbed data whose logits resemble those of real training data for the target class, rather than drifting into an adversarial region.
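A sketch of the composition update is shown below. It assumes a vector `target_logit_mean` has been precomputed from training samples of the target class (e.g., the scaled mean logit in Eq. (6)); the optimizer choice, step count, and learning rate are arbitrary illustrative values.

```python
import torch

def composition_step(model, x, mask, composite, target_logit_mean,
                     steps=100, lr=0.1):
    """Composition step (sketch): with the mask fixed, optimize the composite C
    so that the logits of the perturbed data approach the target logit vector."""
    composite = composite.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([composite], lr=lr)
    for _ in range(steps):
        x_tilde = (1 - mask) * x + mask * composite
        logits = model(x_tilde.unsqueeze(0)).squeeze(0)
        loss = torch.sum((logits - target_logit_mean) ** 2)  # L2 distance in logit space
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return composite.detach()
```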
Overall, gradual construction iterates over the masking and composition steps until the classification probability of the target class reaches the hyperparameter $\hat{p}$, the desired classification score.
We present a pseudo-code in Algorithm 1.
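Putting the two steps together, a rough version of the overall loop built from the two helper sketches above (not a reproduction of the paper's Algorithm 1; the threshold `p_hat` and the iteration cap are assumptions) looks like this:

```python
import torch
import torch.nn.functional as F

def gradual_construction(model, x, target_class, target_logit_mean,
                         p_hat=0.9, max_iters=50):
    """Gradual construction (sketch): alternate masking and composition until the
    classification probability of the target class exceeds the threshold p_hat."""
    mask = torch.zeros_like(x)
    composite = x.clone()
    x_tilde = x.clone()
    for _ in range(max_iters):
        idx = select_next_index(model, x, mask, composite, target_class)
        mask.view(-1)[idx] = 1.0                      # allow one more feature to change
        composite = composition_step(model, x, mask, composite, target_logit_mean)
        x_tilde = (1 - mask) * x + mask * composite
        prob = F.softmax(model(x_tilde.unsqueeze(0)), dim=1)[0, target_class]
        if prob >= p_hat:                             # desired classification score reached
            break
    return x_tilde.detach(), mask
```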

3. Experiment


Reference
Kang, Sin-Han, et al. "Counterfactual Explanation Based on Gradual Construction for Deep Networks." arXiv preprint arXiv:2008.01897 (2020).
GitHub code: Counterfactual Explanation Based on Gradual Construction for Deep Networks