1. What is the goal of GradCAM?
The goal of GradCAM is to produce a coarse localization map highlighting the important regions in the image for predicting the concept (class).
GradCAM uses the gradients of any target concept (such as "cat") flowing into the final convolutional layer.
Note: I (da2so) will only cover image classification in what follows.
Property of the feature map \( A^k \) from the last convolutional layer: we expect the last convolutional layer to offer the best compromise between high-level semantics and detailed spatial information.
The neuron importance weights are obtained as \( w^{c}_k=\frac{1}{z}\sum_i\sum_j\frac{\partial y^c}{\partial A^{k}_V} \), where \( V \) denotes the spatial position \((i, j)\), \(z\) is the number of spatial positions in the feature map, and \(y^c\) is the score for class \(c\).
This weight represents a partial linearization of the deep network downstream from \( A \), and captures the 'importance' of feature map \(k\) for a target class \(c\).
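To make the averaging step concrete, here is a minimal sketch in plain Python (nested lists, no array library). It assumes the gradients \( \frac{\partial y^c}{\partial A^{k}_V} \) have already been computed by backpropagation; the function name `neuron_importance_weights` and the gradient values are hypothetical and chosen only for illustration.

```python
def neuron_importance_weights(grads):
    """Global-average-pool the gradients dy^c/dA^k over all spatial positions.

    grads: list of K gradient maps, each an H x W nested list (assumed given).
    Returns a list of K weights w^c_k.
    """
    weights = []
    for grad_k in grads:
        z = sum(len(row) for row in grad_k)      # number of spatial positions
        total = sum(sum(row) for row in grad_k)  # sum over i and j
        weights.append(total / z)
    return weights


# Hypothetical gradients for K = 2 feature maps of size 2 x 2.
grads = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[-2.0, 0.0], [0.0, -2.0]],
]
weights = neuron_importance_weights(grads)
print(weights)  # [4.0, -1.0]
```

In this toy case the first feature map gets a positive weight and the second a negative one, so only the first would push the score of class \(c\) up.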
Then, we perform a weighted combination of forward activation maps and follow it with a ReLU.
\[\text{Grad-CAM} \quad L^c=\text{ReLU}\left(\sum_k w^c_k A^k\right) \quad \quad \cdots \text{Eq. (1)}\]
The reason for applying ReLU is that we are only interested in the features that have a positive influence.
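As an illustration of Eq. (1), the sketch below combines feature maps with their weights and clips negative values to zero. It is a plain-Python sketch under stated assumptions: the function name `grad_cam_map`, the weights, and the activations are all hypothetical values, not taken from the paper.

```python
def grad_cam_map(weights, feature_maps):
    """Compute ReLU(sum_k w_k * A^k) position by position.

    weights: list of K floats (w^c_k).
    feature_maps: list of K activation maps, each an H x W nested list.
    """
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for w_k, a_k in zip(weights, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w_k * a_k[i][j]
    # ReLU: keep only positions with a positive influence on the class.
    return [[max(0.0, value) for value in row] for row in cam]


weights = [4.0, -1.0]  # hypothetical w^c_k
feature_maps = [       # hypothetical activations A^k, K = 2, size 2 x 2
    [[0.5, 0.0], [0.0, 1.0]],
    [[1.0, 2.0], [0.0, 0.0]],
]
cam = grad_cam_map(weights, feature_maps)
print(cam)  # [[1.0, 0.0], [0.0, 4.0]]
```

The top-right position has a weighted sum of -2.0 before the ReLU, so it is set to zero; without the ReLU it would show up as evidence against the class rather than for it.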
In summary, the GradCAM procedure is as follows.
- Input
  - Image: \(x\)
  - Pre-trained model: \(f\)
  - Feature extractor (CNN): \(f_e\)
  - Classification layer (fc layer): \(f_l\)
  - Category (target class): \(c\)
- \(A \leftarrow f_e(x)\)
- \(y^c \leftarrow f_l(A)\)
- \(w^c_k \leftarrow \frac{1}{z}\sum_i\sum_j\frac{\partial y^c}{\partial A^{k}_V}\)
- \(L^c \leftarrow \text{ReLU}(\sum_k w^c_k A^k)\)
- Output
  - Grad-CAM: \(L^c\)
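The whole procedure can be sketched end to end on a toy example. Everything here is an assumption made for illustration: the feature maps `A` stand in for the output of \(f_e(x)\), the classification layer `f_l` is a hypothetical global-average-pooling plus linear layer, and the gradients are estimated with central finite differences instead of backpropagation so that the sketch needs only the Python standard library. A real implementation would use the autograd of a deep learning framework.

```python
import copy


def f_l(feature_maps, fc_weights):
    """Toy classification layer (assumed): global average pooling + linear layer."""
    pooled = []
    for a_k in feature_maps:
        z = sum(len(row) for row in a_k)
        pooled.append(sum(sum(row) for row in a_k) / z)
    return [sum(w * p for w, p in zip(row, pooled)) for row in fc_weights]


def grad_cam(feature_maps, fc_weights, c, eps=1e-4):
    """Return (w^c, L^c); finite differences stand in for backprop here."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    z = height * width
    weights = []
    for k in range(len(feature_maps)):
        grad_sum = 0.0
        for i in range(height):
            for j in range(width):
                plus = copy.deepcopy(feature_maps)
                minus = copy.deepcopy(feature_maps)
                plus[k][i][j] += eps
                minus[k][i][j] -= eps
                # Central difference estimate of dy^c / dA^k_{ij}.
                diff = f_l(plus, fc_weights)[c] - f_l(minus, fc_weights)[c]
                grad_sum += diff / (2 * eps)
        weights.append(grad_sum / z)  # w^c_k: spatial average of the gradients
    cam = [
        [max(0.0, sum(w * a[i][j] for w, a in zip(weights, feature_maps)))
         for j in range(width)]
        for i in range(height)
    ]
    return weights, cam


# Hypothetical A = f_e(x): K = 2 feature maps of size 2 x 2.
A = [[[0.5, 0.0], [0.0, 1.0]], [[1.0, 2.0], [0.0, 0.0]]]
# Hypothetical fc weights: 2 classes x 2 feature maps.
fc = [[2.0, -4.0], [-1.0, 3.0]]
weights, cam = grad_cam(A, fc, c=0)
print(weights)
print(cam)
```

For this particular toy classifier the gradient of \(y^c\) with respect to every position of \(A^k\) is the fc weight divided by \(z\), so the estimated \(w^c_k\) should come out close to `fc[c][k] / z`, which gives a quick sanity check on the sketch.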
Evaluating Trust
Given two prediction explanations, the authors ask human raters which model seems more trustworthy, using either Guided Backpropagation or Guided Grad-CAM visualizations. The experiments use AlexNet and VGG-16, noting that VGG-16 is more accurate than AlexNet, with 79.09 mAP (vs. 69.20 mAP) on PASCAL classification. Trust scores are collected from 54 human raters. With Guided Backpropagation, raters assign VGG-16 an average score of 1.00, meaning it is judged more trustworthy than AlexNet, while Guided Grad-CAM achieves a higher score of 1.27, which is closer to a judgment that VGG-16 is clearly more reliable.
Reference
Selvaraju, Ramprasaath R., et al. "Grad-CAM: Visual explanations from deep networks via gradient-based localization." Proceedings of the IEEE International Conference on Computer Vision. 2017.
Github Code: Grad-CAM