1. Data-free knowledge distillation
Knowledge distillation: training a smaller model (student) from a high-capacity source model (teacher) so that the student retains most of the teacher's performance.
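For context, conventional knowledge distillation minimizes the KL divergence between temperature-softened teacher and student outputs. Below is a minimal PyTorch sketch of that standard loss, not code from the paper; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            tau: float = 4.0) -> torch.Tensor:
    """Standard distillation loss: KL divergence between temperature-softened
    teacher and student predictions."""
    log_s = F.log_softmax(student_logits / tau, dim=1)
    t = F.softmax(teacher_logits / tau, dim=1)
    # tau**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_s, t, reduction="batchmean") * tau ** 2
```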
As the name suggests, data-free knowledge distillation is performed when the original dataset on which the teacher network was trained is unavailable. In the real world, many datasets are proprietary and are not shared publicly due to privacy or confidentiality concerns.
In order to perform data-free knowledge distillation, it is necessary to reconstruct a dataset for training the student network. Thus, in Zero-Shot Knowledge Transfer via Adversarial Belief Matching, we train an adversarial generator to search for images on which the student poorly matches the teacher, and then use them to train the student.
2. Zero-shot knowledge transfer
2.1 Problem definition
- Teacher network: $T$, taking an input image $\boldsymbol{x}$
- Probability vector of the teacher network: $t(\boldsymbol{x})$
- Student network: $S$, with weights $\theta$
- Probability vector of the student network: $s(\boldsymbol{x}; \theta)$
- Generator: $G$, with weights $\phi$
- Pseudo data: $\boldsymbol{x}_p = G(\boldsymbol{z}; \phi)$, generated from a noise vector $\boldsymbol{z} \sim \mathcal{N}(0, I)$
2.2 Method
The goal is to produce pseudo data with the generator and use them to train the student network by knowledge distillation. Our zero-shot training algorithm is described in Algorithm 1. For each iteration, the shared objective is

$$D_{KL}\big(T(\boldsymbol{x}_p) \,\|\, S(\boldsymbol{x}_p)\big) = \sum_i t_i^{(p)} \log \frac{t_i^{(p)}}{s_i^{(p)}},$$

the Kullback-Leibler (KL) divergence between the outputs of the teacher and student networks on the pseudo data ($i$ indexes image classes):

- If updating the generator, maximize $D_{KL}$: the generator searches for pseudo images on which the student poorly matches the teacher.
- If updating the student, minimize $D_{KL}$: the student learns to match the teacher on those images.

We then take $n_S$ student steps for every $n_G$ generator steps, with $n_S > n_G$, which gives the student time to match the teacher on each batch of pseudo data. A minimal sketch of this loop is given below.
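The following is a minimal PyTorch-style sketch of one iteration of this adversarial loop, not the authors' implementation; `teacher`, `student`, `generator`, the optimizers, and the hyperparameter values (`z_dim`, `batch_size`, `n_g`, `n_s`) are placeholder assumptions. The teacher is assumed pretrained and frozen.

```python
import torch
import torch.nn.functional as F

def kl_ts(teacher, student, x_p):
    """D_KL(T(x_p) || S(x_p)); kept differentiable w.r.t. x_p so that
    gradients can flow back into the generator."""
    t = F.softmax(teacher(x_p), dim=1)
    log_s = F.log_softmax(student(x_p), dim=1)
    return F.kl_div(log_s, t, reduction="batchmean")

def zero_shot_iteration(teacher, student, generator, opt_g, opt_s,
                        z_dim=100, batch_size=128, n_g=1, n_s=10,
                        device="cpu"):
    # One noise batch is reused for all generator and student steps.
    z = torch.randn(batch_size, z_dim, device=device)

    # Generator steps: MAXIMIZE the KL divergence, i.e. search for pseudo
    # images on which the student poorly matches the teacher.
    for _ in range(n_g):
        opt_g.zero_grad()
        (-kl_ts(teacher, student, generator(z))).backward()
        opt_g.step()

    # Student steps: MINIMIZE the KL divergence on the resulting pseudo data.
    x_p = generator(z).detach()
    for _ in range(n_s):
        opt_s.zero_grad()
        kl_ts(teacher, student, x_p).backward()
        opt_s.step()
```

Taking more student steps than generator steps ($n_S > n_G$) prevents the generator from outpacing the student.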
2.3 Extra loss functions
High student entropy is a vital component of our method, since it makes it hard for the generator to fool the student easily. In addition, since many student-teacher pairs have similar block structures, we can add an attention term to the student loss as follows:

$$\mathcal{L}_S = D_{KL}\big(T(\boldsymbol{x}_p) \,\|\, S(\boldsymbol{x}_p)\big) + \beta \sum_{l}^{N_L} \left\| \frac{f(A_l^{(t)})}{\big\|f(A_l^{(t)})\big\|_2} - \frac{f(A_l^{(s)})}{\big\|f(A_l^{(s)})\big\|_2} \right\|_2$$

- Hyperparameter: $\beta$
- Total number of layers: $N_L$
- Teacher and student activation blocks at layer $l$: $A_l^{(t)}$ and $A_l^{(s)}$
- Total number of channels in the $l$-th layer: $N_{A_l}$
- Spatial attention map: $f(A_l) = \frac{1}{N_{A_l}} \sum_{c=1}^{N_{A_l}} a_{lc}^2$, where $a_{lc}$ is channel $c$ of activation block $A_l$
- We take the sum over some subset of the $N_L$ layers.

The second term encourages the spatial attention maps of the teacher and student networks to be similar. We do not add the attention term to the generator loss because it would make it too easy to fool the student. The training procedure is described in Fig. 1. A sketch of the attention term follows.
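Below is a sketch of the attention term in PyTorch, following the normalized spatial attention maps the paper builds on; the `beta` value and the assumption that activations arrive as lists of $(B, C, H, W)$ tensors are illustrative, not taken from the authors' code.

```python
import torch

def spatial_attention(a: torch.Tensor) -> torch.Tensor:
    """f(A_l): average the squared activations over channels, flatten the
    spatial map, and L2-normalize it per sample. a has shape (B, C, H, W)."""
    f = a.pow(2).mean(dim=1).flatten(start_dim=1)      # (B, H*W)
    return f / (f.norm(dim=1, keepdim=True) + 1e-8)

def attention_term(acts_t, acts_s, beta=250.0):
    """beta * sum over the chosen layers of the L2 distance between the
    normalized teacher and student attention maps."""
    total = 0.0
    for a_t, a_s in zip(acts_t, acts_s):
        diff = spatial_attention(a_t) - spatial_attention(a_s)
        total = total + diff.norm(dim=1).mean()
    return beta * total
```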

2.4 Toy experiment
The dynamics of our algorithm are illustrated in Fig. 2, where we use two-layer MLPs for both the teacher and the student, and learn the pseudo points directly. These are initialized away from the real data manifold.
During training, pseudo points can be seen to explore the input space, typically running along decision boundaries where the student is most likely to match the teacher poorly. At the same time, the student is trained to match the teacher on the pseudo points, and so they must keep changing locations. When the decision boundaries between student and teacher are well aligned, some pseudo points will naturally depart from them and search for new high teacher mismatch regions, which allows disconnected decision boundaries to be explored as well.
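A toy sketch of this setup under stated assumptions: two small MLPs on 2-D inputs, pseudo points optimized directly as free parameters, and a teacher assumed to be already trained on the toy data (its training is omitted). All sizes and learning rates are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_mlp():
    # two-layer MLP classifier on 2-D inputs with 3 toy classes
    return nn.Sequential(nn.Linear(2, 50), nn.ReLU(), nn.Linear(50, 3))

teacher, student = make_mlp(), make_mlp()   # teacher assumed pretrained
for p in teacher.parameters():
    p.requires_grad_(False)                 # freeze teacher weights

# Pseudo points are free parameters, initialized away from the data manifold.
x_p = nn.Parameter(torch.randn(64, 2) * 5.0)
opt_p = torch.optim.Adam([x_p], lr=1e-2)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)

def kl_ts(x):
    return F.kl_div(F.log_softmax(student(x), dim=1),
                    F.softmax(teacher(x), dim=1), reduction="batchmean")

for step in range(1000):
    # pseudo points climb the KL, seeking teacher/student mismatch regions
    opt_p.zero_grad()
    (-kl_ts(x_p)).backward()
    opt_p.step()
    # the student then matches the teacher on the current pseudo points
    opt_s.zero_grad()
    kl_ts(x_p.detach()).backward()
    opt_s.step()
```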

3. Experiments & Results
For each experiment, we run three seeds and report the mean with one standard deviation. The experiment setting is described in Fig. 3 (a).
3.1 CIFAR-10 and SVHN
We focus our experiments on two common datasets, SVHN and CIFAR-10. For both datasets, we use the WideResNet (WRN) architecture. Our distillation results are shown in Fig. 3 (b). As a comparison, we also include the few-shot performance of our method, obtained by naively finetuning the zero-shot model with $M$ images per class.
3.2 Architecture dependence
We observe that some teacher-student pairs tend to work better than others, as is also the case in few-shot distillation. The comparison results are shown in Fig. 3 (c). In the zero-shot setting, deeper students with more parameters do not necessarily help: the WRN-40-2 teacher distills 3.1% better to WRN-16-2 than to WRN-40-1, even though WRN-16-2 has less than half as many layers as WRN-40-1 and a similar parameter count.
3.3 Nature of the pseudo data
Samples from the generator during training are shown in Fig. 3 (d). We notice that early in training the samples look like coarse textures and are reasonably diverse. After about 10% of the training run, most images produced by the generator look like high-frequency patterns that carry little meaning to humans.

3.4 Measuring belief match near decision boundaries
We would like to verify that the student is implicitly trained to match the teacher's predictions close to decision boundaries. For this, in Algorithm 2, we propose a way to probe the difference between the beliefs of two networks A and B near the decision boundaries of network A:

1. Sample a real image $\boldsymbol{x}$ from the test set such that networks A and B both give the same class prediction $i$.
2. For each class $j \neq i$, update $\boldsymbol{x}$ by taking $K$ adversarial steps on network A, with learning rate $\xi$, so as to move from class $i$ to class $j$.
3. The probability $p_i^A$ of $\boldsymbol{x}$ belonging to class $i$ according to network A quickly reduces, with a concurrent increase in $p_j^A$.
4. During step 3, we also record $p_i^B$ and $p_j^B$, and compare $p_j^A$ with $p_j^B$.

Consequently, we are asking the following question: as we perturb $\boldsymbol{x}$ to move from class $i$ to class $j$ according to network A, to what degree do we also move from class $i$ to class $j$ according to network B?

We refer to the resulting curves of $p_j$ during the updates as transition curves, which are shown in Fig. 4 (a). The result is particularly surprising because, while updating images to move from class $i$ to class $j$ on one network, the zero-shot student's beliefs track the teacher's more closely than those of a student distilled on real data, even though the zero-shot student has never seen a real image.

We can more explicitly quantify the belief match between networks A and B with the Mean Transition Error (MTE): the absolute difference between $p_j^A$ and $p_j^B$, averaged over the $K$ adversarial steps, the $N_{test}$ test images, and the $C - 1$ target classes:

$$\mathrm{MTE}(\mathrm{net}_A, \mathrm{net}_B) = \frac{1}{N_{test}} \sum_{n=1}^{N_{test}} \frac{1}{C-1} \sum_{j \neq i} \frac{1}{K} \sum_{k=1}^{K} \left| p_j^A - p_j^B \right|$$

The mean transition errors are reported in Fig. 4 (b). A sketch of the probing procedure is given below.
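A minimal PyTorch sketch of this probe under stated assumptions: a single image of batch size 1, targeted cross-entropy descent toward class $j$, and illustrative values for $K$ (`k_steps`) and $\xi$ (`xi`); it mirrors Algorithm 2 but is not the authors' code.

```python
import torch
import torch.nn.functional as F

def transition_curves(net_a, net_b, x, j, k_steps=100, xi=1.0):
    """Take K adversarial steps on net_a pushing a single image x (shape
    (1, ...)) toward class j, recording p_j under both networks."""
    x_adv = x.clone().requires_grad_(True)
    target = torch.tensor([j])
    p_a, p_b = [], []
    for _ in range(k_steps):
        loss = F.cross_entropy(net_a(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv -= xi * grad   # descend the loss, i.e. raise p_j under net_a
            p_a.append(F.softmax(net_a(x_adv), dim=1)[0, j].item())
            p_b.append(F.softmax(net_b(x_adv), dim=1)[0, j].item())
    return torch.tensor(p_a), torch.tensor(p_b)

def mean_transition_error(p_a, p_b):
    """|p_j^A - p_j^B| averaged over the K steps; averaging over test images
    and the C - 1 target classes is done outside this function."""
    return (p_a - p_b).abs().mean()
```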

Reference
Micaelli, Paul, and Amos J. Storkey. "Zero-shot knowledge transfer via adversarial belief matching." Advances in Neural Information Processing Systems 32 (2019).