1. What is Data-Free Knowledge Distillation?
Knowledge distillation: the problem of training a smaller model (Student) from a high-capacity source model (Teacher) so that the Student retains most of the Teacher's performance.
As the name suggests, data-free knowledge distillation is performed when we have no access to the original dataset on which the Teacher network was trained. This situation is common in practice, since many real-world datasets are proprietary and are not shared publicly due to privacy or confidentiality concerns.
To perform data-free knowledge distillation, it is necessary to reconstruct a dataset for training the Student network. To this end, the paper proposes "Zero-Shot Knowledge Distillation" (ZSKD), which synthesizes pseudo data from the Teacher model alone; this synthesized data acts as the transfer set for distillation, without using any meta-data.

2. Method
2.1 Knowledge Distillation
Transferring the generalization ability of a large, complex Teacher network to a much smaller Student network is the goal of knowledge distillation.
Let $T$ be the Teacher network with parameters $\theta_T$ and $S$ be the Student network with parameters $\theta_S$. $T(x, \theta_T, \tau)$ and $S(x, \theta_S, \tau)$ denote their softmax outputs for an input $x$ at temperature $\tau$.
Knowledge distillation methods train the Student by minimizing the following objective ($L$):

$$L = \sum_{(x, y) \in \mathcal{D}} L_{KD}\big(S(x, \theta_S, \tau),\; T(x, \theta_T, \tau)\big) + \lambda\, L_{CE}(\hat{y}_S, y)$$

where $L_{CE}$ is the cross-entropy loss between the Student's predictions $\hat{y}_S$ and the ground-truth labels $y$, $L_{KD}$ is the distillation loss (e.g., cross-entropy between the softened output distributions of the Teacher and the Student at temperature $\tau$), and $\lambda$ balances the two terms.
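Below is a minimal PyTorch-style sketch of this objective. The function name `kd_loss` and the default values of `tau` and `lam` are illustrative assumptions, not settings from the paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, lam=1.0):
    """Distillation term L_KD plus weighted cross-entropy L_CE, as in the objective above."""
    # Soften Teacher and Student outputs with temperature tau
    soft_targets = F.softmax(teacher_logits.detach() / tau, dim=1)
    log_probs_s = F.log_softmax(student_logits / tau, dim=1)
    # L_KD: cross-entropy between softened Teacher and Student distributions
    l_kd = -(soft_targets * log_probs_s).sum(dim=1).mean()
    # L_CE: standard cross-entropy on the ground-truth hard labels
    l_ce = F.cross_entropy(student_logits, labels)
    return l_kd + lam * l_ce
```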
2.2 Modelling the Data in Softmax Space
In Zero-Shot Knowledge Distillation in Deep Networks, we deal with the scenario where we have no access to (i) any training data samples (whether from the target distribution or a different one), or (ii) any meta-data extracted from them.
To tackle this, the approach taps the learned parameters of the Teacher and synthesizes input representations, named Data Impressions (DIs), from the underlying data distribution on which it was trained. These serve as a transfer set for distilling knowledge into a Student model.
In order to craft the Data Impressions, we model the output (softmax) space of the Teacher model. Since a softmax output is a vector of non-negative components that sum to one, it can be modelled with a Dirichlet distribution.
The distribution to represent the softmax output of class $k$ is modelled as

$$s^k \sim Dir(K, \alpha^k)$$

where $K$ is the number of classes (the dimension of the output probability vector) and $\alpha^k = [\alpha_1^k, \alpha_2^k, \dots, \alpha_K^k]$ is the concentration parameter of the Dirichlet distribution for class $k$.
2.2.1 Concentration Parameter (α)
The concentration parameter $\alpha = [\alpha_1, \alpha_2, \dots, \alpha_K]$ controls how the Dirichlet density is spread over the simplex of probability vectors:
| If all $\alpha_k < 1$, the density concentrates at the corners of the simplex, i.e., the sampled softmax vectors are peaky, with one class dominating.
| If all $\alpha_k > 1$, the density concentrates towards the centre of the simplex, i.e., the sampled softmax vectors assign similar values to all classes.

So, it is important to determine the right $\alpha$: the sampled softmax vectors should reflect the class similarities captured by the Teacher, since related classes should share probability mass.
Since we have no access to the training data, we resort to the Teacher network for extracting this information. We compute a normalized class similarity matrix ($C$) from the weights of the Teacher's final layer and use its rows as the concentration parameters.

2.2.2 Class Similarity Matrix (C)
The weights $W = [w_1, w_2, \dots, w_K]$ connecting the pre-final layer to the softmax layer can be viewed as templates of the $K$ classes, because the pre-softmax logit for class $k$ is the dot product $w_k^{\top} f(x)$ between $w_k$ and the pre-final layer output $f(x)$:
| If the pre-final layer's output is a positive scaled version of (i.e., well aligned with) the template $w_k$, class $k$ receives a high confidence.
| If the pre-final layer's output is misaligned with the template $w_k$, class $k$ receives a low confidence.

Therefore, we treat the weights $w_k$ as the template of class $k$ and measure the similarity between classes $i$ and $j$ as the cosine similarity between their templates:

$$C(i, j) = \frac{w_i^{\top} w_j}{\lVert w_i \rVert \, \lVert w_j \rVert}$$
Since the elements of the concentration parameter have to be positive real numbers, we perform a min-max normalization over each row of the class similarity matrix.
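A short PyTorch sketch of this computation follows; it is a sketch under my own assumptions (the function name is hypothetical, and the final-layer weight is assumed to have shape [num_classes, feature_dim]).

```python
import torch
import torch.nn.functional as F

def class_similarity_matrix(final_layer_weight):
    """Build the normalized class similarity matrix C from the weights
    connecting the pre-final layer to the softmax layer."""
    W = F.normalize(final_layer_weight, dim=1)   # unit-norm class templates w_k
    C = W @ W.t()                                # cosine similarities in [-1, 1]
    # Row-wise min-max normalization so every entry is non-negative;
    # a tiny epsilon could be added if strictly positive values are required
    # for Dirichlet sampling (an implementation detail, not from the paper).
    C_min = C.min(dim=1, keepdim=True).values
    C_max = C.max(dim=1, keepdim=True).values
    return (C - C_min) / (C_max - C_min)
```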
2.3 Crafting Data Impressions via Dirichlet Sampling
Once the parameters $\alpha^k$ of the class-specific Dirichlet distributions are obtained (the $k$-th row of the class similarity matrix $C$ serves as $\alpha^k$), we sample $N$ softmax vectors for each class $k$: $Y^k = [y_1^k, y_2^k, \dots, y_N^k] \sim Dir(K, \alpha^k)$.
We initialize each Data Impression $\bar{x}_i^k$ as random noise and optimize it, keeping the Teacher's parameters fixed, so that the Teacher's softmax output matches the corresponding sampled vector:

$$\bar{x}_i^k = \underset{x}{\arg\min}\; L_{CE}\big(y_i^k,\, T(x, \theta_T, \tau)\big)$$

i.e., we minimize the cross-entropy between the sampled softmax vector $y_i^k$ and the Teacher's output by updating only the input $x$.
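A hedged PyTorch-style sketch of this optimization is given below (input shape, optimizer, step count, and learning rate are my assumptions; the official GitHub code linked in the references is the authoritative implementation).

```python
import torch

def craft_data_impression(teacher, target_softmax, input_shape=(1, 1, 32, 32),
                          steps=1500, lr=0.01, tau=1.0):
    """Optimize a random-noise input so that the Teacher's softmax output
    matches one sampled Dirichlet target vector (shape [1, num_classes])."""
    teacher.eval()
    x = torch.randn(input_shape, requires_grad=True)   # random-noise initialization
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        log_probs = torch.log_softmax(teacher(x) / tau, dim=1)
        # Cross-entropy between the sampled target y_i^k and the Teacher's output
        loss = -(target_softmax * log_probs).sum(dim=1).mean()
        loss.backward()
        optimizer.step()
    return x.detach()
```

The target vector itself can be drawn as, e.g., `torch.distributions.Dirichlet(beta * alpha_k).sample((1,))`, where `alpha_k` is the k-th row of the class similarity matrix and `beta` is the scaling factor described next.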
2.3.1 Scaling Factor (β)
The probability density function of the Dirichlet distribution for a $K$-dimensional probability vector $s$ with concentration parameter $\alpha$ is

$$p(s \mid \alpha) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} s_k^{\alpha_k - 1}$$

where $B(\alpha)$ is the normalizing (multivariate Beta) function; the spread of this density over the simplex is governed by the magnitude of the $\alpha_k$ values.

Thus, we define a scaling factor $\beta$ and sample from $Dir(K, \beta \times \alpha^k)$, which controls how spread out the sampled softmax vectors are:
| A small value of $\beta$ (e.g., 0.1) concentrates the density at the corners of the simplex, producing peaky, nearly one-hot softmax vectors.
| A large value of $\beta$ (e.g., 1.0) spreads the density more evenly, producing softer softmax vectors that also assign mass to similar classes.

Data Impressions are generated with a mixture of both settings to increase the diversity of the transfer set.
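A tiny illustration of the effect of β (the values 0.1 and 1.0 follow the paper's setting; the concentration vector below is a made-up example):

```python
import torch

alpha_k = torch.tensor([1.00, 0.62, 0.40, 0.15])  # hypothetical row of C for class k
for beta in (0.1, 1.0):
    samples = torch.distributions.Dirichlet(beta * alpha_k).sample((3,))
    print(f"beta={beta}:\n{samples}")
# beta = 0.1 -> near one-hot, peaky softmax targets;
# beta = 1.0 -> softer targets that also put mass on classes similar to k.
```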
2.4 Zero-Shot Knowledge Distillation
We treat Data Impressions as the 'Transfer set' and perform knowledge distillation as follows.
We ignore the cross-entropy loss from the general distillation objective, since the Data Impressions come only with soft (Teacher) labels and no ground-truth hard labels are available, and train the Student with the distillation term alone:

$$\theta_S = \underset{\theta_S}{\arg\min} \sum_{\bar{x} \in \bar{X}} L_{KD}\big(T(\bar{x}, \theta_T, \tau),\, S(\bar{x}, \theta_S, \tau)\big)$$

where $\bar{X}$ denotes the set of Data Impressions generated from the Teacher.
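A hedged sketch of the resulting Student training loop over the Data Impressions (the data loader, optimizer, temperature, and epoch count are assumptions):

```python
import torch
import torch.nn.functional as F

def train_student_zskd(student, teacher, di_loader, epochs=100, tau=20.0, lr=0.01):
    """Distill the Teacher into the Student using only Data Impressions and
    only the distillation term (no cross-entropy on hard labels)."""
    teacher.eval()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for x_bar in di_loader:                                # batches of Data Impressions
            with torch.no_grad():
                t_soft = F.softmax(teacher(x_bar) / tau, dim=1)
            s_log_soft = F.log_softmax(student(x_bar) / tau, dim=1)
            loss = -(t_soft * s_log_soft).sum(dim=1).mean()    # L_KD only
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```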

3. Experiment Setting & Result

Reference
Nayak, Gaurav Kumar, et al. "Zero-Shot Knowledge Distillation in Deep Networks." International Conference on Machine Learning. 2019.
GitHub Code: Zero-Shot Knowledge Distillation in Deep Networks