1. Goal
The goal is to perform data-free knowledge distillation.
Knowledge distillation deals with the problem of training a smaller model (Student) from a high-capacity source model (Teacher) so as to retain most of its performance.
As the name suggests, we perform data-free knowledge distillation when the original dataset on which the Teacher network was trained is unavailable. In the real world, many datasets are proprietary and not shared publicly due to privacy or confidentiality concerns.
To tackle this problem, it is necessary to reconstruct a dataset for training the Student network. Thus, in this paper, the authors propose a data-free knowledge amalgamation strategy to craft a well-behaved multi-task student network from multiple single/multi-task teachers.
The main idea is to construct group-stack generative adversarial networks (GANs) with two dual generators. First, one generator is trained to collect the knowledge by reconstructing images approximating the original dataset. Then, the dual generator is trained by taking the output of the former generator as input. Finally, the dual-part generator is treated as the TargetNet (Student network) and regrouped.
The architecture of Dual-GAN is shown in Fig 0.

2. Problem Definition
In Data-Free Knowledge Amalgamation via Group-Stack Dual-GAN, the authors aim to explore a more effective approach to training the student network (TargetNet) using only the knowledge amalgamated from the pre-trained teachers. The TargetNet is designed to deal with multiple tasks, learning a customized multi-branch network that can recognize all labels selected from the separate teachers.
- The number of the customized categories: $T$
- Label vector: $\mathcal{Y}^{cus} = \{y_1, y_2, \dots, y_T\}$
- TargetNet: $\mathcal{T}$, handling multiple tasks on the customized label set $\mathcal{Y}^{cus}$
- Pre-trained teachers: $\{\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_M\}$
- For each teacher $\mathcal{A}_m$, a $T_m$-label classification task: $\mathcal{Y}_m = \{y^m_1, y^m_2, \dots, y^m_{T_m}\}$
- Feature maps in the $b$-th block of the $m$-th pre-trained teacher: $\mathcal{F}^b_m$

The teacher networks are in the constraint: $\mathcal{Y}^{cus} \subseteq \mathcal{Y}_1 \cup \mathcal{Y}_2 \cup \cdots \cup \mathcal{Y}_M$
3. Method
The process of obtaining the well-behaved TargetNet with the proposed data-free framework contains three steps.
- The generator $G$ is trained with knowledge amalgamation in an adversarial way, so that images in the same distribution as the original dataset can be manufactured: $x = G(z)$, where $z$ is the random noise and $x$ denotes the generated image.
- The dual generator $G^{dual}$ is trained with generated samples from $G$ in a block-wise way to produce multiple predicted labels: $y = G^{dual}(x)$, where $y$ denotes the predicted labels.
- After training the whole dual-GAN, the dual generator is modified into the TargetNet for classifying the customized label set $\mathcal{Y}^{cus}$.
The overall training procedure of group-stack GAN is depicted in Fig 1.
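The three-step procedure can be sketched as follows. This is a toy numpy illustration; `generator`, `dual_generator`, and all shapes are stand-ins of my own, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the networks; the real ones are deep CNNs.
def generator(z):            # step 1: noise z -> synthesized image x
    return np.tanh(z @ rng.standard_normal((8, 16)))

def dual_generator(x):       # step 2: image x -> predicted label scores y
    return x @ rng.standard_normal((16, 5))

# Step 1: G is trained adversarially against the fixed teachers.
z = rng.standard_normal((4, 8))     # a batch of random noise
x_fake = generator(z)               # images approximating the original data

# Step 2: the dual generator is trained on the generated samples.
logits = dual_generator(x_fake)     # multiple predicted labels

# Step 3: the dual generator's blocks are regrouped as the TargetNet.
target_net = dual_generator         # trivially the same function here
preds = target_net(x_fake)
print(preds.shape)                  # one score per customized label
```

The point of the sketch is only the data flow: the Student never sees real data, only samples produced by the first generator.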

3.1 Amalgamating GAN
First, we introduce the vanilla GAN. A GAN performs a minimax game between a generator $G$ and a discriminator $D$:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (1)$$

However, due to the absence of real data, training cannot be performed by Eq. (1). Therefore, several modifications are made, as follows.
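As a quick numerical check of the minimax value in Eq. (1), the objective can be evaluated for toy functions. The sigmoid discriminator and scaling generator below are illustrative choices of mine, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Toy discriminator: probability that a sample is real.
    return 1.0 / (1.0 + np.exp(-x.sum(axis=1)))

def G(z):
    # Toy generator: a simple deterministic transform of the noise.
    return z * 0.5

x_real = rng.standard_normal((64, 2))   # stands in for real data
z = rng.standard_normal((64, 2))        # noise batch

# Eq. (1): V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]
value = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z))))
print(value)   # D is trained to maximize this value, G to minimize it
```

Since both log-terms take logs of probabilities below 1, the value is always negative; the data-free setting removes the first expectation entirely, which is exactly why Eq. (1) cannot be used as-is.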
3.1.1 Group-stack GAN
The first modification is the group-stack architecture. The generator is designed to generate not only synthesized images but also the intermediate activations aligned with the teachers.
Thus, the generator is stacked into $B$ groups $\{G^1, G^2, \dots, G^B\}$, and the $b$-th group takes the previous group's output as input:

$$F^b = G^b(F^{b-1}), \quad F^0 = z,$$

when $b = B$, the final output $F^B$ is the synthesized image $x$. Since the architecture of the generator is grouped to align with the blocks of the teachers, each intermediate output $F^b$ is forced to approximate the corresponding teacher feature maps, so the generated activations can be fed directly into the remaining teacher blocks.
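A group-stack forward pass of this kind might look like the following sketch, with toy linear groups; the dimensions are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
B = 3  # number of stacked groups, matching the teachers' block structure

# Each group maps the previous group's output to the next feature space
# (toy linear groups; the dimensions are illustrative only).
dims = [8, 16, 32, 3 * 4 * 4]   # noise -> features -> features -> "image"
groups = [rng.standard_normal((dims[b], dims[b + 1])) for b in range(B)]

def group_stack_generate(z):
    feats = []                   # intermediate activations F^1 ... F^B
    f = z                        # F^0 = z
    for W in groups:
        f = np.tanh(f @ W)       # F^b = G^b(F^{b-1})
        feats.append(f)
    return feats                 # feats[-1] plays the role of the image x

z = rng.standard_normal((2, 8))
feats = group_stack_generate(z)
print([f.shape for f in feats])
```

The key property is that every intermediate `feats[b]`, not just the final image, is exposed for supervision by the corresponding teacher block.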
During the training for the group pair $(G^b, D^b)$, the discriminator $D^b$ is not learned from scratch: the teacher blocks deeper than the $b$-th one, together with the teacher's classifier, serve as a fixed discriminator. The output for the generated features $F^b$ is the prediction

$$y^b = D^b(F^b),$$

where $y^b$ is supervised by the data-free objectives: a one-hot loss $\mathcal{L}_{oh}$ (cross-entropy between the prediction and its own argmax pseudo label), an activation loss $\mathcal{L}_{a}$ (real-looking inputs yield high-norm teacher activations), and an information-entropy loss $\mathcal{L}_{ie}$ (predictions within a batch should cover all classes evenly).
In addition, the outputs need to be sparse, since a real-world image cannot be tagged with dense labels describing many different situations at once. So, an extra discrete loss function is proposed:

$$\mathcal{L}_{1} = \|y^b\|_1,$$

which is known as the L1-norm loss. Finally, combining all the losses, the final objective function is obtained:

$$\mathcal{L} = \mathcal{L}_{oh} + \alpha \mathcal{L}_{a} + \beta \mathcal{L}_{ie} + \gamma \mathcal{L}_{1},$$

where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters balancing the terms.
3.1.2 Multiple Targets
Since the TargetNet is customized to perform multi-label classification learned from multiple teachers, the generator should generate samples containing multiple targets. As a result, for the $m$-th teacher, only part of the generated features is relevant. In order to amalgamate multi-knowledge into the generator, a teacher-level filtering is applied:

$$f^b_m = \mathcal{R}_m(F^b), \quad (9)$$

where the filtering function $\mathcal{R}_m(\cdot)$ selects the slice of the generated features aligned with the $m$-th teacher. The filtered features $f^b_m$ are treated as new input to the loss of Eq. (8), so that every teacher supervises its own stream. Moreover, since the generated features should be indistinguishable from those extracted from real data, they should also lead to the same predictions from the same input, which is enforced by an adversarial loss $\mathcal{L}_{adv}$. By minimizing all these losses together, the generator learns to synthesize samples that carry the knowledge of all the teachers simultaneously.
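The teacher-level filtering $\mathcal{R}_m(\cdot)$ can be pictured as slicing the shared generated features into per-teacher channel streams. The split sizes below are made up for illustration:

```python
import numpy as np

# Suppose the generated feature channels are laid out teacher by teacher:
# channels [0:12) belong to teacher 1, [12:32) to teacher 2, and so on.
channel_splits = [12, 20]            # channels per teacher (illustrative)

def teacher_filter(features, m):
    """R_m(.): keep only the channel slice aligned with the m-th teacher."""
    start = sum(channel_splits[:m])
    return features[:, start:start + channel_splits[m]]

rng = np.random.default_rng(0)
F = rng.standard_normal((4, sum(channel_splits)))  # shared generated features
f1 = teacher_filter(F, 0)   # stream scored by teacher 1's discriminator
f2 = teacher_filter(F, 1)   # stream scored by teacher 2's discriminator
print(f1.shape, f2.shape)
```

The design choice this illustrates: one shared sample feeds several teacher-specific loss streams, so the generator is pushed to embed multiple targets in every sample.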
3.2 Dual-generator Training
After training the group-stack GAN, a set of generated samples is obtained, consisting of the synthesized images together with their intermediate activations and amalgamated labels. The dual generator $G^{dual}$ then learns the inverse mapping, from images back to labels. We divide the dual generator into $B$ groups as well, and train it group by group on the generated samples.
Taking the training of one group of the dual generator as an example, two levels of filtering are conducted:

- Teacher-level filtering, for the multi-target demand:
  - Transforming the generated features to teacher streams, as defined in Eq. (9).
- Task-level filtering:
  - Conducted after the last few fully connected layers of the corresponding discriminator, which is established for the constraint of the customized label set.
The authors then feed the generated features into the dual generator for its block-wise training. So, the block-wise loss for updating each group of the dual generator measures the discrepancy between that group's output and the corresponding generated supervision, where the two-level filtering above determines which teacher stream each term belongs to. Then, according to the different inputs to the dual generator (the synthesized image or the intermediate activations), the corresponding group is updated in turn.
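A minimal sketch of this block-wise supervision is given below. The toy linear blocks, the plain L2 matching loss, and all shapes are assumptions of mine standing in for the paper's exact loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy blocks of the dual generator (image -> label direction).
blocks = [rng.standard_normal((16, 16)) for _ in range(3)]

def block_forward(x, b):
    """Run the dual generator up to and including its b-th block."""
    f = x
    for W in blocks[: b + 1]:
        f = np.tanh(f @ W)
    return f

def block_wise_loss(x_fake, target_feats, b):
    """Match the b-th block's output to its generated supervision
    (an L2 sketch of the block-wise loss)."""
    f = block_forward(x_fake, b)
    return np.mean((f - target_feats) ** 2)

x_fake = rng.standard_normal((8, 16))       # samples from the first generator
target_feats = np.tanh(x_fake @ blocks[0])  # supervision for block 0
loss0 = block_wise_loss(x_fake, target_feats, 0)   # perfectly matched block
loss1 = block_wise_loss(x_fake, target_feats, 1)   # deeper block, mismatched
print(loss0, loss1)
```

Because each group is supervised at its own depth, the dual generator can be trained block by block and later regrouped into the TargetNet.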
Reference
Ye, Jingwen, et al. "Data-free knowledge amalgamation via group-stack dual-gan." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.