A Comprehensive Overhaul of Feature Distillation (ICCV 2019)

Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, Jin Young Choi


Knowledge Distillation in a Nutshell

The general process of knowledge distillation

Knowledge distillation refers to a method in which a small model (the student) is trained to mimic a pre-trained large model (the teacher): the training data is passed through the teacher, and the student is trained under the teacher's supervision. During distillation, knowledge is transferred from the teacher to the student by minimizing a loss function whose target is the distribution of class probabilities predicted by the teacher. Unlike other network compression methods, knowledge distillation can downsize a network even when the teacher and student architectures differ. Because of this architectural flexibility, knowledge distillation is emerging as a next-generation approach to model compression.
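To make this process concrete, a generic training step might look like the following sketch (all names are illustrative placeholders, not code from this work; the specific distillation loss used here is described later):

```python
import torch

# A generic sketch of one knowledge-distillation training step (illustrative names only):
# the same batch is passed through the frozen teacher and the trainable student, and the
# student is updated to minimize a loss defined against the teacher's predictions.
def distillation_step(teacher, student, optimizer, images, labels, distill_criterion):
    teacher.eval()
    with torch.no_grad():
        teacher_out = teacher(images)      # supervision signal from the teacher
    student_out = student(images)
    loss = distill_criterion(student_out, teacher_out, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```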

Our Proposed Method

In this research, we propose a new feature distillation loss with improved performance. The loss function is designed by investigating the following design aspects: teacher transform, student transform, distillation feature position, and distance function. The key idea is to improve feature distillation by separating beneficial information from adverse information.

Concretely, we introduce a new margin ReLU, move the distillation feature position to just in front of the ReLU, and use a partial L2 distance function that skips the distillation of adverse information, which significantly improves the performance of feature distillation. In our experiments, we evaluate the proposed method in various domains, including classification (CIFAR, ImageNet), object detection (PASCAL VOC), and semantic segmentation (PASCAL VOC).

Last but not least, we analyze the design aspects of feature distillation methods aimed at network compression and propose an original feature distillation method whose loss is designed to create synergy among these aspects: teacher transform, student transform, distillation feature position, and distance function. Our new distillation loss consists of a feature transform with a newly designed margin ReLU, a new distillation feature position, and a partial L2 distance function that skips redundant information which would adversely affect the compression of the student.

The Superiority of Our Knowledge Distillation Method to Existing State-of-the-Art Methods

| Method | Teacher transform | Student transform | Distillation feature position | Distance | Missing information |
|---|---|---|---|---|---|
| FitNets [1] | None | 1×1 conv | Mid layer | L2 | None |
| AT [2] | Attention | Attention | End of group | L2 | Channel dims |
| FSP [3] | Correlation | Correlation | End of group | L2 | Spatial dims |
| Jacobian [4] | Gradient | Gradient | End of group | L2 | Channel dims |
| FT [5] | Auto-encoder | Auto-encoder | End of group | L1 | Auto-encoded |
| AB [6] | Binarization | 1×1 conv | Pre-ReLU | Marginal L2 | Feature values |
| Proposed | Margin ReLU | 1×1 conv | Pre-ReLU | Partial L2 | Negative features |

Note that most knowledge distillation methods apply a teacher transform and that, in most cases, this transform discards some of the teacher's information (the "Missing information" column above).


[1] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In International Conference on Learning Representations (ICLR), 2015.
[2] Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017.
[3] Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[4] Suraj Srinivas and François Fleuret. Knowledge transfer with Jacobian matching. In International Conference on Machine Learning (ICML), 2018.
[5] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems (NIPS), 2018.
[6] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In AAAI Conference on Artificial Intelligence (AAAI), 2019.


Let us start by comparing our proposed method with existing state-of-the-art knowledge distillation methods: KD, FitNets, AT, Jacobian, FT, and AB.

Introduced by Hinton et al.*, KD refers to the knowledge distillation method that uses the softmax output of the teacher network as a soft target. Because the output dimensions of the teacher and the student are identical, this method can be applied to any pair of network architectures. However, the output of a high-performance teacher network is not very different from the ground truth, so transferring only the output is similar to training the student with the ground truth alone, which limits the benefit of output distillation. Several approaches have therefore been proposed that distill intermediate features instead of outputs, to make better use of the information contained in the teacher network.

* Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
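For reference, a standard sketch of this output distillation loss is shown below: the outputs of both networks are softened with a temperature T and compared with a KL divergence, usually mixed with the ordinary cross-entropy on the ground-truth labels. The temperature and mixing weight are illustrative defaults, not values from this work.

```python
import torch
import torch.nn.functional as F

# Standard Hinton-style output distillation (a sketch, not part of the proposed method):
# soften both distributions with temperature T, compare with KL divergence, and mix with
# the usual cross-entropy on the hard labels.
def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```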

FitNets introduced a method that encourages a student network to mimic the hidden feature values of a teacher network. Although the approach was novel, the resulting performance gain was limited.
Since then, several state-of-the-art feature distillation methods have been proposed. Methods such as AT and Jacobian transform the feature into a representation of reduced dimension and transfer that representation to the student; despite the reduced dimension, this abstracted feature representation has been shown to improve performance.
Other state-of-the-art methods (FT, AB) were proposed to increase the amount of information transferred in distillation. FT encodes the feature into a 'factor' using an auto-encoder so that it can be transferred more easily, while AB focuses on the activations of the network and transfers only the sign of the features. Both methods improve distillation performance by increasing the amount of transferred information. However, FT and AB still deform the teacher's feature values, which leaves room for further improvement.

On ImageNet, the proposed method achieves a top-1 error of 21.65% with ResNet50, outperforming even the teacher network, ResNet152. The proposed method is evaluated on various tasks, including image classification, object detection, and semantic segmentation, and achieves a significant performance improvement in all of them.

Our Proposed Method in Detail

Let us take a closer look at the design aspects of feature distillation methods for network compression and highlight the aspects of our approach that set it apart from other methods. Starting from a general form of the feature distillation loss, we consider each design aspect and modify the loss accordingly.

F_t and F_s denote the features of the teacher network and the student network, respectively. Since their dimensions generally differ, a teacher transform T_t and a student transform T_s are applied to F_t and F_s to bring them into a common form, and a distance d between the transformed features is used as the distillation loss L_distill. In other words, the general loss function of feature distillation takes the following form:

L_{distill} = d(T_t(\boldsymbol{F}_t), T_s(\boldsymbol{F}_s)).

The student network is trained by minimizing the distillation loss L_distill.
It is desirable to design the distillation loss so that all vital feature information of the teacher is transferred without loss. To this end, we design a new feature distillation loss that transfers as much of the teacher's critical information as possible, and we analyze the design aspects of the feature distillation loss: teacher transform, student transform, distillation feature position, and distance function.
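As a point of reference, a generic sketch of this loss form might look as follows; the transforms and the distance function are placeholders to be specified by each method, not part of this work's released code.

```python
import torch.nn.functional as F

# Generic feature distillation loss: apply a teacher transform and a student transform,
# then measure a distance between the transformed features (MSE is only a default here).
def feature_distill_loss(feat_t, feat_s, transform_t, transform_s, distance=F.mse_loss):
    return distance(transform_t(feat_t), transform_s(feat_s))
```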

Let us see how each aspect can affect the new loss function.

The Overall framework of our proposed method

  1. Teacher transform
    T_t converts the teacher's hidden features into an easy-to-transfer form. It is an essential part of feature distillation and the leading cause of missing information in distillation. Since features contain both adverse and beneficial information, it is crucial to distinguish the two and avoid discarding the beneficial part. In the proposed method, we use a new ReLU activation, called margin ReLU, as the teacher transform. With margin ReLU, the positive (beneficial) information is passed through without any modification, while the negative (adverse) information is suppressed. As a result, the proposed method can perform distillation without losing beneficial information.
  2. Student transform
    Most methods apply the same function to the student transform T_s as to the teacher transform T_t. In our method, instead, the student feature is transformed by a 1×1 convolution so that its dimension matches the teacher's; we use this asymmetric form of transformation as the student transform.
  3. Distillation feature position
    Besides the type of feature transformation, we must carefully choose the location at which distillation occurs. ReLU lets beneficial (positive) information pass through and filters out adverse (negative) information, so knowledge distillation must be designed with this loss of information in mind. In our method, we design the distillation loss to use the features in front of the ReLU function, i.e., the pre-ReLU position. Both positive and negative values are preserved at the pre-ReLU position without any deformation, which makes it well suited for distillation.
  4. Distance function
    Most distillation methods adopt the L2 or L1 distance. In our method, however, the distance function must be designed to match our teacher transform and our pre-ReLU distillation position. At the pre-ReLU position, the negative values of the teacher feature contain adverse information: they are blocked by the ReLU activation and therefore never used by the teacher network. Transferring all of these values may hurt the student network. To tackle this issue, we propose a new distance function, called the partial L2 distance, which skips the distillation of information in the negative region.

Distillation Position

Position of the distillation target layer. We place the distillation layer between the last block and the first ReLU. The exact location differs according to the network architecture. 

The activation function is a crucial component of neural networks; it is what gives a network its non-linearity, and the choice of activation function substantially influences model performance. Among the various activation functions, the rectified linear unit (ReLU), or a close variant of it, is used in most networks for computer vision tasks. ReLU applies a linear mapping to positive values, whereas negative values are eliminated and fixed to zero. This elimination prevents unnecessary information from propagating backward. With a careful design of knowledge distillation around ReLU, it is therefore possible to transfer only the necessary information. Unfortunately, most previous research does not seriously consider the activation function. We define the minimum unit of a network, such as the residual block in ResNet or the Conv-ReLU in VGG, as a layer block. In most methods, distillation occurs at the end of the layer block, regardless of where the ReLU lies.
In our algorithm, the distillation position lies between the end of the layer block and the first ReLU that follows it. This positioning lets the student access the teacher's preserved information before it passes through the ReLU. The diagram above depicts the distillation position for several architectures. For straightforward and residual blocks, the only difference between our method and others is whether distillation happens before or after the ReLU. For networks with pre-activation, however, the discrepancy is larger: since there is no ReLU at the end of each block, we locate the first ReLU in the next block. In a structure such as PyramidNet, our method reaches the ReLU after the 1×1 convolution layer. This positioning strategy has a significant influence on performance; as demonstrated in our experiments, the new distillation position substantially improves the performance of the student.
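As an illustration of how a pre-ReLU feature can be extracted in practice, the following sketch hooks the ReLU at the end of the last residual block of a torchvision ResNet-50. This is not the authors' released code; the caveats about in-place ReLUs and module reuse in torchvision are noted in the comments.

```python
import torch
import torchvision

# Sketch: capture the pre-ReLU feature of a torchvision ResNet-50 (recent torchvision) with
# a forward *pre*-hook. Two caveats are handled: torchvision builds its ReLUs with
# inplace=True, so the input must be cloned before the ReLU overwrites it, and each
# Bottleneck reuses a single ReLU module three times, so the value stored by the final
# call (after the residual addition) is the feature we want.
teacher = torchvision.models.resnet50(weights=None).eval()
captured = {}

def capture_pre_relu(name):
    def pre_hook(module, inputs):
        captured[name] = inputs[0].clone()  # clone before the in-place ReLU modifies it
    return pre_hook

# Last block of the last stage; its ReLU's final invocation sees the pre-ReLU feature.
teacher.layer4[-1].relu.register_forward_pre_hook(capture_pre_relu("stage4"))

with torch.no_grad():
    teacher(torch.randn(1, 3, 224, 224))
print(captured["stage4"].shape)  # torch.Size([1, 2048, 7, 7])
```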

Loss Function

A comparison of the conventional ReLU, teacher transform in Heo et al., and our proposed method

The margin ReLU used as the teacher transform is defined as

\sigma_{m}(x) = \max(x, m),

where the margin m is a negative value. The partial L2 distance, which skips the distillation of information in the negative region, is

d_p(T, S) = \sum\nolimits_i^{WHC} \begin{cases} 0 & \text{if } S_i \le T_i \le 0, \\ (T_i - S_i)^2 & \text{otherwise}. \end{cases}

Combining the two, the proposed distillation loss is

L_{distill} = d_p(\sigma_{m_\mathcal{C}}(\boldsymbol{F}_t), r(\boldsymbol{F}_s)),*

where \sigma_{m_\mathcal{C}} denotes the margin ReLU with a channel-wise margin m_\mathcal{C} and r denotes the student transform (a 1×1 convolution followed by batch normalization).

* The general loss function: L_{distill} = d({T_t}(\boldsymbol{F}_t), T_s(\boldsymbol{F}_s)).
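A minimal PyTorch sketch of these two components might look as follows. It is not the authors' released code: the per-channel margin tensor is an assumed input, and the student feature is assumed to have already passed through the student transform r.

```python
import torch

# Sketch of the margin ReLU and partial L2 distance defined above. `margin` is assumed to be
# a tensor of per-channel negative margin values with shape [1, C, 1, 1] so that it
# broadcasts over the teacher feature map.

def margin_relu(teacher_feat: torch.Tensor, margin: torch.Tensor) -> torch.Tensor:
    # sigma_m(x) = max(x, m): positives pass unchanged, negatives are clipped to the margin.
    return torch.max(teacher_feat, margin)

def partial_l2(teacher_feat: torch.Tensor, student_feat: torch.Tensor) -> torch.Tensor:
    # d_p(T, S): skip positions where S_i <= T_i <= 0, use the squared error everywhere else.
    skip = (teacher_feat <= 0) & (student_feat <= teacher_feat)
    sq_err = (teacher_feat - student_feat) ** 2
    return torch.where(skip, torch.zeros_like(sq_err), sq_err).sum()

def overhaul_distill_loss(teacher_feat, student_feat_transformed, margin):
    # L_distill = d_p(sigma_{m_C}(F_t), r(F_s)); `student_feat_transformed` is assumed to be r(F_s).
    return partial_l2(margin_relu(teacher_feat, margin), student_feat_transformed)
```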

Batch Normalization

We further investigate batch normalization in the context of knowledge distillation. Batch normalization is used in most recent network architectures to stabilize training, and a batch-norm layer behaves differently in training mode and in evaluation mode. When performing knowledge distillation, it is therefore necessary to decide whether to run the teacher in training mode or in evaluation mode. The student's feature is normalized batch by batch, so the teacher's feature must be normalized in the same way; in other words, the teacher's batch normalization layers should be set to training mode when its features are extracted for distillation. To accomplish this, we attach a batch normalization layer after the 1×1 convolution and use them together as the student transform, while taking the teacher's knowledge in training mode. As a result, the proposed method gains additional performance. This issue applies to all knowledge distillation methods, including the proposed one.
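The following sketch illustrates this student transform and the teacher-side batch normalization mode. The channel sizes and helper names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

# Sketch of the student transform described above: a 1x1 convolution followed by batch
# normalization that maps the student feature to the teacher's channel count.
class StudentRegressor(nn.Module):
    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(student_channels, teacher_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(teacher_channels)

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        return self.bn(self.conv(student_feat))

regressor = StudentRegressor(student_channels=128, teacher_channels=256)
mapped = regressor(torch.randn(4, 128, 8, 8))  # -> shape [4, 256, 8, 8]

# To normalize the teacher's feature batch by batch, its BN layers are run in training mode
# while extracting features for distillation (weights stay frozen under no_grad, though the
# BN running statistics will still be updated in this mode).
# teacher.train()
# with torch.no_grad():
#     teacher_feat = ...  # e.g., extract the pre-ReLU feature with the hook sketched earlier
```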

Experiment Results

We have assessed the efficiency of our distillation method in several domains. The first task is fundamental in machine learning: classification. Since most existing distillation methods have reported their performance under this domain, we also have used it to better compare our results to theirs. The performance of knowledge distillation generally depends on several factors: the type of network architecture used, the performance of the teacher, and the kind of training scheme under implementation. To control other factors and make a fair comparison, we reproduced the algorithms of other methods based on their codes and papers. The experiments were conducted on NAVER Smart Machine Learning (NSML) platform with PyTorch.

Setup

| Setup | Compression type | Teacher network | Student network | # of params (teacher) | # of params (student) | Compression ratio |
|---|---|---|---|---|---|---|
| (a) | Depth | WideResNet 28-4 | WideResNet 16-4 | 5.87M | 2.77M | 47.2% |
| (b) | Channel | WideResNet 28-4 | WideResNet 28-2 | 5.87M | 1.47M | 25.0% |
| (c) | Depth & channel | WideResNet 28-4 | WideResNet 16-2 | 5.87M | 0.70M | 11.9% |
| (d) | Different architecture | WideResNet 28-4 | ResNet 56 | 5.87M | 0.86M | 14.7% |
| (e) | Different architecture | PyramidNet-200 (240) | WideResNet 28-4 | 26.84M | 5.87M | 21.9% |
| (f) | Different architecture | PyramidNet-200 (240) | PyramidNet-110 (84) | 26.84M | 3.91M | 14.6% |

Experiment settings with various network architectures on CIFAR-100. Network architectures are denoted WideResNet (depth)-(channel multiplication) for Wide Residual Networks and PyramidNet-(depth) (channel factor) for PyramidNet.

The table below shows the performance of various knowledge distillation methods on CIFAR-100.


| Setup | Teacher | Baseline | KD | FitNets | AT | Jacobian | FT | AB | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| (a) | 21.09 | 22.72 | 21.69 | 21.85 | 22.07 | 22.18 | 21.72 | 21.36 | 20.89 |
| (b) | 21.09 | 24.88 | 23.43 | 23.94 | 23.80 | 23.70 | 23.41 | 23.19 | 21.98 |
| (c) | 21.09 | 27.32 | 26.47 | 26.30 | 26.56 | 26.71 | 25.91 | 26.02 | 24.08 |
| (d) | 21.09 | 27.68 | 26.76 | 26.35 | 26.66 | 26.60 | 26.20 | 26.04 | 24.44 |
| (e) | 15.57 | 21.09 | 20.97 | 22.16 | 19.28 | 20.59 | 19.04 | 20.46 | 17.80 |
| (f) | 15.57 | 22.58 | 21.68 | 23.79 | 19.93 | 23.49 | 19.53 | 20.89 | 18.89 |

Measurement is the error rate (%) of classification: the lower, the better. ‘Baseline’ represents a result without distillation. 


| Network | # of params (ratio) | Method | Top-1 error (%) | Top-5 error (%) |
|---|---|---|---|---|
| ResNet152 | 60.19M | Teacher | 21.69 | 5.95 |
| ResNet50 | 25.56M (42.5%) | Baseline | 23.72 | 6.97 |
| | | AT | 22.75 | 6.35 |
| | | FT | 22.80 | 6.49 |
| | | AB | 23.47 | 6.94 |
| | | Proposed | 21.65 | 5.83 |

| Network | # of params (ratio) | Method | Top-1 error (%) | Top-5 error (%) |
|---|---|---|---|---|
| ResNet50 | 25.56M | Teacher | 23.84 | 7.14 |
| MobileNet | 4.23M (16.5%) | Baseline | 31.13 | 11.24 |
| | | AT | 30.44 | 10.67 |
| | | FT | 30.12 | 10.50 |
| | | AB | 31.11 | 11.29 |
| | | Proposed | 28.75 | 9.66 |

Results on the ILSVRC 2012 validation set. Networks are trained and evaluated at 224×224 resolution with a single crop. ‘Baseline’ represents a result without distillation. 

| Network | # of params | Method | mAP (%) |
|---|---|---|---|
| ResNet50-SSD | 36.7M | Teacher (T1) | 76.79 |
| VGG-SSD | 26.3M | Teacher (T2) | 77.50 |
| ResNet18-SSD | 20.0M | Baseline | 71.61 |
| | | Proposed-T1 | 73.08 |
| | | Proposed-T2 | 72.38 |
| MobileNet-SSD lite | 6.5M | Baseline | 67.58 |
| | | Proposed-T1 | 68.54 |
| | | Proposed-T2 | 68.45 |

Object detection results of SSD300 in PASCAL VOC2007 test set (unit: mean Average Precision (mAP)): the higher, the better.

| Backbone | # of params (ratio) | Method | mIoU |
|---|---|---|---|
| ResNet101 | 59.3M | Teacher | 77.39 |
| ResNet18 | 16.6M (28.0%) | Baseline | 71.79 |
| | | Proposed | 73.24 |
| MobileNetV2 | 5.8M (9.8%) | Baseline | 68.44 |
| | | Proposed | 71.36 |

Semantic segmentation based on DeepLabV3+ on the PASCAL VOC 2012 test set (Measurement of performance: mean Intersection over Union (mIoU))

Conclusion

We propose a new knowledge distillation method, together with an investigation of the fundamental design aspects of existing feature distillation methods. We have shown the effectiveness of the pre-ReLU position and proposed a new loss function that enables effective feature distillation at that position, consisting of a new teacher transform (margin ReLU) and a new distance function (partial L2). We have also examined the batch normalization mode of the teacher network and obtained additional performance improvements. Through experiments with various networks on many tasks, we have demonstrated that the proposed method significantly outperforms existing state-of-the-art feature distillation methods.