JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, APRIL 2020
representation that is the optimal prediction of the teacher's intermediate representations. Essentially, hints act as a form of regularization; therefore, the pair of hint and guided layers (the latter being a hidden layer of the student) has to be carefully chosen so that the student is not over-regularized.
Inspired by [196], many endeavors have been made to study how to choose, transform, and match the hint layer(s) and the guided layer(s) via various layer transformations (e.g., the transformer in [91], [115]) and distance metrics (e.g., MMD [103]). Generally, the hint learning objective can be written as:
$$\mathcal{L}(F_T, F_S) = \mathcal{D}\left(TF_t(F_T),\, TF_s(F_S)\right) \qquad (10)$$
where $F_T$ and $F_S$ are the selected hint and guided layers of the teacher and the student, respectively. $TF_t$ and $TF_s$ are the transformer or regressor functions for the hint layer of the teacher and the guided layer of the student. $\mathcal{D}(\cdot)$ is the distance function (e.g., $\ell_2$) measuring the similarity of the hint and the guided layers.
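To make Eq. (10) concrete, the following is a minimal NumPy sketch of the hint-learning objective, assuming an identity teacher transform, a linear (1 × 1-convolution-like) student transform that matches the student's channel dimension to the teacher's, and an $\ell_2$ distance; all names here are illustrative, not the implementation of any particular paper.

```python
import numpy as np

def tf_t(f_t):
    """Teacher transform TF_t: here simply the identity."""
    return f_t

def tf_s(f_s, w):
    """Student transform TF_s: a linear projection (1x1-conv-like)
    mapping the student's channels to the teacher's channel count."""
    c_s, h, wd = f_s.shape          # f_s: (C_s, H, W), w: (C_t, C_s)
    return (w @ f_s.reshape(c_s, -1)).reshape(-1, h, wd)

def hint_loss(f_t, f_s, w):
    """D: mean squared (l2) distance between transformed features."""
    diff = tf_t(f_t) - tf_s(f_s, w)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
f_t = rng.normal(size=(8, 4, 4))    # teacher hint feature: C_t x H x W
f_s = rng.normal(size=(4, 4, 4))    # student guided feature: C_s x H x W
w = rng.normal(size=(8, 4))         # learnable projection to align channels
print(hint_loss(f_t, f_s, w))
```

In practice `w` would be trained jointly with the student so that the projected student feature approaches the teacher's hint.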
Fig. 3 depicts the general paradigm of feature-based
distillation. It is shown that various intermediate feature
representations can be extracted from different positions
and are transformed with a certain type of regressor or
transformer. The similarity of the transformed representations is finally optimized via some distance metric $\mathcal{D}$ (e.g., $L_1$ or $L_2$ distance). In this paper, we carefully scrutinize
various design considerations of feature-based KD methods
and summarize four key factors that are usually considered:
transformation of the hint, transformation of the guided layer, position
of selected distillation feature and distance metric [91]. In the
following parts, we will analyze and categorize all existing
feature-based KD methods concerning these four aspects.
4.2.1 Transformation of hints
As pointed out in [7], the knowledge of the teacher should be easy for the student to learn. To this end, the teacher's hidden features are usually converted by a transformation function $T_t$. Note that the transformation of the teacher's knowledge is a crucial step in feature-based KD, since there is a risk of information loss in the transformation process.
The transformation methods of teacher’s knowledge in AT
[115], MINILM [241], FSP [270], ASL [133], Jacobian [214],
KP [284], SVD [128], SP [229], MEAL [210], KSANC [31]
and NST [103] incur knowledge loss due to the reduction of feature dimensionality. Specifically, AT [115] and
MINILM [241] focus on attention mechanisms (e.g., self-attention [230]), using an attention transformer $T_t$ to collapse the activation tensor $F \in \mathbb{R}^{C\times H\times W}$ (i.e., $C$ feature maps) into a spatial attention map $F \in \mathbb{R}^{H\times W}$. FSP [270] and ASL [133] calculate the information flow of the distillation based on Gramian matrices, through which the tensor $F \in \mathbb{R}^{C\times H\times W}$ is transformed to $G \in \mathbb{R}^{C\times N}$, where $N$ represents the number of matrices.
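An attention-style teacher transform of the kind AT uses can be sketched as follows; the squared activations (power $p=2$) and the $\ell_2$ normalization are common choices assumed here for illustration, not a verbatim reproduction of [115].

```python
import numpy as np

def attention_map(f, eps=1e-12):
    """Collapse a C x H x W activation tensor to one H x W attention map
    by summing squared channel activations, then l2-normalizing the map
    so teacher and student maps are compared at the same scale."""
    a = (f ** 2).sum(axis=0)             # (H, W): sum of squared channel maps
    a = a / (np.linalg.norm(a) + eps)    # scale-invariant normalization
    return a

f = np.random.default_rng(1).normal(size=(16, 8, 8))  # C=16 feature maps
a = attention_map(f)
print(a.shape)  # (8, 8)
```

The channel dimension is discarded entirely in this transform, which is exactly where the information loss discussed above occurs.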
Jacobian [214] and SVD [128] map the tensor $F \in \mathbb{R}^{C\times H\times W}$ to $G \in \mathbb{R}^{C\times N}$ based on Jacobians via a first-order Taylor series and on truncated SVD, respectively, thus inducing information loss. KP [284] projects $F \in \mathbb{R}^{C\times H\times W}$ to $M$ feature maps $F \in \mathbb{R}^{M\times H\times W}$, causing loss of knowledge.
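A truncated-SVD teacher transform of the kind just described can be sketched as follows; the reshaping convention and the rank `k` are assumed hyperparameters for illustration.

```python
import numpy as np

def truncated_svd_features(f, k):
    """Reshape a C x H x W tensor to a C x (H*W) matrix and compress it
    with a rank-k truncated SVD. Discarding the trailing singular values
    is precisely where information is lost."""
    c = f.shape[0]
    m = f.reshape(c, -1)                            # (C, H*W)
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :k] * s[:k]                         # (C, k) compressed features

f = np.random.default_rng(3).normal(size=(8, 4, 4))
g = truncated_svd_features(f, k=3)
print(g.shape)  # (8, 3)
```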
Similarly, SP [229] proposes a similarity-preserving knowl-
edge distillation based on the observation that semantically
similar inputs tend to elicit similar activation patterns. To
achieve this goal, the teacher's feature $F \in \mathbb{R}^{B\times C\times H\times W}$ is transformed to $G \in \mathbb{R}^{B\times B}$, where $B$ is the batch size. Intuitively, $G$ encodes the similarity of the activations at the teacher layer; however, this transformation also incurs information loss. MEAL [210] and KSANC [31] both use pooling to align the intermediate maps of the teacher and the student, so information is lost when transforming the teacher's knowledge. NST [103] and PKT [190] match the
distributions of neuron selectivity patterns or the affinity of
data samples between teacher and student networks. The
loss functions are based on minimizing the maximum mean discrepancy (MMD) and the Kullback-Leibler (KL) divergence between these distributions, respectively, thus causing information loss when selecting neurons.
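The SP-style batch-similarity transform mentioned above can be sketched as follows: a batch of features $F \in \mathbb{R}^{B\times C\times H\times W}$ is flattened per sample and turned into a $B \times B$ Gram matrix; the row normalization follows common practice and is an assumption here.

```python
import numpy as np

def similarity_matrix(f, eps=1e-12):
    """Map a batch of features (B, C, H, W) to a B x B similarity matrix:
    flatten each sample, take pairwise inner products, and row-normalize.
    All spatial/channel detail is collapsed into cross-sample similarity."""
    b = f.shape[0]
    q = f.reshape(b, -1)                                   # (B, C*H*W)
    g = q @ q.T                                            # (B, B) Gram matrix
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
    return g

f = np.random.default_rng(2).normal(size=(4, 3, 5, 5))     # batch of B=4
g = similarity_matrix(f)
print(g.shape)  # (4, 4)
```

The distillation loss then compares the teacher's and student's similarity matrices (e.g., via a Frobenius-norm distance), so only the pattern of cross-sample similarity is transferred.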
On the other hand, FT [115] proposes to extract factors through which transportable features are made. The transformer $TF_t$ is called the paraphraser and the transformer $TF_s$ is called the translator. To extract the teacher factors, an adequately trained paraphraser is needed. Meanwhile, to enable the student to assimilate and digest the knowledge according to its own capacity, a user-defined paraphrase ratio is used in the paraphraser to control the dimensionality of the transferred factors. Heo et al. [92] use the original teacher's feature
in the form of binarized values, namely via a separat-
ing hyperplane (activation boundary (AB)) that determines
whether neurons are activated or deactivated. Since AB only considers whether a neuron is activated, not the magnitude of its response, there is information loss in the feature binarization process. Similar information loss happens in
IRG [140], where the teacher's feature space is transformed into vertices and edges of a graph representation, from which relationship matrices are calculated. IR [4] distills the internal representations of the teacher model to the student model; however, since multiple layers of the teacher are compressed into one layer of the student, there is information loss when
matching the features. Heo et al. [91] design $TF_t$ with a margin ReLU function to exclude the negative (adverse) information and to retain the positive (beneficial) information. The margin $m$ is determined based on batch normalization [105] after a $1\times 1$ convolution in the student's transformer $TF_s$.
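A margin ReLU of the kind used as the teacher transform in [91] can be sketched as follows; the margin value chosen here is illustrative, whereas in [91] $m$ is derived per channel from batch-normalization statistics.

```python
import numpy as np

def margin_relu(x, m):
    """sigma_m(x) = x if x > 0 else m (with m <= 0): positive (beneficial)
    responses pass through unchanged, negative (adverse) responses are
    clipped to the margin instead of being matched exactly."""
    return np.where(x > 0, x, m)

x = np.array([-2.0, -0.1, 0.0, 0.5, 3.0])
print(margin_relu(x, m=-0.5))  # negatives (and zero) clipped to -0.5
```

With `m = 0` this reduces to the ordinary ReLU; a negative margin lets the student stay mildly negative where the teacher was negative without forcing it to reproduce the exact adverse magnitudes.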
Conversely, FitNet [196], RCO [108], Chung et al. [45], Wang et al. [240] and Kulkarni et al. [120] do not apply an additional transformation to the teacher's knowledge, so no information is lost on the teacher's side. However, not all knowledge contained in the teacher is beneficial for the student. As pointed out by Heo et al. [91], features include both adverse and beneficial information; thus it is important to impede the use of adverse information while avoiding the loss of beneficial information.
4.2.2 Transformation of the guided features
On the student's side, the transformation $TF_s$ of the guided features (namely, the student transform) is also an important step for effective KD. Interestingly, SOTA works such as AT [276], MINILM [241], FSP [270], Jacobian [214], FT [115], SVD [128], SP [229], KP [284], IRG [140], RCO [108], MEAL [210], KSANC [31], NST [103], Kulkarni et al. [120] and Aguilar et al. [4] use the same $TF_s$ as $TF_t$, which means the same amount of information might be lost in the transformations of both the teacher and the student.
In contrast to the transformation of the teacher, FitNet [94], AB [92], Heo et al. [91] and VID [7] do