深度学习中的知识蒸馏与学生-教师模型：视觉智能应用综述

5星 · 超过95%的资源需积分: 46 58 浏览量更新于2024-07-16 3 收藏 1.8MB PDF 举报

"这篇论文是关于面向视觉智能的知识蒸馏和Student-Teacher学习方法的综述，探讨了这两种技术在深度学习模型压缩和知识转移中的应用。作者Lin Wang和Kuk-Jin Yoon深入解析了知识蒸馏的概念、工作原理及其在解决大模型规模和数据不足问题上的有效性。" 知识蒸馏（Knowledge Distillation，KD）是一种深度学习技术，旨在通过将大型、复杂的教师模型（Teacher Model）学到的知识传递给小型、轻量级的学生模型（Student Model），以提高学生模型的性能，同时减少其计算需求和参数数量。这种方法使得高效的模型可以在边缘设备上部署，解决了大型模型的计算负担和对大量标注数据的依赖。在知识蒸馏过程中，教师模型通常是在大量标注数据上训练得到的高性能模型，它能够捕捉到数据集中的复杂模式。学生模型则试图模仿教师模型的输出概率分布，而不仅仅是单个预测标签。这样做是因为教师模型的输出包含了更多的信息，如软标签（Soft Labels），即非硬性的类别概率，可以提供更丰富的上下文信息，帮助学生模型学习到更多的细节。 Student-Teacher学习框架是实现知识蒸馏的核心结构。在这个框架中，教师模型首先对输入数据进行处理，产生软标签。然后，学生模型通过优化损失函数来拟合这些软标签，以尽可能接近教师模型的行为。这一过程可以理解为学生模型从教师模型的经验中“学习”，从而提高其泛化能力。近年来，知识蒸馏已经被广泛应用于视觉任务，如图像分类、目标检测和语义分割等。论文中，作者对这些领域的研究进展进行了全面的调查，分析了各种不同的S-T学习策略和技术，包括特征匹配、注意力机制转移等。未来的研究方向和挑战包括但不限于：优化知识蒸馏的效率，探索更有效的知识表示和传输方式，改进学生模型的架构设计以更好地适应教师模型的复杂知识，以及在无监督或半监督学习场景下扩展知识蒸馏的应用。此外，如何处理教师模型的错误和不确定性，以及如何在动态环境中适应性地更新学生模型也是当前亟待解决的问题。这篇综述论文为理解知识蒸馏和Student-Teacher学习提供了深入的见解，并为未来的研究提供了有价值的指导，展示了这些技术在视觉智能领域的重要性和潜在影响力。

JOURNAL OF L

X CLASS FILES, VOL. 14, NO. 8, APRIL 2020 7

representation that is the optimal prediction of teacher’s

intermediate representations. Essentially, the function of

hints is a form of regularization; therefore, a pair of hint and

guided (a hidden layer of the student) layers have to be care-

fully chosen such that the student is not over-regularized.

Inspired by [196], many endeavours have been taken to

study how to choose, transport and match the hint layer

(or layers) and the guided layer (or layers) via various layer

transform (e.g., transformer [91], [115]) and distance (e.g.,

MMD [103]) metrics. Generally, the hint learning objective

can be written as:

L(F

, F

) = D(T F

), T F

)) (10)

Where F

and F

are the selected hint and guided layers

of teacher and student. T F

and T F

are the transformer

or regressor functions for the hint layer of teacher and

guided layer of student. D(·) is the distance function(e.g.,

) measuring the similarity of hint and the guided layers.

Fig. 3 depicts the general paradigm of feature-based

distillation. It is shown that various intermediate feature

representations can be extracted from different positions

and are transformed with a certain type of regressor or

transformer. The similarity of the transformed representa-

tions is ﬁnally optimized via some distance metrics D (e.g.,

or L

distance). In this paper, we carefully scrutinize

various design considerations of feature-based KD methods

and summarize four key factors that are usually considered:

transformation of the hint, transform of the guided layer, position

of selected distillation feature and distance metric [91]. In the

following parts, we will analyze and categorize all existing

feature-based KD methods concerning these four aspects.

4.2.1 Transformation of hints

As pointed in [7], the knowledge of teacher should be

easy to learn by the student. To do this, teacher’s hidden

features are usually converted by a transformation function

. Note that transformation of teacher’s knowledge is

very crucial step for feature-based KD since there is risk

of information missing in the process of transformation.

The transformation methods of teacher’s knowledge in AT

[115], MINILM [241], FSP [270], ASL [133], Jacobian [214],

KP [284], SVD [128], SP [229], MEAL [210], KSANC [31]

and NST [103] cause the knowledge missing due to the

reduction of feature dimension. Speciﬁcally, AT [115] and

MINILM [241] focus on attention mechanisms (e.g., self-

attention [230]) via an attention transformer T

to transform

the activation tensor F ∈ R

C×H×W

to C feature maps

F ∈ R

H×W

. FSP [270] and ASL [133] calculate the infor-

mation ﬂow of the distillation based on Gramian matrices,

through which the tensor F ∈ R

C×H×W

is transformed to

G ∈ R

C×N

, where N represents the number of matrices.

Jacobian [214] and SVD [128] map the tensor F ∈ R

C×H×W

to G ∈ R

C×N

based on Jacobians via ﬁrst-order Taylor

series and truncated SVD, respectively, thus inducing in-

formation losing. KP [284] projects F ∈ R

C×H×W

to M

feature maps F ∈ R

M×H×W

, causing lose of knowledge.

Similarly, SP [229] proposes a similarity-preserving knowl-

edge distillation based on the observation that semantically

similar inputs tend to elicit similar activation patterns. To

achieve this goal, the teacher’s feature F ∈ R

B×C×H×W

is transformed to G ∈ R

B×B

, where B is the batch size.

Intuitively, the G encodes the similarity of the activations

at the teacher layer, however, it leads to the information

loss in transformation. MEAL [210] and KSANC [31] both

use pooing to align the intermediate map of the teacher and

student, thus there is information loss when transforming

teacher’s knowledge. NST [103] and PKT [190] match the

distributions of neuron selectivity patterns or the afﬁnity of

data samples between teacher and student networks. The

loss functions are based on minizing the maximum mean

discrepancy (MMD) and Kullback-Leibler (KL) divergence

between these distributions respectively, thus causing infor-

mation loss when selecting neurons.

On the other hand, FT [115] proposes to extract good

factors through which transportable features are made. The

transformer T F

is called paraphraser and the transformer

T F

is called translator. To extract the teacher factors, an

adequately trained paraphraser is needed. Meanwhile, to

enable the student to assimilate and digest the knowledge

according to its own capacity, a user-deﬁned paraphrase

ratio is used in the paraphraser to control the factor of the

transfer. Heo et al. [92] use the original teacher’s feature

in the form of binarized values, namely via a separat-

ing hyperplane (activation boundary (AB)) that determines

whether neurons are activated or deactivated. Since AB only

considers the activation of neurons, not the magnitude of

neuron response, thus there is information loss in the feature

binarization process. Similar information loss happens in

IRG [140], where the teacher’s feature space is transformed

to vertex and edge in graph representation where rela-

tionship matrices are calculated. IR [4] distills the internal

representations of the teacher model to the student model,

however, since multiple layers in the teacher are compressed

into one layer of the student, there is information loss when

matching the features. Heo et al. [91] design T F

with a

margin ReLU function to exclude the negative (adverse)

information and to allow using the positive (beneﬁcial)

information. The margin m is determined based on batch

normalization [105] after 1 × 1 convolution in student’s

transformer T F

Conversely, FitNet [196], RCO [108], Chung et al. [45],

Wang et al. [240], Kulkarni et al. [120] do not add additional

transformation to the teacher’s knowledge, thus no informa-

tion loses from teacher’s side. However, not all knowledge

included in the teacher is beneﬁcial for the student. As

pointed by Heo et al. [91], features include both adverse and

beneﬁcial information, thus it is important to impede the

use of adverse information and avoid missing the beneﬁcial

information.

4.2.2 Transformation of the guided features

On the other aspect, the transformation T F

of the guided

features (namely, student transform) of the student is also

an important step for effective KD. Interestingly, the SOTA

works such as AT [276], MINILM [241], FSP [270], Jacobian

[214], FT [115], SVD [128], SP [229], KP [284], IRG [140],

RCO [108],MEAL [210], KSANC [31], NST [103], Kulkarni et

al. [120] and Aguilar et al. [4] use the T F

same as the T F

which means the same amount of information might lose in

both transformations of the teacher and the student.

On the contrary, different from the transformation of

teacher, FitNet [94], AB [92], Heo et al. [91] and VID [7] do

JOURNAL OF L

X CLASS FILES, VOL. 14, NO. 8, APRIL 2020 8

TABLE 2

A taxonomy of distilling knowledge from the intermediate layers (feature maps). KP incidates knowledge projection.

Method Teacher’s T F

Student’s T F

Distillation position Distance metric Lost knowledge

FitNet [196] None 1 × 1 Conv Middle layer L

None

AT [276] Attention map Attention map End of layer group L

Channel dims

KP [284] Projection matrix Projection matrix Middle layers L

+ KP loss Spatial dims

FSP [270] FSP matrix FSP matrix End of layer group L

Spatial dims

FT [115] Encoder-decoder Encoder-decoder End of layer group L

Channel + Spatial dims

AT [276] Attention map Attention map End of layer group L

Channel dimensions

MINILM [241] Self-ttention Self-attention End of layer group KL Channel dimensions

Jacobian [214] Gradient penalty Gradient penalty End of layer group L

Channel dims

SVD [270] Truncated SVD Truncated SVD End of layer group L

Spatial dims

VID [7] None 1 × 1 Conv Middle layers K L None

IRG [140] Instance graph Instance graph Middle layers L

Spatial dims

RCO [108] None None Teacher’s train route L

None

SP [229] Similarity matrix Similarity matrix Middle layer Frobenius norm None

MEAL [210] Adaptive pooling Adaptive pooling End of layer group L

1/2

/KL/L

GAN

None

Heo [210] Margin ReLU 1 × 1 Conv Pre-ReLU Partial L

Negative features

AB [92] Binarization 1 × 1 Conv Pre-ReLU Margin L

feature values

Chung [45] None None End of layer L

GAN

None

Wang [240] None Adaptation layer Middle layer Margin L

Channel + Spatial dims

KSANC [31] Average pooling Average pooling Middle layers L

+ L

GAN

Spatial dims

Kulkarni [120] None None End of layer group L

None

IR [4] Attention matrix Attention matrix Middle layers KL+ Cosine None

Liu [140] Transform matrix Transform matrix Middle layers KL Spatial dims

NST [103] None None Intermediate layers MMD None

to change the dimension of teacher’s feature representations

and design T F

with a ‘bottleneck’ layer (1×1 convolution)

to make student’s feature to match the dimension with the

teacher. Note that Heo et al. [91] add one batch normaliza-

tion layer after 1 × 1 convolution to calculate the margin

of the proposed margin ReLU transformer of the teacher.

There are some advantages of using 1×1 convolution in KD.

First, it offers a channel-wise pooling without a reduction

of spatial dimensionality. Second, it can be used to create

a one-to-one linear projection of a stack of feature maps.

Lastly, the projection created by 1×1 convolution can also be

used to directly increase the number of feature maps in the

distillation model. In such a case, the feature representation

of student does not decrease but rather increase to match

teacher’s representation, which does not cause information

loss in the transformation of the student.

Exceptionally, some works focus on a different aspect

of the transformation of student’s feature representations.

Wang et al. [240] make the student imitate the ﬁne-grained

local feature regions close to object instances of teacher’s

representations. This goal is achieved by designing a par-

ticular adaptation function T F

to fulﬁll the imitation task.

IR [4] aims to let student acquire the abstraction in a hidden

layer of the teacher by matching the internal representations.

That is, the student is taught to know how to compress

the knowledge from multiple layers of the teacher into

a single layer of it. In such a setting, the transformation

of the student’s guided layer is done by a self-attention

transformer. Chung et al. [45], on the other hand, propose to

impose no transformation to both student and teacher, but

rather add a discriminator to distinguish the feature map

distributions of different networks (teacher or student).

4.2.3 Distillation positions of features

In addition to the transformation of teacher’s and student’s

features, distillation position of the selected features is also

very crucial in many cases. Earlier, FitNet [94], AB [92] and

Wang et al. [240] use the end of an arbitrary middle layer as

the distillation point, however it is shown to have poor dis-

tillation performance. Based on the deﬁnition of layer group

[277], in which a group of layers have same spatial size, AT

[276], FSP [270], Jacobian [214], MEAL [210], KSANC [31]

and Kulkarni et al. [120] determine the distillation point at

the end of each layer group, in contrast to FT [115] and NST

[103] where the position lies only at the end of last layer

group. Compared to FitNet, FT achieves better results since

it focuses on more informational knowledge. Whereas IRG

[140] considers all the above-mentioned critical positions,

namely the distillation position lies not only in the end of

earlier layer group but also in the end of the last layer group.

Interestingly, VID [7], CRO [108], Chung et al. [45], SP [229],

IR [4] and Liu et al. [140] generalize the selection of distilla-

tion positions by employing variational information maxi-

mization [20], curriculum learning [23], adversarial learning

[74], similarity-presentation in representation learning [100],

muti-task learning [47], reinforcement learning [173]. We

will discuss more for these methods in later sections.

4.2.4 Distance metric for measuring distillation

The quality of KD from teacher to student is usually mea-

sured by various distance metrics. The most commonly used

distance function is based on L

or L

distance. FitNet

[196], NST [276], FSP [270], SVD [128], RCO [108], FT [115]

and KSANC [31] are mainly based on L

distance, whereas

MEAL [210], Wang et al. [240] and Kulkarni et al. [120]

mainly use L

distance. On the other hand, Liu et al. [140]

剩余37页未读，继续阅读

syp_net

粉丝: 158
资源: 1187

深度学习中的知识蒸馏与学生-教师模型：视觉智能应用综述

最新《知识蒸馏》2020综述论文（来自悉尼大学）

Pytorch实现的各种知识蒸馏方法-python

序列知识蒸馏进展（来自EMNLP 2020）

What+your+idea+uh+uh+uh+good+student+teacher+relationship

student-t 粒子滤波

最新资源