(Wan et al., 2013), multiplicative Gaussian noise (Srivastava et al., 2014), etc.). We show that the dropout objective, in effect, minimises the Kullback–Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process (marginalised over its finite rank covariance function parameters). Due to space constraints we refer the reader to the appendix for an in-depth review of dropout, Gaussian processes, and variational inference (section 2), as well as the main derivation for dropout and its variations (section 3). The results are summarised here, and in the next section we obtain uncertainty estimates for dropout NNs.
Let $\widehat{\mathbf{y}}$ be the output of a NN model with $L$ layers and a loss function $E(\cdot, \cdot)$ such as the softmax loss or the Euclidean loss (square loss). We denote by $\mathbf{W}_i$ the NN's weight matrices of dimensions $K_i \times K_{i-1}$, and by $\mathbf{b}_i$ the bias vectors of dimensions $K_i$ for each layer $i = 1, \dots, L$. We denote by $\mathbf{y}_i$ the observed output corresponding to input $\mathbf{x}_i$ for $1 \le i \le N$ data points, and the input and output sets as $\mathbf{X}, \mathbf{Y}$. During NN optimisation a regularisation term is often added. We often use $L_2$ regularisation weighted by some weight decay $\lambda$, resulting in a minimisation objective (often referred to as cost),
$$\mathcal{L}_{\text{dropout}} := \frac{1}{N} \sum_{i=1}^{N} E(\mathbf{y}_i, \widehat{\mathbf{y}}_i) + \lambda \sum_{i=1}^{L} \big( \|\mathbf{W}_i\|_2^2 + \|\mathbf{b}_i\|_2^2 \big). \tag{1}$$
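For concreteness, eq. (1) can be sketched in NumPy for a regression network with the Euclidean (square) loss; the function and argument names below are illustrative and not taken from any particular library.

```python
import numpy as np

def dropout_objective(Y, Y_hat, weights, biases, lam):
    """Sketch of eq. (1): average per-point loss plus L2 weight decay.

    Y, Y_hat -- arrays of shape (N, D) with observed and predicted outputs.
    weights, biases -- lists of the L weight matrices W_i and bias vectors b_i.
    lam -- the weight-decay coefficient lambda.
    The Euclidean (square) loss is used for E(., .).
    """
    N = Y.shape[0]
    data_term = np.sum((Y - Y_hat) ** 2) / N                  # (1/N) sum_i E(y_i, y_hat_i)
    reg_term = lam * sum(np.sum(W ** 2) + np.sum(b ** 2)      # lambda sum_i (||W_i||^2 + ||b_i||^2)
                         for W, b in zip(weights, biases))
    return data_term + reg_term
```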
With dropout, we sample binary variables for every input point and for every network unit in each layer (apart from the last one). Each binary variable takes value 1 with probability $p_i$ for layer $i$. A unit is dropped (i.e. its value is set to zero) for a given input if its corresponding binary variable takes value 0. We use the same values in the backward pass propagating the derivatives to the parameters.
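A minimal sketch of this sampling scheme for a single input point is given below, assuming ReLU hidden layers; a unit's activation is zeroed before being multiplied by the next weight matrix, and the sampled masks are returned so that the backward pass can reuse exactly the same values (an autodiff framework would do this automatically). All names are illustrative.

```python
import numpy as np

def dropout_forward(x, weights, biases, probs, rng=None):
    """Dropout forward pass for one input point x (sketch).

    A Bernoulli(p_i) variable is drawn for every unit feeding into layer i;
    units of the last layer are never dropped.  The masks are returned so
    the backward pass can reuse the same values.
    """
    rng = np.random.default_rng() if rng is None else rng
    masks, h = [], x
    L = len(weights)
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = rng.binomial(1, probs[i], size=h.shape)   # binary variables for layer i's inputs
        masks.append(z)
        h = W @ (h * z) + b                           # dropped units contribute zero
        if i < L - 1:
            h = np.maximum(h, 0.0)                    # ReLU on all but the output layer
    return h, masks
```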
In comparison to the non-probabilistic NN, the deep Gaussian process is a powerful tool in statistics that allows us to model distributions over functions. Assume we are given a covariance function of the form
$$\mathbf{K}(\mathbf{x}, \mathbf{y}) = \int p(\mathbf{w})\, p(b)\, \sigma(\mathbf{w}^T \mathbf{x} + b)\, \sigma(\mathbf{w}^T \mathbf{y} + b)\, \mathrm{d}\mathbf{w}\, \mathrm{d}b$$
with some element-wise non-linearity $\sigma(\cdot)$ and distributions $p(\mathbf{w}), p(b)$.
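Since $\mathbf{K}(\mathbf{x}, \mathbf{y})$ is an expectation over $p(\mathbf{w})$ and $p(b)$, it can be approximated by simple Monte Carlo; the sketch below assumes, purely for illustration, standard-normal $p(\mathbf{w})$, $p(b)$ and a $\tanh$ non-linearity.

```python
import numpy as np

def mc_covariance(x, y, n_samples=100_000, sigma=np.tanh, rng=None):
    """Monte Carlo estimate of K(x, y) = E_{w,b}[sigma(w.x + b) sigma(w.y + b)].

    Assumes standard-normal p(w) and p(b); x and y are 1-D arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    w = rng.standard_normal((n_samples, d))     # draws from p(w)
    b = rng.standard_normal(n_samples)          # draws from p(b)
    return np.mean(sigma(w @ x + b) * sigma(w @ y + b))
```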
In sections 3 and 4 in the appendix we show that a deep Gaussian process with $L$ layers and covariance function $\mathbf{K}(\mathbf{x}, \mathbf{y})$ can be approximated by placing a variational distribution over each component of a spectral decomposition of the GPs' covariance functions. This spectral decomposition maps each layer of the deep GP to a layer of explicitly represented hidden units, as will be briefly explained next.
Let $\mathbf{W}_i$ be a (now random) matrix of dimensions $K_i \times K_{i-1}$ for each layer $i$, and write $\boldsymbol{\omega} = \{\mathbf{W}_i\}_{i=1}^{L}$. A priori, we let each row of $\mathbf{W}_i$ distribute according to the $p(\mathbf{w})$ above. In addition, assume vectors $\mathbf{m}_i$ of dimensions $K_i$ for each GP layer. The predictive probability of the deep GP model (integrated w.r.t. the finite rank covariance function parameters $\boldsymbol{\omega}$) given some precision parameter $\tau > 0$ can be parametrised as
$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) = \int p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega})\, p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})\, \mathrm{d}\boldsymbol{\omega} \tag{2}$$
$$p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\omega}) = \mathcal{N}\big(\mathbf{y};\, \widehat{\mathbf{y}}(\mathbf{x}, \boldsymbol{\omega}),\, \tau^{-1}\mathbf{I}_D\big)$$
$$\widehat{\mathbf{y}}\big(\mathbf{x}, \boldsymbol{\omega} = \{\mathbf{W}_1, \dots, \mathbf{W}_L\}\big) = \sqrt{\tfrac{1}{K_L}}\, \mathbf{W}_L\, \sigma\Big( \dots \sqrt{\tfrac{1}{K_1}}\, \mathbf{W}_2\, \sigma\big(\mathbf{W}_1 \mathbf{x} + \mathbf{m}_1\big) \dots \Big)
$$
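The middle layers are elided in the equation above; the sketch below spells out one plausible reading, in which each hidden layer adds its $\mathbf{m}_i$ before the non-linearity and each weight matrix is scaled by the square root of the inverse width of the layer it multiplies, following the inner terms of the equation. The names and this reading of the elided layers are illustrative assumptions.

```python
import numpy as np

def deep_gp_y_hat(x, Ws, ms, sigma=np.tanh):
    """Sketch of y_hat(x, omega = {W_1, ..., W_L}) for L >= 2.

    Ws[i-1] holds W_i with shape (K_i, K_{i-1}); ms[i-1] holds m_i.
    The elided middle layers are read as repeating the inner pattern
    sqrt(1/K) W_i sigma(...) + m_i.
    """
    h = sigma(Ws[0] @ x + ms[0])                          # sigma(W_1 x + m_1)
    for i in range(1, len(Ws) - 1):
        scale = np.sqrt(1.0 / Ws[i].shape[1])             # 1 / sqrt(width of the previous layer)
        h = sigma(scale * (Ws[i] @ h) + ms[i])
    scale = np.sqrt(1.0 / Ws[-1].shape[1])
    return scale * (Ws[-1] @ h)                           # linear output layer
```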
The posterior distribution $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$ in eq. (2) is intractable. We use $q(\boldsymbol{\omega})$, a distribution over matrices whose columns are randomly set to zero, to approximate the intractable posterior. We define $q(\boldsymbol{\omega})$ as:
$$\mathbf{W}_i = \mathbf{M}_i \cdot \text{diag}\big([\mathbf{z}_{i,j}]_{j=1}^{K_{i-1}}\big)$$
$$\mathbf{z}_{i,j} \sim \text{Bernoulli}(p_i) \quad \text{for } i = 1, \dots, L, \ j = 1, \dots, K_{i-1}$$
given some probabilities $p_i$ and matrices $\mathbf{M}_i$ as variational parameters. The binary variable $\mathbf{z}_{i,j} = 0$ then corresponds to unit $j$ in layer $i-1$ being dropped out as an input to layer $i$. The variational distribution $q(\boldsymbol{\omega})$ is highly multi-modal, inducing strong joint correlations over the rows of the matrices $\mathbf{W}_i$ (which correspond to the frequencies in the sparse spectrum GP approximation).
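One draw from $q(\boldsymbol{\omega})$ therefore amounts to zeroing random columns of each $\mathbf{M}_i$; a minimal sketch (names illustrative):

```python
import numpy as np

def sample_q(Ms, probs, rng=None):
    """Draw one sample {W_1, ..., W_L} from q(omega) as defined above.

    Ms -- variational parameter matrices M_i of shape (K_i, K_{i-1}).
    probs -- probabilities p_i, one per layer.
    Column j of W_i is zeroed exactly when z_{i,j} = 0, i.e. unit j of
    layer i-1 is dropped as an input to layer i.
    """
    rng = np.random.default_rng() if rng is None else rng
    Ws = []
    for M, p in zip(Ms, probs):
        z = rng.binomial(1, p, size=M.shape[1])   # z_{i,j} ~ Bernoulli(p_i)
        Ws.append(M * z)                          # equals M_i . diag(z_i)
    return Ws
```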
We minimise the KL divergence between the approximate posterior $q(\boldsymbol{\omega})$ above and the posterior of the full deep GP, $p(\boldsymbol{\omega} \mid \mathbf{X}, \mathbf{Y})$. This KL is our minimisation objective
$$- \int q(\boldsymbol{\omega}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega} + \text{KL}\big(q(\boldsymbol{\omega})\,\|\,p(\boldsymbol{\omega})\big). \tag{3}$$
We rewrite the first term as a sum
$$- \sum_{n=1}^{N} \int q(\boldsymbol{\omega}) \log p(\mathbf{y}_n \mid \mathbf{x}_n, \boldsymbol{\omega})\, \mathrm{d}\boldsymbol{\omega}$$
and approximate each term in the sum by Monte Carlo integration with a single sample $\widehat{\boldsymbol{\omega}}_n \sim q(\boldsymbol{\omega})$ to get an unbiased estimate $-\log p(\mathbf{y}_n \mid \mathbf{x}_n, \widehat{\boldsymbol{\omega}}_n)$. We further approximate the second term in eq. (3) and obtain $\sum_{i=1}^{L} \big( \frac{p_i l^2}{2} \|\mathbf{M}_i\|_2^2 + \frac{l^2}{2} \|\mathbf{m}_i\|_2^2 \big)$ with prior length-scale $l$ (see section 4.2 in the appendix). Given model precision $\tau$ we scale the result by the constant $1/\tau N$ to obtain the objective:
$$\mathcal{L}_{\text{GP-MC}} \propto \frac{1}{N} \sum_{n=1}^{N} \frac{-\log p(\mathbf{y}_n \mid \mathbf{x}_n, \widehat{\boldsymbol{\omega}}_n)}{\tau} + \sum_{i=1}^{L} \bigg( \frac{p_i l^2}{2\tau N} \|\mathbf{M}_i\|_2^2 + \frac{l^2}{2\tau N} \|\mathbf{m}_i\|_2^2 \bigg). \tag{4}$$
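For a concrete regression reading of eq. (4), where $-\log p(\mathbf{y}_n \mid \mathbf{x}_n, \widehat{\boldsymbol{\omega}}_n) = \frac{\tau}{2}\|\mathbf{y}_n - \widehat{\mathbf{y}}_n\|^2$ up to additive constants, the objective can be sketched as below; `forward` stands for any implementation of $\widehat{\mathbf{y}}(\mathbf{x}, \boldsymbol{\omega})$ (e.g. the earlier sketch), and all names are illustrative assumptions rather than part of the derivation.

```python
import numpy as np

def gp_mc_objective(X, Y, Ms, ms, probs, tau, ell, forward, rng=None):
    """Sketch of eq. (4) with a Gaussian (regression) likelihood.

    One sample omega_hat_n ~ q(omega) is drawn per data point; additive
    constants independent of the variational parameters are dropped.
    ell is the prior length-scale l, tau the model precision.
    """
    rng = np.random.default_rng() if rng is None else rng
    N = X.shape[0]
    data_term = 0.0
    for n in range(N):
        Ws_n = [M * rng.binomial(1, p, size=M.shape[1])    # one draw from q(omega)
                for M, p in zip(Ms, probs)]
        resid = Y[n] - forward(X[n], Ws_n, ms)
        data_term += 0.5 * np.sum(resid ** 2)              # -log p(y_n | x_n, w_n) / tau + const
    data_term /= N
    reg_term = sum(p * ell**2 / (2 * tau * N) * np.sum(M ** 2)
                   + ell**2 / (2 * tau * N) * np.sum(m ** 2)
                   for M, m, p in zip(Ms, ms, probs))
    return data_term + reg_term
```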