simplification of the LSTM provides a network that yields classification accuracies at least as good
as those of the standard LSTM and often substantially better – a result not achieved by the models
proposed in the aforementioned studies.
3 JUST ANOTHER NETWORK
Recurrent neural networks (RNNs) typically create a lossy summary $h_T$ of a sequence. It is lossy
because it maps an arbitrarily long sequence $x_{1:T}$ into a fixed-length vector. As mentioned before,
recent work has shown that this forgetting property of LSTMs is one of the most important (Greff
et al., 2015; Jozefowicz et al., 2015). Hence, we propose a simple transformation of the LSTM
that leaves it with only a forget gate, and since this is Just Another NETwork (JANET), we name it
accordingly. We start from the standard LSTM (Lipton et al., 2015), which, with symbols taking their
standard meaning, is defined as
$$
\begin{aligned}
i_t &= \sigma(U_i h_{t-1} + W_i x_t + b_i) \\
o_t &= \sigma(U_o h_{t-1} + W_o x_t + b_o) \\
f_t &= \sigma(U_f h_{t-1} + W_f x_t + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
\tag{1}
$$
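For concreteness, a single step of Eq. (1) can be sketched in NumPy as below. This is an illustrative sketch, not a reference implementation: the parameter container `params`, the helper `sigmoid`, and all shapes are our own assumptions, with the weight names mirroring the symbols $U_*$, $W_*$, and $b_*$ above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of the standard LSTM of Eq. (1).

    params holds the recurrent (U_*) and input (W_*) weight matrices and the
    biases (b_*) for the input, output, and forget gates and the candidate
    cell; products involving gates are elementwise.
    """
    U_i, W_i, b_i = params["U_i"], params["W_i"], params["b_i"]
    U_o, W_o, b_o = params["U_o"], params["W_o"], params["b_o"]
    U_f, W_f, b_f = params["U_f"], params["W_f"], params["b_f"]
    U_c, W_c, b_c = params["U_c"], params["W_c"], params["b_c"]

    i_t = sigmoid(U_i @ h_prev + W_i @ x_t + b_i)
    o_t = sigmoid(U_o @ h_prev + W_o @ x_t + b_o)
    f_t = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)
    c_t = f_t * c_prev + i_t * np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```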
To transform the above LSTM into the JANET architecture, the input and output gates are removed. It
seems sensible to have the accumulation and deletion of information be related; therefore we couple
the input and forget modulation as in Greff et al. (2015), which is similar to the leaky unit
implementation (Jaeger, 2002, §8.1). Furthermore, the $\tanh$ activation of $h_t$ shrinks the gradients during
backpropagation, which could exacerbate the vanishing gradient problem, and since the weights
$U_*$ can accommodate values beyond the range $[-1, 1]$, we can remove this unnecessary, potentially
problematic, $\tanh$ nonlinearity. The resulting JANET is given by
$$
\begin{aligned}
f_t &= \sigma(U_f h_{t-1} + W_f x_t + b_f) \\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
h_t &= c_t.
\end{aligned}
\tag{2}
$$
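A minimal NumPy sketch of the JANET step in Eq. (2) follows; as before, the function and argument names are illustrative assumptions rather than the authors' code. Only the forget-gate and candidate-cell parameters remain.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def janet_step(x_t, h_prev, c_prev, U_f, W_f, b_f, U_c, W_c, b_c):
    """One step of the JANET update in Eq. (2): a single forget gate couples
    retention (f_t) and accumulation (1 - f_t), and h_t = c_t."""
    f_t = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)
    c_tilde = np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    c_t = f_t * c_prev + (1.0 - f_t) * c_tilde
    h_t = c_t  # no output gate and no output tanh
    return h_t, c_t
```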
Intuitively, allowing slightly more information to accumulate than the amount forgotten would make
sequence analysis easier. We found this to be true empirically by subtracting a pre-specified value
$\beta$ from the input control component$^2$, as given by
$$
\begin{aligned}
s_t &= U_f h_{t-1} + W_f x_t + b_f \\
\tilde{c}_t &= \tanh(U_c h_{t-1} + W_c x_t + b_c) \\
c_t &= \sigma(s_t) \odot c_{t-1} + \big(1 - \sigma(s_t - \beta)\big) \odot \tilde{c}_t \\
h_t &= c_t.
\end{aligned}
\tag{3}
$$
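The $\beta$-shifted update of Eq. (3) changes only the gating of the candidate cell. A sketch under the same assumptions as the previous snippet (with $\beta$ broadcast as a constant, cf. footnote 2) is:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def janet_step_beta(x_t, h_prev, c_prev, U_f, W_f, b_f, U_c, W_c, b_c, beta=1.0):
    """JANET update with the beta shift of Eq. (3): the input control uses
    sigma(s_t - beta), so slightly more information is accumulated than is
    forgotten when beta > 0."""
    s_t = U_f @ h_prev + W_f @ x_t + b_f
    c_tilde = np.tanh(U_c @ h_prev + W_c @ x_t + b_c)
    c_t = sigmoid(s_t) * c_prev + (1.0 - sigmoid(s_t - beta)) * c_tilde
    h_t = c_t
    return h_t, c_t
```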
We speculate that the value for $\beta$ is dataset dependent; however, we found that setting $\beta = 1$ provides
the best results for the datasets analysed in this study, which have sequence lengths varying from 200
to 784.
If we follow the standard parameter initialization scheme for LSTMs, the JANET quickly encounters
a problem. The standard procedure is to initialize the weights $U_*$ and $W_*$ to be distributed as
$\mathcal{U}\big[-\sqrt{6}/\sqrt{n_l + n_{l+1}},\ \sqrt{6}/\sqrt{n_l + n_{l+1}}\big]$, where $n_l$ is the size of each layer $l$ (He et al., 2015b; Glorot and
Bengio, 2010), and to initialize all biases to zero except for the forget gate bias $b_f$, which is initialized
to one (Jozefowicz et al., 2015). Hence, if the values of both input and hidden layers are zero-centred
over time, $f_t$ will be centred around $\sigma(1) = 0.7311$. In this case, the memory values $c_t$ of the
JANET would not be retained for more than a couple of time steps. This problem is best exemplified
by the MNIST dataset (LeCun, 1998) processed in scanline order (Cooijmans et al., 2016); each
training example contains many consecutive zero-valued subsequences, each of length 10 to 20. In
the best case scenario – a length-10 zero-valued subsequence – the memory values at the end of the
subsequence would be centred around
$$
c_{t+10} = f_t^{10}\, c_t \le 0.7311^{10}\, c_t \le 0.04363\, c_t.
\tag{4}
$$
$^2$ $\beta$ is a constant-valued column vector of the appropriate size.
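Both the initialization range and the decay in Eq. (4) are easy to check numerically; the snippet below is a small illustrative sketch in which the layer sizes are arbitrary assumptions.

```python
import numpy as np

# With zero-centred inputs, the forget gate is centred at sigma(b_f) = sigma(1).
sigma_1 = 1.0 / (1.0 + np.exp(-1.0))   # ~0.7311
print(sigma_1 ** 10)                    # ~0.0436: the Eq. (4) decay over 10 steps

# Glorot/Xavier-style uniform range for weights between layers of size n_l and
# n_{l+1} (sizes chosen arbitrarily here, for illustration only).
n_l, n_lp1 = 128, 128
limit = np.sqrt(6.0) / np.sqrt(n_l + n_lp1)
W = np.random.uniform(-limit, limit, size=(n_lp1, n_l))
```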