In the experiments, this will be ensured by using the identity function $f_j$: $f_j(x) = x, \; \forall x$, and by setting $w_{jj} = 1.0$. We refer to this as the constant error carrousel (CEC). CEC will be LSTM's central feature (see Section 4).
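The constancy of this error flow is easy to check numerically. The following is a minimal sketch (in Python, not part of the paper) that backpropagates a scalar error signal through a self-connected linear unit with $f_j(x) = x$ and $w_{jj} = 1.0$; at every step the signal is scaled by $f_j'(net_j)\, w_{jj} = 1$, so it neither vanishes nor blows up.

```python
# Minimal sketch of the constant error carrousel (CEC): a linear unit j
# whose only connection is a self-connection with weight w_jj = 1.0.
# The error flowing back through one time step is scaled by f_j'(net_j) * w_jj.

def f_prime(net):
    return 1.0          # derivative of the identity activation f_j(x) = x

w_jj = 1.0              # fixed self-connection weight
error = 0.5             # an arbitrary error signal arriving at unit j

for _ in range(1000):   # propagate the error back through 1000 time steps
    error = f_prime(0.0) * w_jj * error

print(error)            # still 0.5: constant error flow
```

Rerunning the sketch with, say, $w_{jj} = 0.9$ or $w_{jj} = 1.1$ reproduces the vanishing or exploding error flow that motivates the CEC.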
Of course unit $j$ will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches):
1. Input weight conflict: for simplicity, let us focus on a single additional input weight $w_{ji}$. Assume that the total error can be reduced by switching on unit $j$ in response to a certain input, and keeping it active for a long time (until it helps to compute a desired output). Provided $i$ is non-zero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, $w_{ji}$ will often receive conflicting weight update signals during this time (recall that $j$ is linear): these signals will attempt to make $w_{ji}$ participate in (1) storing the input (by switching on $j$) and (2) protecting the input (by preventing $j$ from being switched off by irrelevant later inputs). This conflict (illustrated by the numerical sketch following this list) makes learning difficult, and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.
2. Output weight conflict: assume $j$ is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight $w_{kj}$. The same $w_{kj}$ has to be used for both retrieving $j$'s content at certain times and preventing $j$ from disturbing $k$ at other times. As long as unit $j$ is non-zero, $w_{kj}$ will attract conflicting weight update signals generated during sequence processing: these signals will attempt to make $w_{kj}$ participate in (1) accessing the information stored in $j$ and, at different times, (2) protecting unit $k$ from being perturbed by $j$. For instance, with many tasks there are certain "short time lag errors" that can be reduced in early training stages. However, at later training stages $j$ may suddenly start to cause avoidable errors in situations that already seemed under control, by attempting to participate in reducing more difficult "long time lag errors". Again, this conflict makes learning difficult, and calls for a more context-sensitive mechanism for controlling "read operations" through output weights.
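To make the input weight conflict concrete, here is a toy sketch (in Python, not from the paper; the "store" and "ignore" patterns, the squared error, and the specific numbers are illustrative assumptions). Unit $j$ is linear with $w_{jj} = 1$, so its final activation is $w_{ji}$ times the sum of the inputs it has seen; a pattern that requires storing a relevant input and a pattern that requires ignoring a later irrelevant input then push $w_{ji}$ in opposite directions.

```python
# Toy illustration of the input weight conflict for a single linear unit j
# with fixed self-connection w_jj = 1 and one incoming weight w_ji.
# Final activation: y_j(T) = w_ji * sum_t x(t); error: 0.5 * (y_j(T) - target)^2.

def grad_w_ji(inputs, target, w_ji):
    """Gradient of the squared error at the end of the sequence w.r.t. w_ji."""
    s = sum(inputs)          # a linear unit with w_jj = 1 simply accumulates its input
    y_T = w_ji * s
    return (y_T - target) * s

w_ji = 0.8
# Pattern A ("store"): a relevant input arrives at t = 1; the unit should hold 1.0.
g_store = grad_w_ji([1.0, 0.0, 0.0], target=1.0, w_ji=w_ji)
# Pattern B ("ignore"): the same relevant input arrives, followed by an
# irrelevant input that should NOT change the stored value.
g_ignore = grad_w_ji([1.0, 1.0, 0.0], target=1.0, w_ji=w_ji)

print(g_store, g_ignore)     # -0.2 and +1.2: conflicting update signals for w_ji
```

Pattern A is minimized at $w_{ji} = 1$, pattern B at $w_{ji} = 0.5$, so no single value of $w_{ji}$ satisfies both, and gradient descent keeps receiving opposing signals. The output weight conflict is the mirror image of this situation on the outgoing weight $w_{kj}$.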
Of course, input and output weight conflicts are not specific to long time lags; they occur for short time lags as well. Their effects, however, become particularly pronounced in the long time lag case: as the time lag increases, (1) stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, (2) more and more already correct outputs also require protection against perturbation.

Due to the problems above, the naive approach does not work well except in the case of certain simple problems involving local input/output representations and non-repeating input patterns (see Hochreiter 1991 and Silva et al. 1996). The next section shows how to do it right.
4 LONG SHORT-TERM MEMORY
Memory cells and gate units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the constant error carrousel CEC embodied by the self-connected, linear unit $j$ from Section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in $j$ from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit is introduced which protects other units from perturbation by currently irrelevant memory contents stored in $j$.
The resulting, more complex unit is called a memory cell (see Figure 1). The $j$-th memory cell is denoted $c_j$. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to $net_{c_j}$, $c_j$ gets input from a multiplicative unit $out_j$ (the "output gate"), and from another multiplicative unit $in_j$ (the "input gate"). $in_j$'s activation at time $t$ is denoted by $y^{in_j}(t)$, $out_j$'s by $y^{out_j}(t)$. We have
$$y^{out_j}(t) = f_{out_j}(net_{out_j}(t)); \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t));$$