ICLR 2016 Workshop: An In-Depth Reading and Visualization of Long Short-Term Memory Networks
In deep learning, recurrent neural networks (RNNs), and especially their LSTM (Long Short-Term Memory) variant, have drawn wide attention for their strong performance on sequential data. Yet despite the LSTM's excellent results in practice, gaps remain in our understanding of its internal workings and of why it can outperform traditional n-gram models (such as n-gram language models). This ICLR 2016 Workshop-track paper, by Andrej Karpathy, Justin Johnson, and Li Fei-Fei, explores the interpretability of RNNs, and of LSTMs in particular, by analyzing character-level language models.

Through experiments, the authors reveal an important finding: LSTMs contain interpretable cells that track long-range dependencies, such as line lengths, quotes, and brackets in text. This shows that the LSTM does not rely only on local context but can capture and process more complex sequential patterns, overcoming the n-gram model's limitations on long-distance structural relationships.

In addition, the paper's comparative analysis dissects the source of the LSTM's advantage over n-gram models, emphasizing its superior ability to capture and exploit long-range structural dependencies. This offers a new perspective on how RNNs solve sequence problems and gives the deep learning community an important tool for understanding how RNNs work internally.

The study provides key insights into RNN interpretability: it demonstrates the LSTM's practical power and supplies a theoretical basis for improving model design and transparency. For engineers who want a deeper understanding of, and better results from, deep learning models, it is a valuable reference.
Workshop track - ICLR 2016
[In addition to the hidden vector, LSTMs] maintain a memory vector $c^l_t$. At each time step the LSTM can choose to read from, write to, or reset the cell using explicit gating mechanisms. The precise form of the update is as follows:

$$
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
W^l
\begin{pmatrix} h^{l-1}_t \\ h^l_{t-1} \end{pmatrix}
$$

$$
c^l_t = f \odot c^l_{t-1} + i \odot g
$$

$$
h^l_t = o \odot \tanh(c^l_t)
$$

where $\odot$ denotes element-wise multiplication.
Here, the sigmoid function sigm and tanh are applied element-wise, and $W^l$ is a $[4n \times 2n]$ matrix. The three vectors $i, f, o \in \mathbb{R}^n$ are thought of as binary gates that control whether each memory cell is updated, whether it is reset to zero, and whether its local state is revealed in the hidden vector, respectively. The activations of these gates are based on the sigmoid function and hence allowed to range smoothly between zero and one to keep the model differentiable. The vector $g \in \mathbb{R}^n$ ranges between -1 and 1 and is used to additively modify the memory contents. This additive interaction is a critical feature of the LSTM's design, because during backpropagation a sum operation merely distributes gradients. This allows gradients on the memory cells $c$ to flow backwards through time uninterrupted for long time periods, or at least until the flow is disrupted with the multiplicative interaction of an active forget gate. Lastly, note that an implementation of the LSTM requires one to maintain two vectors ($h^l_t$ and $c^l_t$) at every point in the network.
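To make the gating concrete, here is a minimal NumPy sketch of a single LSTM time step following the equations above. It is an illustration, not the authors' implementation: the layout of `W`, the omission of bias terms, and all names are assumptions made here for clarity.

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step for layer l (biases omitted for clarity).

    x      -- h^{l-1}_t, input from the layer below, shape (n,)
    h_prev -- h^l_{t-1}, previous hidden state, shape (n,)
    c_prev -- c^l_{t-1}, previous memory cell, shape (n,)
    W      -- [4n x 2n] weight matrix
    """
    n = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev])   # shape (4n,)
    i = sigm(a[0*n:1*n])                  # input gate
    f = sigm(a[1*n:2*n])                  # forget gate
    o = sigm(a[2*n:3*n])                  # output gate
    g = np.tanh(a[3*n:4*n])               # candidate update, in (-1, 1)
    c = f * c_prev + i * g                # additive memory update
    h = o * np.tanh(c)                    # reveal (part of) the cell state
    return h, c
```

Note that the additive memory update `c = f * c_prev + i * g` is the line that lets gradients flow through the cell: backpropagation through a sum distributes gradients unchanged, and only the multiplication by the forget gate `f` can attenuate them.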
Gated Recurrent Unit (GRU). The GRU of Cho et al. (2014) was recently proposed as a simpler alternative to the LSTM and takes the form:
$$
\begin{pmatrix} r \\ z \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \end{pmatrix}
W^l_r
\begin{pmatrix} h^{l-1}_t \\ h^l_{t-1} \end{pmatrix}
$$

$$
\tilde{h}^l_t = \tanh\!\left(W^l_x h^{l-1}_t + W^l_g (r \odot h^l_{t-1})\right)
$$

$$
h^l_t = (1 - z) \odot h^l_{t-1} + z \odot \tilde{h}^l_t
$$
Here, $W^l_r$ is $[2n \times 2n]$, and $W^l_g$ and $W^l_x$ are $[n \times n]$. The GRU has the interpretation of computing a candidate hidden vector $\tilde{h}^l_t$ and then smoothly interpolating towards it, gated by $z$.
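A matching sketch of one GRU step, under the same caveats (matrix shapes follow the text above; biases are omitted and names are illustrative):

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid, as in the LSTM sketch."""
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_r, W_x, W_g):
    """One GRU step for layer l.

    x        -- h^{l-1}_t, input from the layer below, shape (n,)
    h_prev   -- h^l_{t-1}, previous hidden state, shape (n,)
    W_r      -- [2n x 2n] matrix producing the r and z gates
    W_x, W_g -- [n x n] matrices for the candidate state
    """
    n = h_prev.shape[0]
    a = W_r @ np.concatenate([x, h_prev])            # shape (2n,)
    r = sigm(a[:n])                                  # reset gate
    z = sigm(a[n:])                                  # update (interpolation) gate
    h_tilde = np.tanh(W_x @ x + W_g @ (r * h_prev))  # candidate hidden state
    h = (1.0 - z) * h_prev + z * h_tilde             # interpolation gated by z
    return h
```

The last line is the interpolation described in the text: where a component of `z` is near 0 the previous hidden state is carried over unchanged, and where it is near 1 the state is replaced by the candidate $\tilde{h}^l_t$.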
3.2 CHARACTER-LEVEL LANGUAGE MODELING
We use character-level language modeling as an interpretable testbed for sequence learning. In this setting, the input to the network is a sequence of characters and the network is trained to predict the next character in the sequence with a Softmax classifier at each time step. Concretely, assuming a fixed vocabulary of K characters we encode all characters with K-dimensional 1-of-K vectors $\{x_t\}, t = 1, \ldots, T$, and feed these to the recurrent network to obtain a sequence of D-dimensional hidden vectors at the last layer of the network $\{h^L_t\}, t = 1, \ldots, T$. To obtain predictions for the next character in the sequence we project this top layer of activations to a sequence of vectors $\{y_t\}$, where $y_t = W_y h^L_t$ and $W_y$ is a $[K \times D]$ parameter matrix. These vectors are interpreted as holding the (unnormalized) log probability of the next character in the sequence, and the objective is to minimize the average cross-entropy loss over all targets.
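As a sketch of this setup (with a toy vocabulary; the recurrent layers that produce $h^L_t$ are treated as a black box, and all names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

# Toy character vocabulary; in the paper K is the number of distinct characters.
vocab = sorted(set("hello world"))
K = len(vocab)
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """1-of-K encoding x_t for one character."""
    x = np.zeros(K)
    x[char_to_ix[ch]] = 1.0
    return x

def next_char_loss(h_top, target_ch, W_y):
    """Cross-entropy loss at one time step.

    h_top     -- h^L_t, top-layer hidden vector, shape (D,)
    target_ch -- the actual next character in the sequence
    W_y       -- [K x D] projection matrix
    """
    y = W_y @ h_top                  # unnormalized log probabilities
    p = np.exp(y - y.max())          # numerically stable softmax
    p /= p.sum()
    return -np.log(p[char_to_ix[target_ch]])

# Usage with a hypothetical D = 4:
rng = np.random.default_rng(0)
loss = next_char_loss(rng.standard_normal(4), "e", rng.standard_normal((K, 4)))
```

The training objective is the average of this per-step loss over all time steps and sequences.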
3.3 OPTIMIZATION
Following previous work of Sutskever et al. (2014) we initialize all parameters uniformly in range
[−0.08, 0.08]. We use mini-batch stochastic gradient descent with batch size 100 and RMSProp
(Dauphin et al. (2015)) per-parameter adaptive update with base learning rate $2 \times 10^{-3}$ and decay
0.95. These settings work robustly with all of our models. The network is unrolled for 100 time
steps. We train each model for 50 epochs and decay the learning rate after 10 epochs by multiplying
it with a factor of 0.95 after each additional epoch. We use early stopping based on validation
performance and cross-validate the amount of dropout for each model individually.
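A minimal sketch of the per-parameter RMSProp update with the stated hyperparameters; the small epsilon added for numerical stability is a standard assumption, not something specified in the text:

```python
import numpy as np

def rmsprop_update(param, grad, cache, lr=2e-3, decay=0.95, eps=1e-8):
    """One RMSProp step: scale each gradient by a running RMS of its history.

    cache -- running average of squared gradients, same shape as param.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```

Because the scaling is element-wise, each parameter gets its own effective learning rate, which is what "per-parameter adaptive update" refers to.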
4 EXPERIMENTS
Datasets. Two datasets previously used in the context of character-level language models are the Penn Treebank dataset of Marcus et al. (1993) and the Hutter Prize 100MB of Wikipedia dataset of Hutter (2012). However, both datasets contain a mix of common language and special markup. Our goal is not to compete with previous work but rather to study recurrent networks in a controlled setting and at both ends of the spectrum of degree of structure. Therefore, we chose to use Leo Tolstoy's War and Peace (WP) novel, which consists of 3,258,246 characters of almost entirely English text with minimal markup, and at the other end of the spectrum the source code of the Linux Kernel (LK). We shuffled all header and source files randomly and concatenated them into a single file to form the 6,206,996 character long dataset. We split the data into train/val/test splits as 80/10/10 for WP and 90/5/5 for LK. Therefore, there are approximately 300,000 characters in the validation and test splits in each case.
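For concreteness, a contiguous split in the stated proportions could be produced as follows; whether the paper splits the character stream contiguously is not stated, so this is only an assumption:

```python
def split_dataset(text, fractions=(0.8, 0.1, 0.1)):
    """Split a character string into contiguous train/val/test spans."""
    n = len(text)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return text[:n_train], text[n_train:n_train + n_val], text[n_train + n_val:]

# War and Peace: 80/10/10; Linux Kernel source: 90/5/5.
# train, val, test = split_dataset(wp_text, (0.8, 0.1, 0.1))
```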