ICLR 2016 Workshop: An In-Depth Reading and Visualization of Long Short-Term Memory Networks
In deep learning, recurrent neural networks (RNNs), and especially their LSTM (Long Short-Term Memory) variant, have drawn wide attention for their strong performance on sequential data. Yet despite the LSTM's excellent results in practice, gaps remain in our understanding of its internal workings and of why it can outperform traditional n-gram models (such as n-gram language models). This ICLR 2016 Workshop-track paper, by Andrej Karpathy, Justin Johnson, and Li Fei-Fei, explores the interpretability of RNNs, and of LSTMs in particular, by analyzing character-level language models.

Through experiments, the authors reveal an important finding: LSTMs contain interpretable cells that track long-range dependencies, such as line lengths, quotes, and brackets in text. This shows that the LSTM does not rely only on local context but can capture and process more complex sequential patterns, overcoming the n-gram model's limitations on long-distance structural relationships.

In addition, the paper's comparative analysis dissects the source of the LSTM's advantage over n-gram models, emphasizing its superior ability to capture and exploit long-range structural dependencies. This offers a new perspective on how RNNs solve sequence problems and gives the deep learning community an important tool for understanding how RNNs work internally.

The study provides key insights into RNN interpretability: it demonstrates the LSTM's practical power and supplies a theoretical basis for improving model design and transparency. For engineers who want a deeper understanding of, and better results from, deep learning models, it is a valuable reference.
Workshop track - ICLR 2016
[In addition to the hidden vector, LSTMs] maintain a memory vector $c^l_t$. At each time step the LSTM can choose to read from, write to, or reset the cell using explicit gating mechanisms. The precise form of the update is as follows:

$$
\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix}
W^l
\begin{pmatrix} h^{l-1}_t \\ h^l_{t-1} \end{pmatrix}
$$

$$
c^l_t = f \odot c^l_{t-1} + i \odot g
$$

$$
h^l_t = o \odot \tanh(c^l_t)
$$

where $\odot$ denotes element-wise multiplication.
Here, the sigmoid function sigm and tanh are applied element-wise, and $W^l$ is a $[4n \times 2n]$ matrix. The three vectors $i, f, o \in \mathbb{R}^n$ are thought of as binary gates that control whether each memory cell is updated, whether it is reset to zero, and whether its local state is revealed in the hidden vector, respectively. The activations of these gates are based on the sigmoid function and hence allowed to range smoothly between zero and one to keep the model differentiable. The vector $g \in \mathbb{R}^n$ ranges between -1 and 1 and is used to additively modify the memory contents. This additive interaction is a critical feature of the LSTM's design, because during backpropagation a sum operation merely distributes gradients. This allows gradients on the memory cells $c$ to flow backwards through time uninterrupted for long time periods, or at least until the flow is disrupted with the multiplicative interaction of an active forget gate. Lastly, note that an implementation of the LSTM requires one to maintain two vectors ($h^l_t$ and $c^l_t$) at every point in the network.
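To make the gating concrete, here is a minimal NumPy sketch of a single LSTM time step following the equations above. It is an illustration, not the authors' implementation: the layout of `W`, the omission of bias terms, and all names are assumptions made here for clarity.

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM step for layer l (biases omitted for clarity).

    x      -- h^{l-1}_t, input from the layer below, shape (n,)
    h_prev -- h^l_{t-1}, previous hidden state, shape (n,)
    c_prev -- c^l_{t-1}, previous memory cell, shape (n,)
    W      -- [4n x 2n] weight matrix
    """
    n = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev])   # shape (4n,)
    i = sigm(a[0*n:1*n])                  # input gate
    f = sigm(a[1*n:2*n])                  # forget gate
    o = sigm(a[2*n:3*n])                  # output gate
    g = np.tanh(a[3*n:4*n])               # candidate update, in (-1, 1)
    c = f * c_prev + i * g                # additive memory update
    h = o * np.tanh(c)                    # reveal (part of) the cell state
    return h, c
```

Note that the additive memory update `c = f * c_prev + i * g` is the line that lets gradients flow through the cell: backpropagation through a sum distributes gradients unchanged, and only the multiplication by the forget gate `f` can attenuate them.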
Gated Recurrent Unit (GRU). The GRU of Cho et al. (2014) was recently proposed as a simpler alternative to the LSTM and takes the form:
$$
\begin{pmatrix} r \\ z \end{pmatrix}
=
\begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \end{pmatrix}
W^l_r
\begin{pmatrix} h^{l-1}_t \\ h^l_{t-1} \end{pmatrix}
$$

$$
\tilde{h}^l_t = \tanh\!\left(W^l_x h^{l-1}_t + W^l_g (r \odot h^l_{t-1})\right)
$$

$$
h^l_t = (1 - z) \odot h^l_{t-1} + z \odot \tilde{h}^l_t
$$
Here, $W^l_r$ is $[2n \times 2n]$, and $W^l_g$ and $W^l_x$ are $[n \times n]$. The GRU has the interpretation of computing a candidate hidden vector $\tilde{h}^l_t$ and then smoothly interpolating towards it, gated by $z$.
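A matching sketch of one GRU step, under the same caveats (matrix shapes follow the text above; biases are omitted and names are illustrative):

```python
import numpy as np

def sigm(x):
    """Element-wise logistic sigmoid, as in the LSTM sketch."""
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W_r, W_x, W_g):
    """One GRU step for layer l.

    x        -- h^{l-1}_t, input from the layer below, shape (n,)
    h_prev   -- h^l_{t-1}, previous hidden state, shape (n,)
    W_r      -- [2n x 2n] matrix producing the r and z gates
    W_x, W_g -- [n x n] matrices for the candidate state
    """
    n = h_prev.shape[0]
    a = W_r @ np.concatenate([x, h_prev])            # shape (2n,)
    r = sigm(a[:n])                                  # reset gate
    z = sigm(a[n:])                                  # update (interpolation) gate
    h_tilde = np.tanh(W_x @ x + W_g @ (r * h_prev))  # candidate hidden state
    h = (1.0 - z) * h_prev + z * h_tilde             # interpolation gated by z
    return h
```

The last line is the interpolation described in the text: where a component of `z` is near 0 the previous hidden state is carried over unchanged, and where it is near 1 the state is replaced by the candidate $\tilde{h}^l_t$.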
3.2 CHARACTER-LEVEL LANGUAGE MODELING
We use character-level language modeling as an interpretable testbed for sequence learning. In this setting, the input to the network is a sequence of characters and the network is trained to predict the next character in the sequence with a Softmax classifier at each time step. Concretely, assuming a fixed vocabulary of K characters we encode all characters with K-dimensional 1-of-K vectors $\{x_t\}, t = 1, \ldots, T$, and feed these to the recurrent network to obtain a sequence of D-dimensional hidden vectors at the last layer of the network $\{h^L_t\}, t = 1, \ldots, T$. To obtain predictions for the next character in the sequence we project this top layer of activations to a sequence of vectors $\{y_t\}$, where $y_t = W_y h^L_t$ and $W_y$ is a $[K \times D]$ parameter matrix. These vectors are interpreted as holding the (unnormalized) log probability of the next character in the sequence, and the objective is to minimize the average cross-entropy loss over all targets.
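As a sketch of this setup (with a toy vocabulary; the recurrent layers that produce $h^L_t$ are treated as a black box, and all names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

# Toy character vocabulary; in the paper K is the number of distinct characters.
vocab = sorted(set("hello world"))
K = len(vocab)
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """1-of-K encoding x_t for one character."""
    x = np.zeros(K)
    x[char_to_ix[ch]] = 1.0
    return x

def next_char_loss(h_top, target_ch, W_y):
    """Cross-entropy loss at one time step.

    h_top     -- h^L_t, top-layer hidden vector, shape (D,)
    target_ch -- the actual next character in the sequence
    W_y       -- [K x D] projection matrix
    """
    y = W_y @ h_top                  # unnormalized log probabilities
    p = np.exp(y - y.max())          # numerically stable softmax
    p /= p.sum()
    return -np.log(p[char_to_ix[target_ch]])

# Usage with a hypothetical D = 4:
rng = np.random.default_rng(0)
loss = next_char_loss(rng.standard_normal(4), "e", rng.standard_normal((K, 4)))
```

The training objective is the average of this per-step loss over all time steps and sequences.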
3.3 OPTIMIZATION
Following previous work of Sutskever et al. (2014) we initialize all parameters uniformly in range
[−0.08, 0.08]. We use mini-batch stochastic gradient descent with batch size 100 and RMSProp
(Dauphin et al. (2015)) per-parameter adaptive update with base learning rate $2 \times 10^{-3}$ and decay
0.95. These settings work robustly with all of our models. The network is unrolled for 100 time
steps. We train each model for 50 epochs and decay the learning rate after 10 epochs by multiplying
it with a factor of 0.95 after each additional epoch. We use early stopping based on validation
performance and cross-validate the amount of dropout for each model individually.
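A minimal sketch of the per-parameter RMSProp update with the stated hyperparameters; the small epsilon added for numerical stability is a standard assumption, not something specified in the text:

```python
import numpy as np

def rmsprop_update(param, grad, cache, lr=2e-3, decay=0.95, eps=1e-8):
    """One RMSProp step: scale each gradient by a running RMS of its history.

    cache -- running average of squared gradients, same shape as param.
    """
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```

Because the scaling is element-wise, each parameter gets its own effective learning rate, which is what "per-parameter adaptive update" refers to.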
4 EXPERIMENTS
Datasets. Two datasets previously used in the context of character-level language models are the Penn Treebank dataset of Marcus et al. (1993) and the Hutter Prize 100MB of Wikipedia dataset of Hutter (2012). However, both datasets contain a mix of common language and special markup. Our goal is not to compete with previous work but rather to study recurrent networks in a controlled setting and at both ends of the spectrum of degree of structure. Therefore, we chose to use Leo Tolstoy's War and Peace (WP) novel, which consists of 3,258,246 characters of almost entirely English text with minimal markup, and at the other end of the spectrum the source code of the Linux Kernel (LK). We shuffled all header and source files randomly and concatenated them into a single file to form the 6,206,996 character long dataset. We split the data into train/val/test splits as 80/10/10 for WP and 90/5/5 for LK. Therefore, there are approximately 300,000 characters in the validation and test splits in each case.
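For concreteness, a contiguous split in the stated proportions could be produced as follows; whether the paper splits the character stream contiguously is not stated, so this is only an assumption:

```python
def split_dataset(text, fractions=(0.8, 0.1, 0.1)):
    """Split a character string into contiguous train/val/test spans."""
    n = len(text)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return text[:n_train], text[n_train:n_train + n_val], text[n_train + n_val:]

# War and Peace: 80/10/10; Linux Kernel source: 90/5/5.
# train, val, test = split_dataset(wp_text, (0.8, 0.1, 0.1))
```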