Learning Recurrent Neural Networks with Hessian-Free Optimization
James Martens JMARTENS@CS.TORONTO.EDU
Ilya Sutskever ILYA@CS.UTORONTO.CA
University of Toronto, Canada
Abstract
In this work we resolve the long-outstanding
problem of how to effectively train recurrent neu-
ral networks (RNNs) on complex and difficult
sequence modeling problems which may con-
tain long-term data dependencies. Utilizing re-
cent advances in the Hessian-free optimization
approach (Martens, 2010), together with a novel
damping scheme, we successfully train RNNs on
two sets of challenging problems: first, on a col-
lection of pathological synthetic datasets which
are known to be impossible for standard op-
timization approaches (due to their extremely
long-term dependencies), and second, on three
natural and highly complex real-world sequence
datasets where we find that our method sig-
nificantly outperforms the previous state-of-the-
art method for training neural sequence mod-
els: the Long Short-term Memory approach of
Hochreiter and Schmidhuber (1997). Addition-
ally, we offer a new interpretation of the gen-
eralized Gauss-Newton matrix of Schraudolph
(2002) which is used within the HF approach of
Martens.
1. Introduction
A Recurrent Neural Network (RNN) is a neural network
that operates in time. At each timestep, it accepts an in-
put vector, updates its (possibly high-dimensional) hid-
den state via non-linear activation functions, and uses it
to make a prediction of its output. RNNs form a rich
model class because their hidden state can store informa-
tion as high-dimensional distributed representations (as op-
posed to a Hidden Markov Model, whose hidden state is es-
sentially log n-dimensional) and their nonlinear dynamics
can implement rich and powerful computations, allowing
the RNN to perform modeling and prediction tasks for se-
quences with highly complex structure.
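As a concrete reference for the discussion that follows (our own notation, and not necessarily the exact parameterization used later in the paper), a standard formulation of these dynamics is
\[
h_t = \tanh\!\left(W_{hx} x_t + W_{hh} h_{t-1} + b_h\right), \qquad \hat{y}_t = W_{yh} h_t + b_y,
\]
where $x_t$ is the input at timestep $t$, $h_t$ is the hidden state, $\hat{y}_t$ is the prediction, and the weight matrices $W_{hx}$, $W_{hh}$, $W_{yh}$ together with the biases $b_h$, $b_y$ are the learned parameters.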
Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).
Figure 1. The architecture of a recurrent neural network.
Gradient-based training of RNNs might appear straight-
forward because, unlike many rich probabilistic sequence
models (Murphy, 2002), the exact gradients can be cheaply
computed by the Backpropagation Through Time (BPTT)
algorithm (Rumelhart et al., 1986). Unfortunately, gradi-
ent descent and other 1st-order methods completely fail to
properly train RNNs on large families of seemingly sim-
ple yet pathological synthetic problems that separate a tar-
get output from its relevant input by many time steps
(Bengio et al., 1994; Hochreiter and Schmidhuber, 1997).
In fact, 1st-order approaches struggle even when the sep-
aration is only 10 timesteps (Bengio et al., 1994). An un-
fortunate consequence of these failures is that these highly
expressive and potentially very powerful time-series mod-
els are seldom used in practice.
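As a concrete illustration of this gradient computation (a minimal numpy sketch, not the authors' implementation; the tanh/linear/squared-error choices, shapes, and variable names are our own assumptions), BPTT runs the network forward once and then propagates error signals backward through the same unrolled computation:

import numpy as np

def bptt_grads(params, xs, ys):
    """Exact loss gradients for one sequence via Backpropagation Through Time."""
    W_hx, W_hh, W_yh, b_h, b_y = params
    T, h_dim = len(xs), W_hh.shape[0]

    # Forward pass: store the hidden states so the backward pass can reuse them.
    hs, preds = [np.zeros(h_dim)], []
    for t in range(T):
        hs.append(np.tanh(W_hx @ xs[t] + W_hh @ hs[-1] + b_h))
        preds.append(W_yh @ hs[-1] + b_y)

    # Backward pass: accumulate parameter gradients while carrying dL/dh back in time.
    grads = [np.zeros_like(p) for p in params]
    dW_hx, dW_hh, dW_yh, db_h, db_y = grads
    dh_next = np.zeros(h_dim)
    for t in reversed(range(T)):
        dy = preds[t] - ys[t]                 # gradient of 0.5*||pred - y||^2 w.r.t. the output
        dW_yh += np.outer(dy, hs[t + 1])
        db_y += dy
        dh = W_yh.T @ dy + dh_next            # contributions from the output and from the future
        dz = (1.0 - hs[t + 1] ** 2) * dh      # back through the tanh nonlinearity
        dW_hx += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, hs[t])
        db_h += dz
        dh_next = W_hh.T @ dz                 # error signal sent one timestep further back
    return grads

The line dh_next = W_hh.T @ dz is the repeated multiplication responsible for the vanishing/exploding behaviour discussed next, and the cost of the whole computation is linear in the sequence length.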
The extreme difficulty associated with training RNNs is
likely due to the highly volatile relationship between the
parameters and the hidden states. One way that this volatil-
ity manifests itself, which has a direct impact on the per-
formance of gradient-descent, is in the so-called “vanish-
ing/exploding gradients” phenomenon (Bengio et al., 1994;
Hochreiter, 1991), where the error-signals exhibit expo-
nential decay/growth as they are back-propagated through
time. In the case of decay, this leads to the long-term error
signals being effectively lost as they are overwhelmed by
un-decayed short-term signals, and in the case of exponen-
tial growth there is the opposite problem that the short-term
error signals are overwhelmed by the long-term ones.
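To see where this exponential behaviour comes from in the formulation sketched earlier (again our notation, and only a rough bound), apply the chain rule to the hidden-to-hidden Jacobians:
\[
\frac{\partial h_T}{\partial h_t} \;=\; \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}} \;=\; \prod_{k=t+1}^{T} \mathrm{diag}\!\left(1 - h_k^2\right) W_{hh},
\]
with the factors ordered from $k = T$ down to $k = t+1$. By submultiplicativity of the norm, $\|\partial h_T / \partial h_t\| \le \left(\|W_{hh}\| \, \max_k \|\mathrm{diag}(1 - h_k^2)\|\right)^{T-t}$, so the error signal carried across a separation of $T - t$ timesteps shrinks exponentially when this base is below 1 and can grow exponentially when it is above 1.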
During the 90’s there was intensive research by the ma-
chine learning community into identifying the source of
difficulty in training RNNs as well as proposing meth-
ods to address it. However, none of these methods be-
came widely adopted, and an analysis by Hochreiter and
Schmidhuber (1996) showed that they were often no bet-
ter than random guessing. In an attempt to sidestep the
difficulty of training RNNs on problems exhibiting long-