recurrent matrix as it is back-propagated through time. In this
case, when the eigenvalues of the recurrent matrix become less
than one, the gradient converges to zero rapidly. This happens
normally after 5∼10 steps of back-propagation [6].
In training RNNs on long sequences (e.g., 100 timesteps),
the gradients shrink when the weights are small. The product
of a set of real numbers can shrink to zero or explode to
infinity. The same analogy holds for matrices, but the
shrinkage/explosion happens along certain directions.
In [19], it is shown that, taking ρ as the spectral radius of
the recurrent weight matrix W_HH, it is necessary that ρ > 1
for the long-term components to explode as t → ∞. Using
singular values, this result can be generalized to the non-linear
function f′_H(·) in Eq. (1) by bounding it with γ ∈ ℝ such that

||diag(f′_H(h_k))|| ≤ γ. (14)
Using Eq. (13), the Jacobian matrix ∂h_{k+1}/∂h_k, and the
bound in Eq. (14), we have

||∂h_{k+1}/∂h_k|| ≤ ||W_HH^T|| · ||diag(f′_H(h_k))|| ≤ 1. (15)
We can take ||∂h_{k+1}/∂h_k|| ≤ δ < 1 for some δ ∈ ℝ at each
step k. Continuing this over different timesteps and adding the
loss function component gives

||(∂L_t/∂h_t) ∏_{i=k}^{t−1} (∂h_{i+1}/∂h_i)|| ≤ δ^{t−k} ||∂L_t/∂h_t||. (16)
This equation shows that as t − k grows, the long-term
dependency terms move toward zero and the vanishing gradient
problem appears. Finally, the sufficient condition for the
vanishing gradient problem to appear is that the largest
singular value λ_1 of the recurrent weight matrix W_HH
satisfies λ_1 < 1/γ [19].
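To make the decay in Eq. (16) concrete, the following toy snippet (a sketch; the per-step bound δ = 0.9 is an assumed value, not from the paper) prints how quickly δ^{t−k} shrinks as the time gap grows:

```python
# Toy illustration of Eq. (16): a per-step Jacobian norm bound
# delta < 1 compounds multiplicatively over the gap t - k.
delta = 0.9  # assumed bound ||dh_{i+1}/dh_i|| <= delta
for gap in [1, 10, 50, 100]:
    print(f"t - k = {gap:3d}: delta**gap = {delta ** gap:.2e}")
# At t - k = 100 the factor is ~2.7e-05, so long-term gradient
# components effectively vanish.
```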
3) Exploding Gradient Problem: One of the major prob-
lems in training RNNs using BPTT is the exploding gradient
problem [4]. Gradients in training RNNs on long sequences
may explode as the weights become larger and the norm of
the gradient increases greatly during training. As stated in
[19], the necessary condition for this situation to happen is
λ_1 > 1/γ.
Many methods have been proposed recently to overcome the
exploding gradient problem. In 2012, Mikolov proposed a
gradient norm-clipping method to avoid exploding gradients
when training RNNs with simple tools such as BPTT and SGD
on large datasets [23], [24]. In a similar approach, Pascanu
proposed a method close to Mikolov's, introducing a hyper-
parameter as the threshold for norm-clipping the gradients [19].
This parameter can be set by heuristics; however, the training
procedure is not very sensitive to it and behaves well for rather
small thresholds.
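As a minimal sketch of this rule (the function and variable names are illustrative, not taken from [19] or [23]), clipping simply rescales the gradient whenever its norm exceeds the threshold:

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale grad so its L2 norm does not exceed threshold.

    A sketch of the norm-clipping heuristic discussed above;
    threshold is the hyper-parameter set by heuristics.
    """
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)  # keep direction, cap magnitude
    return grad
```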
4) Stochastic Gradient Descent: SGD (also called online
GD) is a generalization of GD that is widely used in machine
learning applications [12]. SGD is robust, scalable, and
performs well across many different domains, ranging from
smooth and strongly convex problems to complex non-convex
objectives.
Fig. 4: The classical momentum (a) and the Nesterov accelerated gradient (b) schemes.
In contrast to the redundant computations of GD, SGD
performs one update at a time [25]. For an input-target pair
{x_k, z}, where k ∈ {1, ..., U}, the parameters θ are updated as

θ_{t+1} = θ_t − λ ∂L_k/∂θ. (17)
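A minimal sketch of this update loop, assuming a helper grad_fn(theta, x, z) that returns ∂L_k/∂θ for a single pair (all names here are illustrative):

```python
import random

def sgd(theta, samples, grad_fn, lr=0.01, epochs=1):
    """Plain SGD (Eq. 17): one parameter update per input-target pair."""
    for _ in range(epochs):
        random.shuffle(samples)          # visit the U pairs in random order
        for x, z in samples:
            theta = theta - lr * grad_fn(theta, x, z)  # Eq. (17)
    return theta
```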
Such frequent updates cause fluctuations in the loss function
outputs, which help SGD explore the problem landscape with
higher diversity, with the hope of finding better local minima.
An adaptive learning rate can control the convergence of SGD:
as the learning rate decreases, exploration decreases and
exploitation increases, leading to faster convergence to a local
minimum. A classical technique to accelerate SGD is to use
momentum, which accumulates a velocity vector in directions
of persistent reduction of the objective across iterations [26].
The classical version of momentum applies to the loss function
L at time t with a set of parameters θ as

v_{t+1} = µv_t − λ∇L(θ_t) (18)
where ∇L(·) is the gradient of the loss function and µ ∈ [0, 1]
is the momentum coefficient [9], [12]. As Figure 4a shows, the
parameters θ are updated as

θ_{t+1} = θ_t + v_{t+1}. (19)
Taking R as the condition number of the curvature at the
minimum, momentum can considerably accelerate convergence
to a local minimum, requiring √R times fewer iterations than
steepest descent to reach the same level of accuracy [26]. In
this case, it is suggested to set the momentum coefficient to
µ = (√R − 1)/(√R + 1) [26].
The Nesterov accelerated gradient (NAG) is a first-order
optimization method that provides a more efficient convergence
rate than GD for particular situations (e.g., convex functions
with deterministic gradients) [27]. The main difference between
NAG and GD is in the update rule for the velocity vector v,
presented in Figure 4b and defined as

v_{t+1} = µv_t − λ∇L(θ_t + µv_t) (20)

where the parameters θ are updated using Eq. (19). By
reasonable fine-tuning of the momentum coefficient µ, it is
possible to increase the optimization performance [9].
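The NAG step differs from classical momentum only in where the gradient is evaluated; a sketch under the same assumptions as above:

```python
def nag_step(theta, v, grad_fn, lr, mu):
    """Nesterov accelerated gradient (Figure 4b): Eq. (20) + Eq. (19)."""
    v = mu * v - lr * grad_fn(theta + mu * v)  # Eq. (20): look-ahead gradient
    theta = theta + v                          # Eq. (19)
    return theta, v
```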
5) Mini-Batch Gradient Descent: Mini-batch GD computes
the gradient over a batch of training data containing more than
one training sample. The typical mini-batch size is
50 ≤ b ≤ 256, but it can vary for different applications. Feeding