on a closed and bounded subset of R^n, and any function mapping from any finite dimensional
discrete space to another. This may suggest there is no reason to go beyond MLP1 to more
complex architectures. However, the theoretical result does not state how large the hidden
layer should be, nor does it say anything about the learnability of the neural network (it
states that a representation exists, but does not say how easy or hard it is to set the
parameters based on training data and a specific learning algorithm). It also does not
guarantee that a training algorithm will find the correct function generating our training
data. Since in practice we train neural networks on relatively small amounts of data, using
a combination of the backpropagation algorithm and variants of stochastic gradient descent,
and use hidden layers of relatively modest sizes (up to several thousand units), there is benefit
to be had in trying out more complex architectures than MLP1. In many cases, however,
MLP1 does indeed provide very strong results. For further discussion on the representation
power of feed-forward neural networks, see (Bengio et al., 2015, Section 6.5).
4.2 Common Non-linearities
The non-linearity g can take many forms. There is currently no good theory as to which
non-linearity to apply in which conditions, and choosing the correct non-linearity for a
given task is for the most part an empirical question. I will now go over the common non-linearities from the literature: the sigmoid, tanh, hard tanh and the rectified linear unit (ReLU); a short code sketch of these activations follows their definitions below. Some NLP researchers have also experimented with other forms of non-linearities, such as the cube and tanh-cube activations.
Sigmoid The sigmoid activation function σ(x) = 1/(1 + e^{−x}) is an S-shaped function, transforming each value x into the range [0, 1].
Hyperbolic tangent (tanh) The hyperbolic tangent tanh(x) = (e^{2x} − 1)/(e^{2x} + 1) activation function is an S-shaped function, transforming the values x into the range [−1, 1].
Hard tanh The hard-tanh activation function is an approximation of the tanh function which is faster to compute and take derivatives of:

hardtanh(x) =
    −1    if x < −1
     1    if x > 1
     x    otherwise
Rectifier (ReLU) The Rectifier activation function (Glorot, Bordes, & Bengio, 2011), also known as the rectified linear unit, is a very simple activation function that is easy to work with and was shown many times to produce excellent results.⁹ The ReLU unit clips each value x < 0 at 0, i.e., ReLU(x) = max(0, x). Despite its simplicity, it performs well for many tasks, especially
when combined with the dropout regularization technique (see Section 6.4).
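As a concrete reference, the four activations above are simple elementwise functions and can be implemented in a few lines. The following is a minimal sketch using NumPy (the function names here are illustrative and not part of any standard library; only the mathematical definitions come from the text above):

import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x}); squashes each value into the range [0, 1]
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x) = (e^{2x} - 1) / (e^{2x} + 1); squashes each value into the range [-1, 1]
    return np.tanh(x)

def hardtanh(x):
    # piecewise-linear approximation of tanh: -1 below -1, 1 above 1, identity in between
    return np.clip(x, -1.0, 1.0)

def relu(x):
    # rectified linear unit: clips each value x < 0 at 0
    return np.maximum(x, 0.0)

# Applying an activation elementwise to a vector of pre-activations:
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))   # elementwise: 0, 0, 0, 0.5, 2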
9. The technical advantages of the ReLU over the sigmoid and tanh activation functions are that it does not involve expensive-to-compute functions, and, more importantly, that it does not saturate. The sigmoid and tanh activations are capped at 1, and the gradients in this region of the functions are near zero, driving the entire gradient near zero. The ReLU activation does not have this problem, making it especially suitable for networks with multiple layers, which are susceptible to the vanishing gradients problem when trained with saturating units.
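To give a concrete sense of the saturation effect (a worked illustration using the standard derivative formulas): the sigmoid's derivative σ′(x) = σ(x)(1 − σ(x)) is at most 0.25 (at x = 0) and already about 0.0066 at x = 5, and tanh′(x) = 1 − tanh²(x) is about 0.00018 at x = 5; a product of several such factors across layers quickly drives the gradient toward zero. The ReLU derivative, in contrast, is exactly 1 for x > 0 (and 0 for x < 0), so active units pass the gradient through unchanged.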