gradient is set to 1 since ∂L/∂L = 1, thus initializing the recursion. Manual or automatic differentiation
then only requires defining the partial derivative associated with each type of operation performed by any node of the graph. When implementing gradient descent algorithms with manual differentiation, the result tends to be verbose, brittle code that lacks modularity – all bad things in terms of software engineering. A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent. One can pre-define the operations of these objects (in a “forward propagation” or fprop method) and their partial derivatives (in a “backward propagation” or bprop method) and encapsulate these computations in an object that knows how to compute its output given its inputs, and how to compute the gradient with respect to its inputs given the gradient with respect to its output. This is the strategy adopted in the Theano library (http://deeplearning.net/software/theano/) with its Op objects (Bergstra et al., 2010), as well as in libraries such as Torch (http://www.torch.ch; Collobert et al., 2011b) and Lush (http://lush.sourceforge.net).
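To make the fprop/bprop strategy concrete, the following minimal Python sketch (illustrative only; it is not Theano's actual Op interface, and the class and method names are assumptions chosen to match the terminology above) shows a node that computes an elementwise product on the forward pass and back-propagates gradients to its inputs:

import numpy as np

class MultiplyNode:
    # Elementwise product node: output = a * b.

    def fprop(self, a, b):
        # Cache the inputs; they are needed to compute the partial derivatives.
        self.a, self.b = a, b
        return a * b

    def bprop(self, grad_output):
        # Given dL/d(output), return (dL/da, dL/db) via the chain rule:
        # d(a*b)/da = b and d(a*b)/db = a.
        return grad_output * self.b, grad_output * self.a

# Usage: forward pass, then backward pass seeded with dL/dL = 1.
node = MultiplyNode()
out = node.fprop(np.array([2.0, 3.0]), np.array([4.0, 5.0]))
grad_a, grad_b = node.bprop(np.ones_like(out))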
Compared to Torch and Lush, Theano adds an interesting ingredient which makes it a full-fledged automatic differentiation tool: symbolic computation. The flow graph itself (without the numerical values attached) can be viewed as a symbolic representation (in a data structure) of a numerical computation. In Theano, the gradient computation is first performed symbolically, i.e., each Op object knows how to create other Ops corresponding to the computation of the partial derivatives associated with that Op. Hence the symbolic differentiation of the output of a flow graph with respect to any or all of its input nodes can be performed easily in most cases, yielding another flow graph which specifies how to compute these gradients, given the input of the original graph. Since the gradient graph typically contains the original graph (mapping parameters to loss) as a sub-graph, in order to make computations efficient it is important to automate (as done in Theano) a number of simplifications, which are graph transformations preserving the
semantics of the output (given the input) but yielding smaller (or more numerically stable or more efficiently computed) graphs (e.g., removing redundant computations). To take advantage of the fact that computing the loss gradient includes as a first step computing the loss itself, it is advantageous to structure the code so that both the loss and its gradient are computed at once, with a single graph having multiple outputs.
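As an illustration, the hedged sketch below (the variable names x, w, target and the squared-error loss are assumptions made for the example, not taken from the text) builds a scalar loss symbolically, derives its gradient with T.grad, and compiles a single Theano function with both as outputs, so the loss sub-graph is shared:

import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')
w = T.dvector('w')
target = T.dscalar('target')

loss = (T.dot(w, x) - target) ** 2          # scalar loss
grad_w = T.grad(loss, w)                    # symbolic gradient graph

# One compiled function, two outputs: the loss graph is reused for the gradient.
loss_and_grad = theano.function([x, w, target], [loss, grad_w])

l, g = loss_and_grad(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 1.0)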
The advantages of performing gradient computations symbolically are numerous. First of all, one can readily compute gradients over gradients, i.e., second derivatives, which are useful for some learning algorithms. Second, one can define algorithms or training criteria involving gradients themselves, as required for example in the Contractive Auto-Encoder (which uses the norm of a Jacobian matrix in its training criterion, i.e., really requires second derivatives, which here are cheap to compute).
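For instance, a contractive penalty of this kind can be expressed directly with symbolic gradients. The sketch below is an assumption-laden illustration (the sigmoid encoder with weights W, b is chosen for the example, and theano.gradient.jacobian is used to build the Jacobian symbolically), not the implementation of the cited Contractive Auto-Encoder:

import theano
import theano.tensor as T

x = T.dvector('x')
W = T.dmatrix('W')
b = T.dvector('b')

h = T.nnet.sigmoid(T.dot(W, x) + b)          # hidden representation h(x)
jac = theano.gradient.jacobian(h, x)         # dh/dx, built as another symbolic graph
penalty = T.sum(jac ** 2)                    # squared Frobenius norm of the Jacobian

penalty_fn = theano.function([x, W, b], penalty)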
Third, it makes it easy to implement other useful graph transformations such as graph simplifications or numerical optimizations and transformations that help make the numerical results more robust and more efficient (such as working in the domain of logarithms of probabilities rather than in the domain of probabilities directly). Other potential beneficial applications of such symbolic manipulations include parallelization and additional differential operators (such as the R-operator, recently implemented in Theano, which is very useful to compute the product of a Jacobian matrix ∂f(x)/∂x or Hessian matrix ∂²L(x,θ)/∂θ² with a vector without ever having to actually compute and store the matrix itself (Pearlmutter, 1994)).
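Because first derivatives remain symbolic expressions, a Hessian-vector product can be obtained simply by differentiating twice, without ever materializing the Hessian (Theano's Rop provides an alternative route for Jacobian-vector products). The toy loss and the names theta and v below are illustrative assumptions:

import numpy as np
import theano
import theano.tensor as T

theta = T.dvector('theta')
v = T.dvector('v')

loss = T.sum(theta ** 4)                     # toy scalar loss L(theta)
g = T.grad(loss, theta)                      # first derivatives dL/dtheta
hv = T.grad(T.sum(g * v), theta)             # (d^2 L / dtheta^2) v, no Hessian stored

hessian_vector = theano.function([theta, v], hv)
print(hessian_vector(np.array([1.0, 2.0]), np.array([1.0, 0.0])))  # [12., 0.]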
3 Hyper-Parameters
A pure learning algorithm can be seen as a func-
tion taking training data as input and producing
as output a function (e.g. a predictor) or model
(i.e. a bunch of functions). However, in practice,
many learning algorithms involve hyper-parameters, i.e., annoying knobs to be adjusted. In many algorithms, such as Deep Learning algorithms, the number
of hyper-parameters (ten or more!) can make the idea
of having to adjust all of them unappealing. In addi-
tion, it has been shown that the use of computer clus-