Deep Learning Tutorial, Release 0.1
# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
3.4.2 Stochastic Gradient Descent
What is ordinary gradient descent? it is a simple algorithm in which we repeatedly make small steps down-
ward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient
descent we consider that the training data is rolled into the loss function. Then the pseudocode of this algo-
rithm can be described as :
# GRADIENT DESCENT
while True:
loss = f(params)
d_loss_wrt_params = ... # compute gradient
params -= learning_rate
*
d_loss_wrt_params
if <stopping condition is met>:
return params
Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but
proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire
training set. In its purest form, we estimate the gradient from just a single example at a time.
# STOCHASTIC GRADIENT DESCENT
for (x_i,y_i) in training_set:
# imagine an infinite generator
# that may repeat examples (if there is only a finite training set)
loss = f(params, x_i, y_i)
d_loss_wrt_params = ... # compute gradient
params -= learning_rate
*
d_loss_wrt_params
if <stopping condition is met>:
return params
The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-
called “minibatches”. Minibatch SGD works identically to SGD, except that we use more than one training
example to make each estimate of the gradient. This technique reduces variance in the estimate of the
gradient, and often makes better use of the hierarchical memory organization in modern computers.
for (x_batch,y_batch) in train_batches:
# imagine an infinite generator
# that may repeat examples
loss = f(params, x_batch, y_batch)
d_loss_wrt_params = ... # compute gradient using theano
params -= learning_rate
*
d_loss_wrt_params
10 Chapter 3. Getting Started