methods, as well as a deeper look into their abilities to guarantee improvement in the underlying
expected risk R. We start to investigate these topics in the next subsection.
We remark in passing that the stochastic and batch approaches mentioned here have analogues
in the simulation and stochastic optimization communities, where they are referred to as stochastic
approximation (SA) and sample average approximation (SAA), respectively [60].
Inset 3.1: Herbert Robbins and Stochastic Approximation
The paper by Robbins and Monro [126] represents a landmark in the history of numerical
optimization methods. Together with the invention of back propagation [128, 129], it also
represents one of the most notable developments in the field of machine learning. The SG
method was first proposed in [126], not as a gradient method, but as a Markov chain.
Viewed more broadly, the works by Robbins and Monro [126] and Kalman [80] mark the
beginning of the field of stochastic approximation, which studies the behavior of iterative meth-
ods that use noisy signals. The initial focus on optimization led to the study of algorithms
that track the solution of the ordinary differential equation ˙w = −∇F (w). Stochastic approx-
imation theory has had a major impact in signal processing and in areas closer to the subject
of this paper, such as pattern recognition [2] and neural networks [18].
After receiving his PhD, Herbert Robbins became a lecturer at New York University, where
he co-authored with Richard Courant the popular book What is Mathematics? [36], which is
still in print after more than seven decades [37]. Robbins went on to become one of the
most prominent mathematicians of the second half of the twentieth century, known for his
contributions to probability, algebra, and graph theory.
3.3 Motivation for Stochastic Methods
Before discussing the strengths of stochastic methods such as SG, one should not lose sight of
the fact that batch approaches possess some intrinsic advantages. First, when one has reduced
the stochastic problem of minimizing the expected risk R to focus exclusively on the deterministic
problem of minimizing the empirical risk R_n, the use of full gradient information at each iterate
opens the door for many deterministic gradient-based optimization methods. That is, in a batch
approach, one has at their disposal the wealth of nonlinear optimization techniques that have been
developed over the past decades, including the full gradient method (3.8), but also accelerated
gradient, conjugate gradient, quasi-Newton, and inexact Newton methods [113]. (See §6 and §7 for
discussion of these techniques.) Second, due to the sum structure of R_n, a batch method can easily
benefit from parallelization since the bulk of the computation lies in evaluations of R_n and ∇R_n.
Calculations of these quantities can even be done in a distributed manner.
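To illustrate how the sum structure enables parallelism, the following sketch computes ∇R_n as a sum of independent per-chunk contributions. It assumes a hypothetical least-squares loss f_i(w) = ½(aᵢᵀw − bᵢ)², and all data and function names are illustrative, not part of the source.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical least-squares losses: f_i(w) = 0.5 * (a_i^T w - b_i)^2,
# so grad f_i(w) = a_i * (a_i^T w - b_i) and grad R_n(w) = (1/n) * sum_i grad f_i(w).
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def partial_grad(w, idx):
    """Gradient contribution of the samples indexed by idx (one chunk)."""
    r = A[idx] @ w - b[idx]
    return A[idx].T @ r

def full_grad(w, workers=4):
    """grad R_n(w), assembled from independent chunks; each chunk could
    just as well live on a different machine."""
    chunks = np.array_split(np.arange(n), workers)
    with ThreadPoolExecutor(workers) as ex:
        parts = list(ex.map(lambda idx: partial_grad(w, idx), chunks))
    return sum(parts) / n

w = np.zeros(d)
g_parallel = full_grad(w)
g_serial = A.T @ (A @ w - b) / n   # reference computation, no chunking
assert np.allclose(g_parallel, g_serial)
```

Because the chunks interact only through a final sum, the same pattern carries over directly to distributed evaluation of R_n and ∇R_n.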
Despite these advantages, there are intuitive, practical, and theoretical reasons for following a
stochastic approach. Let us motivate them by contrasting the hallmark SG iteration (3.7) with the
full batch gradient iteration (3.8).
Intuitive Motivation On an intuitive level, SG employs information more efficiently than a
batch method. To see this, consider a situation in which a training set, call it S, consists of ten
copies of a set S_sub. A minimizer of empirical risk for the larger set S is clearly given by a minimizer
for the smaller set S_sub, but if one were to apply a batch approach to minimize R_n over S, then