combine Hessian information from multiple subfunctions in a
much more natural and efficient way than previous work,
and avoids the requirement of large minibatches per up-
date step to accurately estimate the full Hessian. More-
over, we develop a novel method to maintain computa-
tional tractability and limit the memory requirements of
this quasi-Newton method in the face of high dimensional
optimization problems (large M). We do this by storing
and manipulating the subfunctions in a shared, adaptive
low dimensional subspace, determined by the recent his-
tory of the gradients and iterates.
Thus our optimization method can usefully estimate and
utilize powerful second-order information while simulta-
neously combatting two potential sources of computational
intractability: large numbers of subfunctions (large N) and
a high-dimensional optimization domain (large M). More-
over, the use of a second order approximation means that
minimal or no adjustment of hyperparameters is required.
We refer to the resulting algorithm as Sum of Functions
Optimizer (SFO). We demonstrate that the combination of
techniques and new ideas inherent in SFO results in faster
optimization on seven disparate example problems. Fi-
nally, we release the optimizer and the test suite as open
source Python and MATLAB packages.
2. Algorithm
Our goal is to combine the benefits of stochastic and quasi-
Newton optimization techniques. We first describe the gen-
eral procedure by which we optimize the parameters x.
We then describe the construction of the shared low di-
mensional subspace which makes the algorithm tractable
in terms of computational overhead and memory for large
problems. This is followed by a description of the BFGS
method by which an online Hessian approximation is main-
tained for each subfunction. Finally, we end this section
with a review of implementation details.
2.1. Approximating Functions
We define a series of functions $G^t(x)$ intended to approximate $F(x)$,
$$G^t(x) = \sum_{i=1}^{N} g_i^t(x), \qquad (3)$$
where the superscript t indicates the learning iteration.
Each $g_i^t(x)$ serves as a quadratic approximation to the cor-
responding $f_i(x)$. The functions $g_i^t(x)$ will be stored, and
one of them will be updated per learning step.
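As a concrete illustration (this is a minimal sketch with names of our own choosing, not the released SFO code), each stored quadratic approximation and their sum in Equation 3 can be written as:

```python
import numpy as np

class QuadraticModel:
    """Stored quadratic approximation g_i^t(x) to subfunction f_i, expanded
    around the point x0 where f_i was last evaluated:
    g(x) = f + (x - x0)^T grad + 0.5 (x - x0)^T hess (x - x0)."""
    def __init__(self, f, grad, hess, x0):
        self.f, self.grad, self.hess, self.x0 = f, grad, hess, x0

    def __call__(self, x):
        d = x - self.x0
        return self.f + d @ self.grad + 0.5 * d @ self.hess @ d

def G(models, x):
    """Approximating objective G^t(x) = sum_i g_i^t(x) (Equation 3)."""
    return sum(g(x) for g in models)
```

Because each term is quadratic, $G^t(x)$ is itself quadratic, which is what makes the exact minimization in the next section possible.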
2.2. Update Steps
As is illustrated in Figure 1, optimization is performed by
repeating the steps:
1. Choose a vector $x^t$ by minimizing the approximating
objective function $G^{t-1}(x)$,
$$x^t = \operatorname*{argmin}_x \, G^{t-1}(x). \qquad (4)$$
Since $G^{t-1}(x)$ is a sum of quadratic functions
$g_i^{t-1}(x)$, it can be exactly minimized by a Newton
step,
$$x^t = x^{t-1} - \eta^t \left(H^{t-1}\right)^{-1} \frac{\partial G^{t-1}\left(x^{t-1}\right)}{\partial x}, \qquad (5)$$
where $H^{t-1}$ is the Hessian of $G^{t-1}(x)$. The step
length $\eta^t$ is typically unity, and will be discussed in
Section 3.5.
2. Choose an index $j \in \{1 \ldots N\}$, and update the cor-
responding approximating subfunction $g_j^t(x)$ using a
second order power series around $x^t$, while leaving all
other subfunctions unchanged,
$$g_i^t(x) = \begin{cases} g_i^{t-1}(x) & i \neq j \\[4pt] f_i\left(x^t\right) + \left(x - x^t\right)^T f_i'\left(x^t\right) + \frac{1}{2}\left(x - x^t\right)^T H_i^t \left(x - x^t\right) & i = j \end{cases}. \qquad (6)$$
The constant and first order term in Equation 6 are set
by evaluating the subfunction and gradient, $f_j(x^t)$ and
$f_j'(x^t)$. The quadratic term $H_j^t$ is set by using the BFGS
algorithm to generate an online approximation to the true
Hessian of subfunction $j$ based on its history of gradient
evaluations (see Section 2.4). The Hessian of the summed
approximating function $G^t(x)$ in Equation 5 is the sum of
the Hessians for each $g_j^t(x)$, $H^t = \sum_j H_j^t$.
2.3. A Shared, Adaptive, Low-Dimensional
Representation
The dimensionality $M$ of $x \in \mathbb{R}^M$ is typically large. As
a result, the memory and computational cost of working
directly with the matrices $H_i^t \in \mathbb{R}^{M \times M}$ is typically pro-
hibitive, as is the cost of storing the history terms $\Delta f'$ and
$\Delta x$ required by BFGS (see Section 2.4). To reduce the
dimensionality from $M$ to a tractable value, all history is
instead stored and all updates computed in a lower dimen-
sional subspace, with dimensionality between $K_{\min}$ and
$K_{\max}$. This subspace is constructed such that it includes
the most recent gradient and position for every subfunc-
tion, and thus $K_{\min} \geq 2N$. This guarantees that the sub-
space includes both the steepest gradient descent direction
over the full batch, and the update directions from the most
recent Newton steps (Equation 5).
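One simple way to realize such a subspace (a batch QR sketch of our own, for illustration; an incremental construction would avoid re-orthonormalizing from scratch) is to stack the most recent iterate and gradient of every subfunction and orthonormalize them:

```python
import numpy as np

def build_subspace(recent_positions, recent_gradients):
    """Orthonormal basis P spanning the latest position and gradient of every
    subfunction, so up to K = 2N directions before deduplication."""
    V = np.column_stack(list(recent_positions) + list(recent_gradients))  # M x 2N
    Q, R = np.linalg.qr(V)
    keep = np.abs(np.diag(R)) > 1e-10   # drop numerically dependent directions
    return Q[:, keep]                    # P in R^{M x K}, K <= 2N

def to_subspace(P, v):
    """Store a full-dimensional vector as its K-dimensional representation."""
    return P.T @ v
```

History vectors and Hessian factors kept in this representation cost $O(K)$ and $O(K^2)$ memory rather than $O(M)$ and $O(M^2)$.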