Asynchronous Proximal Stochastic Gradient Algorithm for Composition Optimization Problems

Pengfei Wang,1,2 Risheng Liu,3 Nenggan Zheng,1∗ Zhefeng Gong4

1 Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, Zhejiang, China
2 College of Computer Science and Technology, Zhejiang University, Hangzhou, Zhejiang, China
3 International School of Information Science & Engineering, Dalian University of Technology, Liaoning, China
4 Department of Neurobiology, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
{pfei, zng, zfgong}@zju.edu.cn, rsliu@dlut.edu.cn
Abstract
In machine learning research, many emerging applications can be (re)formulated as the composition optimization problem with a nonsmooth regularization penalty. To solve this problem, the traditional stochastic gradient descent (SGD) algorithm and its variants either have a low convergence rate or are computationally expensive. Recently, several stochastic composition gradient algorithms have been proposed; however, these methods are still inefficient and not scalable to large-scale composition optimization problem instances. To address these challenges, we propose an asynchronous parallel algorithm, named Async-ProxSCVR, which effectively combines asynchronous parallel implementation with a variance reduction method. We prove that the algorithm admits the fastest convergence rate for both the strongly convex and general nonconvex cases. Furthermore, we analyze the query complexity of the proposed algorithm and prove that linear speedup is achievable as the number of processors increases. Finally, we evaluate Async-ProxSCVR on two representative composition optimization problems, namely value function evaluation in reinforcement learning and the sparse mean-variance optimization problem. Experimental results show that the algorithm achieves significant speedups and is much faster than existing methods.
Introduction
The composition optimization problem, recently proposed by Wang et al. (Wang, Fang, and Liu 2014), arises in many important applications including reinforcement learning (Wang, Liu, and Tang 2017), statistical learning (Hinton and Roweis 2003), risk management (Shapiro, Dentcheva, and Ruszczyński 2009), and multi-stage stochastic programming (Shapiro, Dentcheva, and Ruszczyński 2009). In this paper, we study the finite-sum scenario of the regularized composition optimization problem, whose objective function is the composition of two finite-sum functions plus a possibly nonsmooth regularization term, i.e.,
$$\min_{x \in \mathbb{R}^N} \; H(x) = f(x) + h(x), \qquad (1)$$
∗ Corresponding author
where
$$f(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} F_i\!\left( \frac{1}{n_2} \sum_{j=1}^{n_2} G_j(x) \right), \qquad (2)$$
where $G_j : \mathbb{R}^N \to \mathbb{R}^M$, $F_i : \mathbb{R}^M \to \mathbb{R}$, and $f : \mathbb{R}^N \to \mathbb{R}$ are continuously differentiable functions. We denote $G(x) := \frac{1}{n_2} \sum_{j=1}^{n_2} G_j(x)$, $y := G(x)$, and $F(y) := \frac{1}{n_1} \sum_{i=1}^{n_1} F_i(y)$, and we call $G(x)$ the inner function and $F(y)$ the outer function. Often $f(x)$ is used as the empirical risk approximation to the composition of two expected-value functions $\mathbb{E}_i F_i(\mathbb{E}_j G_j(x))$. The regularization $h : \mathbb{R}^N \to \mathbb{R} \cup \{+\infty\}$ is an extended real-valued, closed, convex, but possibly nonsmooth function, which is often used to constrain the capacity of the hypothesis space.
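To make the two-level finite-sum structure concrete, the following is a minimal sketch of how the objective in (1)-(2) could be evaluated; the names `composition_objective`, `G_list`, `F_list`, and `h` are illustrative placeholders and not part of the original formulation.

```python
import numpy as np

def composition_objective(x, G_list, F_list, h):
    """Evaluate H(x) = f(x) + h(x) for the two-level finite sum in (1)-(2).

    G_list: n2 callables, each mapping R^N -> R^M (the inner components G_j).
    F_list: n1 callables, each mapping R^M -> R (the outer components F_i).
    h:      the (possibly nonsmooth) regularizer, e.g. an l1 penalty.
    """
    # Inner finite sum: G(x) = (1/n2) * sum_j G_j(x)
    y = np.mean([G_j(x) for G_j in G_list], axis=0)
    # Outer finite sum: f(x) = (1/n1) * sum_i F_i(G(x))
    f_val = np.mean([F_i(y) for F_i in F_list])
    return f_val + h(x)
```

For example, an ℓ1 penalty `h = lambda x: lam * np.abs(x).sum()` (with `lam` a hypothetical penalty weight) would fit a sparse setting such as the sparse mean-variance problem mentioned in the abstract.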
In general, problem (1) is substantially more challenging than its non-composition counterpart, i.e., the empirical risk minimization (ERM) problem (Friedman, Hastie, and Tibshirani 2001) with the finite-sum form $f(x) = \frac{1}{n} \sum_{i=1}^{n} F_i(x)$. On the one hand, the composition objective is nonlinear with respect to the joint distribution of data indices $(i, j)$, which makes it difficult to obtain an unbiased sample estimate of the full gradient $\nabla f(x)$. On the other hand, the two-level finite-sum structure causes unprecedented computational challenges for traditional stochastic gradient methods in solving problem (1). For example, to apply the stochastic gradient descent (SGD) method, we need to compute $(\partial G_j(x))^{\mathsf T} \nabla F_i(G(x))$ in each iteration, and evaluating the inner function $G(x)$ exactly is computationally expensive, with per-iteration queries proportional to $n_2$.
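As a brief illustration of both points (an elaboration in the notation above, not taken from the original text): the full gradient requires the exact inner value $G(x)$, which costs $n_2$ component queries, while a single randomly sampled pair $(i, j)$ yields a biased estimator because $\nabla F_i$ is nonlinear:
$$\nabla f(x) = (\partial G(x))^{\mathsf T} \, \nabla F\big(G(x)\big), \qquad
\mathbb{E}_{i,j}\Big[(\partial G_j(x))^{\mathsf T} \, \nabla F_i\big(G_j(x)\big)\Big] \neq \nabla f(x) \ \text{in general},$$
since $\mathbb{E}_j\big[\nabla F_i(G_j(x))\big] \neq \nabla F_i\big(\mathbb{E}_j G_j(x)\big) = \nabla F_i(G(x))$ whenever $\nabla F_i$ is nonlinear.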
To solve problem (1), Wang et al. (Wang, Fang, and Liu 2014; Wang, Liu, and Tang 2017) first propose a class of stochastic compositional gradient descent methods, i.e., SCGD and ASC-PG, which are based on random evaluations of $G(x)$, $F(y)$, and their gradients at a low per-iteration cost. However, SCGD and ASC-PG suffer from low convergence rates because of the variance induced by the random samplings. In recent years, variance-reduction-based methods have been proposed to improve the convergence rate for the composition optimization problem (1). Some algorithms employ the stochastic variance reduced gradient (SVRG) method to improve SCGD, including Compositional-SVRG-1 (Lian, Wang, and Liu 2017), Compositional-SVRG-2 (Lian, Wang, and Liu 2017), and