Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Rainer Gemulla¹   Peter J. Haas²   Erik Nijkamp²   Yannis Sismanis²

¹ Max-Planck-Institut für Informatik, Saarbrücken, Germany
² IBM Almaden Research Center, San Jose, CA, USA

rgemulla@mpi-inf.mpg.de   {phaas, enijkam, syannis}@us.ibm.com
ABSTRACT
We provide a novel algorithm to approximately factor large matrices
with millions of rows, millions of columns, and billions of nonzero
elements. Our approach rests on stochastic gradient descent (SGD),
an iterative stochastic optimization algorithm. We first develop a
novel “stratified” SGD variant (SSGD) that applies to general loss-
minimization problems in which the loss function can be expressed
as a weighted sum of “stratum losses.” We establish sufficient
conditions for convergence of SSGD using results from stochastic
approximation theory and regenerative process theory. We then
specialize SSGD to obtain a new matrix-factorization algorithm,
called DSGD, that can be fully distributed and run on web-scale
datasets using, e.g., MapReduce. DSGD can handle a wide variety
of matrix factorizations. We describe the practical techniques used to
optimize performance in our DSGD implementation. Experiments
suggest that DSGD converges significantly faster and has better
scalability properties than alternative algorithms.
Categories and Subject Descriptors
G.4 [Mathematics of Computing]: Mathematical Software—Parallel and vector implementations
General Terms
Algorithms, Experimentation, Performance
Keywords
distributed matrix factorization, stochastic gradient descent, MapReduce, recommendation system
1. INTRODUCTION
As Web 2.0 and enterprise-cloud applications proliferate, data
mining algorithms need to be (re)designed to handle web-scale
datasets. For this reason, low-rank matrix factorization has received
much attention in recent years, since it is fundamental to a vari-
ety of mining tasks that are increasingly being applied to massive
datasets [8, 12, 13, 15, 16]. Specifically, low-rank matrix factor-
izations are effective tools for analyzing “dyadic data” in order to
discover and quantify the interactions between two given entities.
Successful applications include topic detection and keyword search
(where the corresponding entities are documents and terms), news
personalization (users and stories), and recommendation systems
(users and items). In large applications (see Sec. 2), these problems
can involve matrices with millions of rows (e.g., distinct customers),
millions of columns (e.g., distinct items), and billions of entries
(e.g., transactions between customers and items). At such massive
scales, distributed algorithms for matrix factorization are essential
to achieving reasonable performance [8, 9, 16, 20]. In this paper, we
provide a novel, effective distributed factorization algorithm based
on stochastic gradient descent.
In practice, exact factorization is generally neither possible nor
desired, so virtually all “matrix factorization” algorithms actually
produce low-rank approximations, attempting to minimize a “loss
function” that measures the discrepancy between the original input
matrix and the product of the factors returned by the algorithm; we
use the term “matrix factorization” throughout to refer to such an
approximation.
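For example, writing the approximation as V ≈ WH, where V is the m × n input matrix and W (of size m × r) and H (of size r × n) are factor matrices of rank r, one common choice of loss is the squared error over the set Z of revealed entries of V:

    L(W, H) = Σ_{(i,j)∈Z} (V_ij − [WH]_ij)²,

and the factorization returns a pair (W, H) that approximately minimizes L. This squared loss is only one illustrative instance; a much broader class of loss functions can be handled.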
With the recent advent of programmer-friendly parallel processing
frameworks such as MapReduce, web-scale matrix factorizations
have become practicable and are of increasing interest to web com-
panies, as well as other companies and enterprises that deal with
massive data. To facilitate distributed processing, prior approaches
would pick an embarrassingly parallel matrix factorization algorithm
and implement it on a MapReduce cluster; the choice of algorithm
was driven by the ease with which it could be distributed. In this
paper, we take a different approach and start with an algorithm that
is known to have good performance in non-parallel environments.
Specifically, we start with stochastic gradient descent (SGD), an
iterative optimization algorithm that has been shown, in a sequential
setting, to be very effective for matrix factorization [13]. Although
the generic SGD algorithm (Sec. 3) is not embarrassingly parallel
and hence cannot directly scale to very large data, we can exploit the
special structure of the factorization problem to obtain a version of
SGD that is fully distributed and scales to extremely large matrices.
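To make the sequential starting point concrete, the following sketch (illustrative only; the variable names, the fixed step size, and the plain unregularized squared loss are simplifying assumptions, not the exact formulation used later) shows SGD factorizing a sparse matrix V ≈ WH by repeatedly sampling one revealed entry and updating only the corresponding row of W and column of H. This locality of the updates is the kind of special structure alluded to above.

import numpy as np

def sgd_factorize(entries, m, n, rank=10, steps=1000000, eta=0.01, seed=0):
    """Sequential SGD for V ~= W @ H over a list of revealed entries (i, j, v).

    Illustrative sketch only: plain squared loss, fixed step size, no
    regularization. Each update touches only row i of W and column j of H.
    """
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((m, rank))   # row factors
    H = 0.1 * rng.standard_normal((rank, n))   # column factors
    for _ in range(steps):
        i, j, v = entries[rng.integers(len(entries))]  # sample one revealed entry
        err = v - W[i, :] @ H[:, j]                    # residual at (i, j)
        W_i_old = W[i, :].copy()                       # use old row in H's update
        W[i, :] += eta * 2 * err * H[:, j]             # descend on row i of W
        H[:, j] += eta * 2 * err * W_i_old             # descend on column j of H
    return W, H

# Usage (toy data):
# W, H = sgd_factorize([(0, 1, 5.0), (2, 3, 1.0)], m=4, n=4, rank=2, steps=10000)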
The key idea is to first develop (Sec. 4) a “stratified” variant of
SGD, called SSGD, that is applicable to general loss-minimization
problems in which the loss function L(θ) can be expressed as a weighted sum of “stratum losses,” so that L(θ) = w_1 L_1(θ) + · · · + w_q L_q(θ). At each iteration, the algorithm takes a downhill step with respect to one of the stratum losses L_s, i.e., approximately in the direction of the negative gradient −L'_s(θ). Although each such direction is “wrong” with respect to minimization of the overall loss L, we prove that, under appropriate regularity conditions, SSGD