Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Rainer Gemulla¹   Peter J. Haas²   Erik Nijkamp²   Yannis Sismanis²

¹ Max-Planck-Institut für Informatik, Saarbrücken, Germany
² IBM Almaden Research Center, San Jose, CA, USA

rgemulla@mpi-inf.mpg.de   {phaas, enijkam, syannis}@us.ibm.com
ABSTRACT
We provide a novel algorithm to approximately factor large matrices
with millions of rows, millions of columns, and billions of nonzero
elements. Our approach rests on stochastic gradient descent (SGD),
an iterative stochastic optimization algorithm. We first develop a
novel “stratified” SGD variant (SSGD) that applies to general loss-
minimization problems in which the loss function can be expressed
as a weighted sum of “stratum losses.” We establish sufficient
conditions for convergence of SSGD using results from stochastic
approximation theory and regenerative process theory. We then
specialize SSGD to obtain a new matrix-factorization algorithm,
called DSGD, that can be fully distributed and run on web-scale
datasets using, e.g., MapReduce. DSGD can handle a wide variety
of matrix factorizations. We describe the practical techniques used to
optimize performance in our DSGD implementation. Experiments
suggest that DSGD converges significantly faster and has better
scalability properties than alternative algorithms.
Categories and Subject Descriptors
G.4 [Mathematics of Computing]: Mathematical Software—Parallel and vector implementations
General Terms
Algorithms, Experimentation, Performance
Keywords
distributed matrix factorization, stochastic gradient descent, MapReduce, recommendation system
1. INTRODUCTION
As Web 2.0 and enterprise-cloud applications proliferate, data
mining algorithms need to be (re)designed to handle web-scale
datasets. For this reason, low-rank matrix factorization has received
much attention in recent years, since it is fundamental to a vari-
ety of mining tasks that are increasingly being applied to massive
datasets [8, 12, 13, 15, 16]. Specifically, low-rank matrix factor-
izations are effective tools for analyzing “dyadic data” in order to
discover and quantify the interactions between two given entities.
Successful applications include topic detection and keyword search
(where the corresponding entities are documents and terms), news
personalization (users and stories), and recommendation systems
(users and items). In large applications (see Sec. 2), these problems
can involve matrices with millions of rows (e.g., distinct customers),
millions of columns (e.g., distinct items), and billions of entries
(e.g., transactions between customers and items). At such massive
scales, distributed algorithms for matrix factorization are essential
to achieving reasonable performance [8, 9, 16, 20]. In this paper, we
provide a novel, effective distributed factorization algorithm based
on stochastic gradient descent.
In practice, exact factorization is generally neither possible nor
desired, so virtually all “matrix factorization” algorithms actually
produce low-rank approximations, attempting to minimize a “loss
function” that measures the discrepancy between the original input
matrix and the product of the factors returned by the algorithm; we
use the term “matrix factorization” throughout to refer to such an
approximation.
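For example, writing the approximation as V ≈ WH, where V is the m × n input matrix and W (of size m × r) and H (of size r × n) are factor matrices of rank r, one common choice of loss is the squared error over the set Z of revealed entries of V:

    L(W, H) = Σ_{(i,j)∈Z} (V_ij − [WH]_ij)²,

and the factorization returns a pair (W, H) that approximately minimizes L. This squared loss is only one illustrative instance; a much broader class of loss functions can be handled.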
With the recent advent of programmer-friendly parallel processing
frameworks such as MapReduce, web-scale matrix factorizations
have become practicable and are of increasing interest to web com-
panies, as well as other companies and enterprises that deal with
massive data. To facilitate distributed processing, prior approaches
would pick an embarrassingly parallel matrix factorization algorithm
and implement it on a MapReduce cluster; the choice of algorithm
was driven by the ease with which it could be distributed. In this
paper, we take a different approach and start with an algorithm that
is known to have good performance in non-parallel environments.
Specifically, we start with stochastic gradient descent (SGD), an
iterative optimization algorithm that has been shown, in a sequential
setting, to be very effective for matrix factorization [13]. Although
the generic SGD algorithm (Sec. 3) is not embarrassingly parallel
and hence cannot directly scale to very large data, we can exploit the
special structure of the factorization problem to obtain a version of
SGD that is fully distributed and scales to extremely large matrices.
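To make the sequential starting point concrete, the following sketch (illustrative only; the variable names, the fixed step size, and the plain unregularized squared loss are simplifying assumptions, not the exact formulation used later) shows SGD factorizing a sparse matrix V ≈ WH by repeatedly sampling one revealed entry and updating only the corresponding row of W and column of H. This locality of the updates is the kind of special structure alluded to above.

import numpy as np

def sgd_factorize(entries, m, n, rank=10, steps=1000000, eta=0.01, seed=0):
    """Sequential SGD for V ~= W @ H over a list of revealed entries (i, j, v).

    Illustrative sketch only: plain squared loss, fixed step size, no
    regularization. Each update touches only row i of W and column j of H.
    """
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((m, rank))   # row factors
    H = 0.1 * rng.standard_normal((rank, n))   # column factors
    for _ in range(steps):
        i, j, v = entries[rng.integers(len(entries))]  # sample one revealed entry
        err = v - W[i, :] @ H[:, j]                    # residual at (i, j)
        W_i_old = W[i, :].copy()                       # use old row in H's update
        W[i, :] += eta * 2 * err * H[:, j]             # descend on row i of W
        H[:, j] += eta * 2 * err * W_i_old             # descend on column j of H
    return W, H

# Usage (toy data):
# W, H = sgd_factorize([(0, 1, 5.0), (2, 3, 1.0)], m=4, n=4, rank=2, steps=10000)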
The key idea is to first develop (Sec. 4) a “stratified” variant of
SGD, called SSGD, that is applicable to general loss-minimization
problems in which the loss function L(θ) can be expressed as a weighted sum of “stratum losses,” so that L(θ) = w_1 L_1(θ) + · · · + w_q L_q(θ). At each iteration, the algorithm takes a downhill step with respect to one of the stratum losses L_s, i.e., approximately in the direction of the negative gradient −L'_s(θ). Although each such direction is “wrong” with respect to minimization of the overall loss L, we prove that, under appropriate regularity conditions, SSGD