Matrix Factorization From a Few Dominators
Quan Liu , Yanbo Wang, and Bo Yuan
Abstract— The Pareto principle states that most effects are the result of
a few dominating causes. This principle also applies to most matrix completion
problems. In practice, most real-world data sets exhibit nonuniformly
distributed observations, yet most existing algorithms assume
uniformly distributed observations. In this brief,
we propose a matrix factorization approach to recover a large-scale
matrix from a dominating submatrix composed of a few of the most
important rows and columns of the original matrix. The importance of
a row or column is evaluated with a measure inspired by term
frequency-inverse document frequency (TF-IDF) from natural language processing.
The selected submatrix is recovered using an existing base matrix fac-
torization algorithm. Then, factors of the completed submatrix are used
to retrieve the factors of the whole matrix via a linear regression model.
Numerical experiments demonstrate the effectiveness of our approach
for recovering the matrix from nonuniformly distributed observations.
In addition, our framework is naturally suited to parallel and
distributed computing, which is very encouraging for massive-sized
data sets.
Index Terms—Divide-and-conquer, linear regression, matrix
completion, matrix factorization, parallel computing.
I. INTRODUCTION
Matrix completion has a great number of applications in learning
systems, including collaborative filtering in recommender sys-
tems [1], structure from motion in computer vision [2], global
positioning in sensor networks [3], multiclass learning in data analy-
sis [4], and deoxyribonucleic acid imputation in bioinformatics [5].
A commonly used assumption is that the unknown matrix M is
low rank or approximately low rank. Therefore, low-rank matrix
completion is a rank minimization problem as follows:
    \min_{X \in \mathbb{R}^{m \times n}} \operatorname{Rank}(X) \quad \text{s.t.} \quad P_{\Omega}(X) = P_{\Omega}(M)    (1)

where X and M are both m × n matrices and P_Ω is the orthogonal
projection onto the set of observed entries Ω, defined as
P_Ω(M)_{i,j} = M_{i,j} if (i, j) ∈ Ω, and 0 otherwise. To solve (1), two widely used
heuristic approaches are nuclear norm minimization [6], [7] and
low-rank matrix factorization [8]. Many algorithms have since been
developed following these two approaches, such as singular value
thresholding [9], accelerated proximal gradient [10], softImpute [11],
and singular value projection (SVP) [12]. Most existing algorithms
come with elegant theoretical guarantees for recovering the
unknown entries, in particular those based on nuclear
norm regularization [6], [7]. However, in practice, circumstances are
more complicated. First, in the era of big data, the size of most
real-world data sets is very large, and most existing algorithms are
not suitable for handling such massive-sized data sets on a single
machine and cannot be easily implemented in distributed systems.
Second, most real-world data sets do not obey the basic assumptions
on which these algorithms are established.
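Before turning to these two problems, it may help to make the notation in (1) concrete. The following Python/NumPy sketch is our own illustration (the function and variable names are not from the brief): it implements the projection operator P_Ω as a boolean mask and checks the feasibility constraint P_Ω(X) = P_Ω(M) on a toy rank-1 matrix.

```python
import numpy as np

def project(X, omega):
    """Orthogonal projection P_Omega: keep the observed entries, zero out the rest."""
    return np.where(omega, X, 0.0)

# Toy example: a rank-1 matrix M with roughly half of its entries observed.
rng = np.random.default_rng(0)
M = np.outer(rng.standard_normal(5), rng.standard_normal(4))
omega = rng.random(M.shape) < 0.5          # boolean mask encoding the set Omega

X = M.copy()                               # any candidate completion
feasible = np.allclose(project(X, omega), project(M, omega))
print("P_Omega(X) == P_Omega(M):", feasible)
```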
Regarding the first problem, most existing algorithms [8]–[13] require
an expensive singular value decomposition in each iteration, which is
a major obstacle to applying them to large-scale data sets. Tackling
this size problem calls for parallel and distributed computing.
Some notable works have been done in this area. For example,
Divide-Factor-Combine (DFC) [14] randomly splits the original
problem into several subproblems, solves these subproblems using
existing algorithms in parallel, and then combines the results of each
subproblem to generate the final solution. Local low-rank matrix
approximation [15] also divides the original matrix into several
local submatrices and then combines them using a weighted sum.
SoftImpute-alternating least squares [16] combines nuclear norm
regularization with alternating least squares to form an
algorithm that outperforms both; it also exploits parallel
computing.
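As a rough illustration of the divide-factor-combine idea (not the actual DFC algorithm, whose combine step uses column projection), the sketch below splits the columns into random blocks, completes each block independently with a placeholder SVD-imputation solver, and stitches the results back together; the block loop is trivially parallelizable. All names and the base solver are our illustrative choices.

```python
import numpy as np

def base_complete(sub, mask, rank, n_iter=50):
    """Placeholder base solver: iterative SVD imputation on one submatrix,
    standing in for any off-the-shelf completion algorithm."""
    X = np.where(mask, sub, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(mask, sub, low_rank)   # keep observed entries fixed
    return low_rank

def divide_factor_combine(M_obs, mask, rank, n_blocks=2, seed=0):
    """Split the columns into blocks, complete each block independently
    (the loop body could run in parallel), and stitch the blocks back."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(M_obs.shape[1]), n_blocks)
    X = np.zeros_like(M_obs, dtype=float)
    for cols in blocks:
        X[:, cols] = base_complete(M_obs[:, cols], mask[:, cols], rank)
    return X
```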
The second problem mentioned above is that most algorithms are
designed and analyzed based on unrealistic assumptions. Most theo-
ries and algorithms assume that the observed entries are distributed
uniformly at random in the matrix [6], [7], [9], [10], [17]. Unfortu-
nately, most real-world data sets do not obey this assumption. The
famous Netflix Prize data set [18] is an example: 20% of the movies
receive 89.20% of the ratings and 20% of the users contribute 63.48%
of the ratings. The data set exhibits power-law distributed observa-
tions instead [19]. Power-law distributions are a common phenomenon
in the real world, arising, for example, in social networks, the World Wide Web,
and biological networks [20]. There are some remarkable works on
nonuniformly distributed matrix completion [19], [21]–[24]. Matrix
completion from information cascading [19] is the first work on
power-law distributed observations. Srebro and Salakhutdinov [21] proposed a
weighted trace-norm regularization (WTR) model, in which the trace
norm is weighted by the observation frequencies of the rows and
columns. They showed empirically that the weighted
trace norm can lead to significant performance improvement on
nonuniformly distributed data sets. Chen et al. [24] provided a
theoretical recovery guarantee for WTR.
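To make the weighting concrete, one possible reading of the WTR idea is sketched below: rows and columns are scaled by the square roots of their empirical observation frequencies before the nuclear (trace) norm is taken. The normalization and the function name are our illustrative choices, not the exact formulation in [21].

```python
import numpy as np

def weighted_trace_norm(X, mask):
    """Trace (nuclear) norm with rows/columns scaled by the square roots of
    their empirical observation frequencies, in the spirit of WTR [21]."""
    p = mask.sum(axis=1) / mask.sum()       # row observation frequencies
    q = mask.sum(axis=0) / mask.sum()       # column observation frequencies
    weighted = np.sqrt(p)[:, None] * X * np.sqrt(q)[None, :]
    return np.linalg.norm(weighted, ord="nuc")
```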
A. Our Contributions
In this brief, we propose a distributed matrix factorization
approach, entitled Sample-Factor-Expand-Combine (SFEC), for
large-scale matrix completion problems. SFEC selects a few of the most
important rows and columns of the original matrix to compose a
small submatrix, solves the selected submatrix using a base matrix
completion algorithm and retains the result in factored form, expands
factors of the submatrix into factors of the original matrix via a
linear regression model, and finally combines the expanded factors
to recover the full large-scale matrix.
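Read as high-level pseudocode, the four steps might be arranged as in the sketch below. The importance score (plain observation counts), the base solver (a few rounds of SVD imputation), and the per-row and per-column least-squares expand step are placeholders for illustration; they are not the TF-IDF-style scoring or the exact regression model developed in the brief.

```python
import numpy as np

def sfec_sketch(M_obs, mask, rank, n_rows, n_cols, n_iter=50):
    # --- Sample: pick the rows/columns with the most observations
    # (a crude stand-in for the TF-IDF-inspired importance score).
    top_rows = np.argsort(-mask.sum(axis=1))[:n_rows]
    top_cols = np.argsort(-mask.sum(axis=0))[:n_cols]

    # --- Factor: complete the dominating submatrix; a few rounds of SVD
    # imputation stand in for the base completion algorithm.
    sub = M_obs[np.ix_(top_rows, top_cols)]
    sub_mask = mask[np.ix_(top_rows, top_cols)]
    X = np.where(sub_mask, sub, 0.0)
    for _ in range(n_iter):
        U_f, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = np.where(sub_mask, sub, (U_f[:, :rank] * s[:rank]) @ Vt[:rank])
    U_f, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_s, V_s = U_f[:, :rank] * s[:rank], Vt[:rank].T     # factored form

    # --- Expand: regress every remaining row onto V_s (and every remaining
    # column onto U_s) using only its observed entries; each regression is
    # independent, so this step parallelizes naturally.
    U = np.zeros((M_obs.shape[0], rank))
    U[top_rows] = U_s
    for i in np.setdiff1d(np.arange(M_obs.shape[0]), top_rows):
        obs = mask[i, top_cols]
        if obs.any():
            U[i], *_ = np.linalg.lstsq(V_s[obs], M_obs[i, top_cols][obs], rcond=None)

    V = np.zeros((M_obs.shape[1], rank))
    V[top_cols] = V_s
    for j in np.setdiff1d(np.arange(M_obs.shape[1]), top_cols):
        obs = mask[top_rows, j]
        if obs.any():
            V[j], *_ = np.linalg.lstsq(U_s[obs], M_obs[top_rows, j][obs], rcond=None)

    # --- Combine: the expanded factors give the completed full matrix.
    return U @ V.T
```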
SFEC attempts to solve two important problems: massive size
and nonuniform observations. For the massive size problem, SFEC
divides the original massive-sized matrix into small submatrices that
can be recovered much more easily. Furthermore, the expand step of SFEC
can be executed in parallel for each row or column, which is naturally