没有合适的资源?快使用搜索试试~ 我知道了~
首页利用少数支配者恢复大规模矩阵的矩阵分解方法
利用少数支配者恢复大规模矩阵的矩阵分解方法
0 下载量 157 浏览量
更新于2024-08-26
收藏 1.46MB PDF 举报
本文是一篇研究论文,标题为《来自几个支配者的矩阵分解》,作者是Quan Liu、Yanbo Wang和Bo Yuan。论文主要探讨了Pareto原理在矩阵完成问题中的应用,该原理指出大部分结果是由少数主导因素引起的。现实世界的数据集往往表现出非均匀观测分布,然而大多数现有的算法都是基于均匀观测设计的,这可能并不适用于实际场景。 论文的核心贡献在于提出了一种新颖的矩阵分解方法,目的是从原始矩阵中恢复大规模矩阵,特别是通过识别并利用包含最重要行和列的主导子矩阵。这个过程的关键在于评估行或列的重要性,作者借鉴了自然语言处理中的术语频率-逆文档频率(Term Frequency-Inverse Document Frequency,TF-IDF)概念,这是一种衡量词语在文档集合中独特性的重要度量。 具体步骤如下:首先,通过一种已有的基础矩阵分解算法来恢复出这个重要的主导子矩阵。然后,利用完成后的子矩阵的因子,推断出整个矩阵的因子。这种方法的优势在于,即使数据集中存在非均匀分布,也能有效地聚焦于少数关键部分,提高矩阵分解的准确性和效率。 通过这种方法,论文旨在解决大规模矩阵分解时存在的问题,特别是当数据中存在显著的主导元素时,可以更有效地提取出隐藏的模式和结构。这种针对非均匀观测设计的算法,有望在实际的推荐系统、数据挖掘以及机器学习等领域中发挥重要作用,提升模型的预测能力和解释性。
资源详情
资源推荐
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 1
Matrix Factorization From a Few Dominators
Quan Liu , Yanbo Wang, and Bo Yuan
Abstract— The Pareto principle states that most effects are the result of
a few dominating causes. This principle also fits most matrix completion
problems. In practice, most real-world data sets exhibit nonuniformly
distributed observations. Unfortunately, most existing algorithms are
designed based on uniformly distributed observations. In this brief,
we propose a matrix factorization approach to recover a large-scale
matrix from a dominating submatrix that is composed of a few most
important rows and columns from the original matrix. The method for
evaluating the importance of a row or column is inspired by the term
frequency-inverse document frequency in natural language processing.
The selected submatrix is recovered using an existing base matrix fac-
torization algorithm. Then, factors of the completed submatrix are used
to retrieve the factors of the whole matrix via a linear regression model.
Numerical experiments demonstrate the effectiveness of our approach
for recovering the matrix from nonuniformly distributed observations.
In addition, our framework is naturally applicable for parallel and
distributed computing, which is very encouraging for massive-sized
data sets.
Index Terms—Divide-and-conquer, linear regression, matrix
completion, matrix factorization, parallel computing.
I. I
NTRODUCTION
Matrix completion has a great number of applications in learning
systems, including collaborative filtering in recommender sys-
tems [1], structure from motion in computer vision [2], global
positioning in sensor networks [3], multiclass learning in data analy-
sis [4], and deoxyribonucleic acid imputation in bioinformatics [5].
A commonly used assumption is that the unknown matrix M is
low rank or approximately low rank. Therefore, low-rank matrix
completion is a rank minimization problem as follows:
min
X∈R
m×n
Rank(X) s.t. P
(X) = P
(M) (1)
where X and M are both m × n matrices and P
is an orthogonal
projection onto the observed entries defined as P
(M
i, j
) = M
i, j
if (i, j) ∈ , or 0 otherwise. To solve (1), two widely used
heuristic approaches are nuclear norm minimization [6], [7] and
low-rank matrix factorization [8]. Many algorithms have since been
developed following these two approaches, such as singular value
thresholding [9], accelerated proximal gradient [10], softImpute [11],
and singular value projection (SVP) [12]. Most existing algorithms
have beautiful theories to guarantee their success in recovering the
unknown entries, in particular for the algorithms based on nuclear
norm regularization [6], [7]. However, in practice, circumstances are
more complicated. First, in the era of big data, the size of most
real-world data sets is very large, and most existing algorithms are
not suitable for handling such massive-sized data sets in a single
machine and cannot be easily implemented in distributed systems.
Manuscript received December 28, 2016; revised September 29, 2017 and
January 13, 2018; accepted January 16, 2018. This work was supported by
the National Natural Science Foundation of China under Grant 61472243 and
Grant 61574089. (Corresponding author: Bo Yuan.)
The authors are with the School of Computer Science and Engineer-
ing, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail:
liuquan@sjtu.org; yanbowang@nuc.edu.cn; boyuan@sjtu.edu.cn).
Color versions of one or more of the figures in this brief are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2018.2795581
Second, most real-world data sets do not obey the basic assumptions
on which these algorithms are established.
For the first problem, most existing algorithms [8]–[13] require
time-expensive calculations of singular value decomposition dur-
ing each iteration. This is a huge obstacle to the utilization of
these algorithms on large-scale data sets. To tackle this size prob-
lem, it is urged to embrace parallel and distributed computing.
Some notable works have been done in this area. For example,
Divide-Factor-Combine (DFC) [14] randomly splits the original
problem into several subproblems, solves these subproblems using
existing algorithms in parallel, and then combines the results of each
subproblem to generate the final solution. Local low rank matrix
approximation [15] also divides the original matrix into several
local submatrices and then combines them using a weighted sum.
SoftImpute-alternating least squares [16] combines nuclear norm
regularization and alternative least squares together to form an
algorithm that outperforms both; parallel computing is also used in
this algorithm.
The second problem mentioned above is that most algorithms are
designed and analyzed based on unrealistic assumptions. Most theo-
ries and algorithms assume that the observed entries are distributed
uniformly at random in the matrix [6], [7], [9], [10], [17]. Unfortu-
nately, most real-world data sets do not obey this assumption. The
famous Netflix Prize data set [18] is an example: 20% of the movies
receive 89.20% of the ratings and 20% of the users contribute 63.48%
of the ratings. The data set exhibits power-law distributed observa-
tions instead [19]. Power-law distribution is a common phenomenon
in the real world, for example, social networks, the World Wide Web,
and biological networks [20]. There are some remarkable works on
nonuniformly distributed matrix completion [19], [21]–[24]. Matrix
completion from information cascading [19] is the first work on
power-law distribution. Srebro and Salakhutdinov [21] proposed a
weighted trace-norm regularization (WTR) model. In their method,
the trace norm is regularized with weight by the frequency of
rows and columns. They have shown empirically that a weighted
trace norm can lead to significant performance improvement on
nonuniformly distributed data sets. Chen et al. [24] provided a
theoretical recovery guarantee for WTR.
A. Our Contributions
In this brief, we propose a distributed matrix factorization
approach, entitled Sample-Factor-Expand-Combine (SFEC), for
large-scale matrix completion problems. SFEC selects a few most
important rows and columns from the original matrix to compose a
small submatrix, solves the selected submatrix using a base matrix
completion algorithm and retains the result in factored form, expands
factors of the submatrix into factors of the original matrix via a
linear regression model, and finally combines the expanded factors
to recover the full large-scale matrix.
SFEC attempts to solve two important problems, massive size
and nonuniform observations. For the massive size problem, SFEC
divides the original massive-sized matrix into small submatrices that
can be recovered much easier. Furthermore, the expand step of SFEC
can be executed in parallel for each row or column, which is naturally
2162-237X © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
下载后可阅读完整内容,剩余6页未读,立即下载
weixin_38508821
- 粉丝: 6
- 资源: 951
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- OptiX传输试题与SDH基础知识
- C++Builder函数详解与应用
- Linux shell (bash) 文件与字符串比较运算符详解
- Adam Gawne-Cain解读英文版WKT格式与常见投影标准
- dos命令详解:基础操作与网络测试必备
- Windows 蓝屏代码解析与处理指南
- PSoC CY8C24533在电动自行车控制器设计中的应用
- PHP整合FCKeditor网页编辑器教程
- Java Swing计算器源码示例:初学者入门教程
- Eclipse平台上的可视化开发:使用VEP与SWT
- 软件工程CASE工具实践指南
- AIX LVM详解:网络存储架构与管理
- 递归算法解析:文件系统、XML与树图
- 使用Struts2与MySQL构建Web登录验证教程
- PHP5 CLI模式:用PHP编写Shell脚本教程
- MyBatis与Spring完美整合:1.0.0-RC3详解
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功