A covariance-free iterative algorithm for distributed principal component analysis on vertically partitioned data

Yue-Fei Guo a,*, Xiaodong Lin b, Zhou Teng a, Xiangyang Xue a, Jianping Fan c

a School of Computer Science, Fudan University, Shanghai, China
b Department of Management and Information Systems, Rutgers University, NJ 08854, USA
c Department of Computer Science, University of North Carolina at Charlotte, NC 28223, USA
Article history: Received 1 September 2010; Received in revised form 19 August 2011; Accepted 4 September 2011; Available online 10 September 2011

Keywords: Distributed principal component analysis; Covariance-free; Vertical dimension partition

Abstract
In this paper, a covariance-free iterative algorithm is developed to achieve distributed principal component analysis on high-dimensional data sets that are vertically partitioned. We have proved that our iterative algorithm converges monotonically at an exponential rate. Unlike existing techniques that aim at approximating the global PCA, our covariance-free iterative distributed PCA (CIDPCA) algorithm can estimate the principal components directly without computing the sample covariance matrix, so a significant reduction in transmission cost can be achieved. Furthermore, in comparison with existing distributed PCA techniques, CIDPCA provides more accurate estimates of the principal components and better classification results. We have demonstrated the superior performance of CIDPCA through studies of multiple real-world data sets.
© 2011 Elsevier Ltd. All rights reserved.
1. Introduction
Extracting more representative feature dimensions (i.e., dimension reduction) is one of the most fundamental components of modern data analysis, particularly when dealing with massive, high-dimensional data sets. Principal component analysis (PCA) is a well-established technique for achieving this goal [3–6,8]. By selecting the top k eigenvectors, those with the largest eigenvalues, for subspace approximation, PCA provides a lower-dimensional representation that reveals the underlying latent structures of complex data sets. Such a PCA-based representation, which is optimal in a least-squares sense, can filter out much of the noise inherent in the data. Due to its simplicity, interpretability, and model independence, PCA has become the standard technique for practitioners across different fields to extract relevant information from massive data sets [11,12,14,16–18].
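The classical procedure just described, centering the data, eigendecomposing the sample covariance, and projecting onto the top-k eigenvectors, can be sketched as follows. This is a generic textbook illustration, not the paper's algorithm; the function name and synthetic data are our own.

```python
import numpy as np

def pca_top_k(X, k):
    """Classic PCA: eigendecompose the sample covariance, keep the top-k components."""
    Xc = X - X.mean(axis=0)                 # center each feature
    C = Xc.T @ Xc / (X.shape[0] - 1)        # d x d sample covariance matrix
    vals, vecs = np.linalg.eigh(C)          # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
    W = vecs[:, order]                      # d x k projection (loading) matrix
    return Xc @ W, W, vals[order]           # scores, loadings, explained variances

# Synthetic data: feature 0 is given a much larger variance, so the first
# principal component should align with it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 10.0
scores, W, variances = pca_top_k(X, 2)
```

Note that the covariance matrix `C` is d x d; the least-squares optimality mentioned above means the k-dimensional scores minimize the reconstruction error among all rank-k linear projections.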
In recent years, we have seen more and more data being collected and distributed across multiple sites. The rapid development of computer clusters and grid computing technologies also allows us to perform computation on these distributed data sources more effectively. To avoid the computing-power and memory bottlenecks of a single machine, computation tasks on massive data sets are split into many smaller tasks and distributed across multiple machines to facilitate more scalable computation [13,20]. Traditional PCA techniques, which are based on the eigen-decomposition of the global sample covariance matrix, are no longer feasible in such distributed environments. Thus it is very important to develop more effective approaches for achieving distributed PCA (D-PCA).
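To see concretely why covariance-based PCA is awkward when features are vertically partitioned, note that the global covariance matrix contains cross-site blocks that couple columns held at different sites. The snippet below (a hypothetical two-site setup of our own construction, not from the paper) makes this block structure explicit:

```python
import numpy as np

# Hypothetical setting: n samples whose features are split across two sites.
rng = np.random.default_rng(1)
n, d1, d2 = 100, 3, 4
X1 = rng.normal(size=(n, d1))   # feature columns held at site 1
X2 = rng.normal(size=(n, d2))   # feature columns held at site 2

# What centralized PCA would decompose: the (d1+d2) x (d1+d2) covariance.
X = np.hstack([X1, X2])
Xc = X - X.mean(axis=0)
C_global = Xc.T @ Xc / (n - 1)

# The off-diagonal block C_global[:d1, d1:] equals the cross-covariance
# between the two sites' columns. Computing it requires moving raw n-length
# columns between sites, which is what drives the transmission cost of
# covariance-based distributed PCA.
X1c = X1 - X1.mean(axis=0)
X2c = X2 - X2.mean(axis=0)
C_cross = X1c.T @ X2c / (n - 1)
```

Each cross block costs O(n) communication per feature pair, so assembling the full covariance scales with the raw data size.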
In a distributed environment, the exact solution can still be obtained by transferring all the data scattered across different sites to a single site; however, the transmission cost is prohibitive. One of the main challenges in supporting D-PCA is therefore to reduce the transmission cost while maintaining the accuracy of the principal components computed in a distributed setting. To achieve this goal, some existing D-PCA approaches focus on computing and integrating local summary statistics to approximate the global PCA [6,11,12]. These techniques achieve a significant reduction in transmission cost, but the improvement comes at the expense of computation accuracy, and high accuracy is another crucial factor for practical distributed computation. Therefore, there is a pressing need to design more effective D-PCA algorithms that take the local constraints (e.g., bandwidth, latency, and reliability) of each local machine into account while generating more accurate global PCA results.
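One generic route to such algorithms is to avoid forming the covariance matrix at all: power iteration needs only matrix–vector products, and for vertically partitioned data each product can be assembled from per-site pieces using messages of size O(n) rather than raw columns. The sketch below illustrates this covariance-free idea; it is a standard power-iteration sketch of our own, not the CIDPCA algorithm, and the function name, partition sizes, and demo data are assumptions.

```python
import numpy as np

def distributed_power_iteration(parts, iters=300):
    """Leading eigenvector of the global covariance without ever forming it.

    `parts` are the centered column blocks X_i held at each site. Each round,
    site i contributes the n-vector X_i @ v_i (summed across sites to get
    w = X v), then locally computes its slice X_i.T @ w of the next iterate.
    Only n-vectors and one scalar move between sites per round.
    """
    v = [np.ones(P.shape[1]) for P in parts]          # per-site slices of the iterate
    for _ in range(iters):
        w = sum(P @ vi for P, vi in zip(parts, v))    # one n-vector from each site
        v = [P.T @ w for P in parts]                  # purely local updates
        norm = np.sqrt(sum(vi @ vi for vi in v))      # one scalar from each site
        v = [vi / norm for vi in v]
    return np.concatenate(v)

# Hypothetical demo: 6 features split between two "sites"; feature 2 is given
# a large variance so the leading eigenvector is well separated.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))
X[:, 2] *= 8.0
Xc = X - X.mean(axis=0)
parts = [Xc[:, :3], Xc[:, 3:]]
v_est = distributed_power_iteration(parts)
```

The per-round communication is O(n) per site, independent of the number of features held elsewhere, which is the kind of saving a covariance-free formulation makes possible.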
In this paper, a covariance-free iterative D-PCA (CIDPCA) algorithm is developed to reduce the transmission cost while maintaining a high level of accuracy for principal component analysis in a distributed environment. Our approach is covariance-free and estimates the eigenvalues and eigenvectors directly, thus it is
doi:10.1016/j.patcog.2011.09.002
* Corresponding author.
E-mail address: yfguo@fudan.edu.cn (Y.-F. Guo).
Pattern Recognition 45 (2012) 1211–1219