Parallel RMCLP Classification Algorithm and Its
Application on the Medical Data
Zhiquan Qi, Yingjie Tian, Yong Shi, Senior Member, IEEE, and Vassil Alexandrov
Abstract—To make better use of cloud computing technology, and to overcome the computing and storage requirements that grow rapidly with the number of training samples, this paper proposes a new parallel algorithm: the Parallel Regularized Multiple-Criteria Linear Programming (PRMCLP) algorithm. The RMCLP model is converted into an unconstrained optimization problem and then, in the parallel version, split into several tasks, each of which is mapped to and computed on a separate processor. This approach enables us to obtain the final optimization solution of the whole classification problem efficiently. Finally, we apply the algorithm to medical data classification. All experiments show that our approach greatly increases the training speed of RMCLP in the parallel case.
Index Terms—PRMCLP, parallel algorithm, data mining
1 INTRODUCTION
Nowadays, Big Data brings unprecedented opportunities and challenges [1], [2], [3]. On the other hand, the amount of data is becoming larger and more complex, which forces us to invent novel algorithms to process the vast ocean of information efficiently. This is true, for example, when solving important management problems for which we need to gain enough knowledge to support our decisions. One of the most important reasons is that we still do not have the capability to extract much useful knowledge from Big Data. As a result, more and more researchers have begun to study and introduce new data mining methods and techniques to deal with the increasingly complex data. In this paper, we design a parallel algorithm based on Regularized Multiple-Criteria Linear Programming (RMCLP) [4] to further accelerate the training speed, which provides a possible way to tackle Big Data problems more efficiently.
In order to accelerate the machine learning process, parallelizing classification algorithms is one of the key and basic problems in the era of Big Data. The Support Vector Machine (SVM) ([5], [6], [7]) is one of the most popular classification methods. However, the idea of applying optimization techniques to solve the classification problem dates back more than 70 years, to 1936, when linear discriminant analysis (LDA) ([8]) was first proposed.
In [9], Mangasarian proposed a model similar to SVM using the large-margin idea in the 1960s. From the 1980s to the 1990s, Glover proposed a number of linear programming models to solve discriminant problems with a small sample size of data ([10], [11]). Other classification models can also be found in ([12], [13], [14], [15], [16]). Recently, Shi and his colleagues ([17]) extended Glover's method into classification
via Multiple Criteria Linear Programming (MCLP), and then various improved algorithms were proposed one after another ([4], [18], [19], [20], [21], [22], [23]). These mathematical programming approaches to classification have been applied to many real-world data mining problems, such as credit card portfolio management ([24], [25], [26]), bioinformatics ([27]), information intrusion and detection ([28]), firm bankruptcy ([29]), etc.
Zhiquan Qi, Yingjie Tian (the corresponding author), and Yong Shi are with the Research Center on Fictitious Economy and Data Science, and with the Key Laboratory of Big Data Mining and Knowledge Management, Chinese Academy of Sciences, Beijing 100190, China. Vassil Alexandrov is with ICREA and the Barcelona Supercomputing Center, C/Jordi Girona, 29, Edifici Nexus II, E-08034 Barcelona, Spain.
In order to parallelize a classification algorithm, two strategies are usually employed: 1) divide-and-conquer, or 2) parallelization of the serial algorithm. In the first strategy, a large-scale problem is divided into several sub-problems, which are mutually independent and have the same form as the primal problem. These sub-problems are then solved recursively, and by combining their results the solution of the primal problem is obtained. Typical methods of this kind can be found in [30], [31], [32]. The second strategy is based on the parallel nature of the algorithm itself; several typical methods include [33], [34], [35], [36].
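As a concrete illustration of the first strategy, the following minimal Python sketch splits the training data into independent chunks, solves each chunk separately, and combines the sub-solutions. The functions solve_subproblem and combine are hypothetical stand-ins (an ordinary least-squares fit and simple averaging); they are not part of RMCLP or of the methods cited above.

# A hypothetical sketch of the divide-and-conquer strategy (not the authors' method):
# the data are split into p independent chunks, each chunk is solved separately,
# and the sub-solutions are combined into one model.
from multiprocessing import Pool
import numpy as np

def solve_subproblem(chunk):
    # Stand-in sub-solver: an ordinary least-squares fit on one chunk.
    X, y = chunk
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def combine(weights):
    # Stand-in combination rule: average the sub-models' weight vectors.
    return np.mean(weights, axis=0)

def divide_and_conquer(X, y, p=4):
    chunks = list(zip(np.array_split(X, p), np.array_split(y, p)))
    with Pool(p) as pool:
        sub_solutions = pool.map(solve_subproblem, chunks)
    return combine(sub_solutions)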
In this paper, the focus is on RMCLP, and a parallel version of the RMCLP algorithm (PRMCLP) is designed and proposed. In order to overcome the computing and storage requirements that increase rapidly with the number of training samples, the second strategy is adopted, inspired by some findings in [37].
Firstly, the RMCLP model is converted into an unconstrained optimization problem and then split into several parts, which are mapped onto p processors and computed in parallel. After that, the results obtained by each processor are analyzed and summarized, and the results of the sub-problems are taken as a parameterized input to the next step. This loop is executed until the optimal solution of the whole classification problem is obtained, i.e., until the termination condition is satisfied. Experiments on public datasets show that our method greatly increases the training speed of RMCLP when using p processors.
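To make the loop described above concrete, a minimal sketch of the second strategy is given below, under the assumption that the unconstrained objective decomposes into a sum over training samples: each of the p processors computes a partial gradient on its data chunk, the partial results are summed, the parameters are updated, and the loop repeats until a termination condition is met. The squared-error surrogate objective and the fixed-step gradient update are illustrative assumptions only; the actual RMCLP objective and the update used by PRMCLP are specified later in the paper.

# A hypothetical sketch of the second strategy (parallelizing the serial solver);
# the squared-error objective below is a stand-in for the actual RMCLP objective.
from multiprocessing import Pool
import numpy as np

def partial_gradient(args):
    # Gradient contribution of one data chunk for the stand-in objective
    # 0.5 * ||X w - y||^2.
    X, y, w = args
    return X.T @ (X @ w - y)

def parallel_train(X, y, p=4, lr=1e-3, tol=1e-6, max_iter=1000):
    w = np.zeros(X.shape[1])
    X_chunks, y_chunks = np.array_split(X, p), np.array_split(y, p)
    with Pool(p) as pool:
        for _ in range(max_iter):
            # Each processor evaluates the partial gradient on its chunk.
            grads = pool.map(partial_gradient,
                             [(Xi, yi, w) for Xi, yi in zip(X_chunks, y_chunks)])
            step = lr * sum(grads)
            w -= step
            # Termination condition: the update has become negligibly small.
            if np.linalg.norm(step) < tol:
                break
    return w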
The remaining parts of the paper are organized as follows.