DM-Midware: A Middleware to Enable High Performance Data Mining in Heterogeneous Cloud
Guoyu Ou¹, Ying Liu¹,²
¹ School of Computer and Control, University of Chinese Academy of Sciences
² Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences
80 ZhongGuanCun East Road, Beijing, China 100190
ouguoyu10@mails.ucas.ac.cn, yingliu@ucas.ac.cn
Xinyu Ma³, Cheng Wang³
³ Agilent Technologies, Inc., Beijing, China 100102
{xin-yu_ma, zheng_wang}@agilent.com
Abstract—Cloud computing has become a popular high
performance computing model where resources are provided as
services over the Web. Users are starting to adopt the cloud model for
data mining applications. However, due to the complexity of
parallel/cloud computing, it is difficult for average users to
express a parallel computing paradigm for their applications in
the cloud. To isolate users from the complexity of
parallel/cloud programming, we propose DM-Midware, a middleware
that enables high performance data mining. It
hides the details of MapReduce programming from users by
automatically launching mappers through a set of user
programming APIs. A directive-based parallelization scheme
automatically “translates” a serial program into an SMP or
multi-core based parallel program. An API-based scheme invokes
heterogeneous computing resources for parallel execution.
A two-step scheduling scheme is proposed to
maximize the throughput of the cloud system. We evaluate
DM-Midware by executing a representative data mining algorithm in
a private cloud. Experimental results demonstrate good
scalability and adaptability.
Keywords—cloud computing; data mining; parallel computing
I. INTRODUCTION
A typical cloud consists of hundreds or thousands
of servers interconnected by some type of network.
Because a cloud is in essence a distributed system, existing serial
data mining algorithms cannot make full use of the
underlying hardware resources in a cloud. Although many
parallel data mining algorithms have been proposed, such as
SMP-based (Symmetric Multi-Processing computer)
algorithms [1], cluster-based algorithms [2], and GPU-based
algorithms [3], they are either incompatible with cloud
infrastructure or unable to scale to distributed systems at all.
Having observed this problem, researchers have worked on
efficient implementations of data mining algorithms in the cloud
using the MapReduce programming model [4, 5, 6]. MapReduce is a
programming model for processing parallelizable problems
across huge datasets using a large number of nodes.
Parallelizing algorithms with MapReduce is challenging because
it requires not only solid knowledge of data structures,
algorithms and distributed systems, but also strong
programming skills. In addition, it is difficult for users to
express a parallel computing paradigm in the cloud. Thus, a
specialized computational infrastructure is required to simplify
parallel execution in the cloud and isolate users from the
complexity.
Is it possible to enable users with little cloud computing
knowledge to perform data mining in the cloud through a
middleware? In other words, do data mining applications share a
common execution characteristic or pattern
that we can abstract and use to implement
a middleware that drives data mining applications in a cloud? Our
answer is yes. After a careful investigation, we find that
data parallelism dominates data mining applications. For
example, the K-Nearest-Neighbors (KNN) classifier predicts the
class label of a novel object A by the majority vote of the class
labels of its K nearest neighbors. A massive number of processes/threads
can be launched, where each thread independently calculates the distance
between A and an object in the training set.
Finally, a reduction-like operation is performed to obtain
the global K nearest neighbors of A. Other algorithms, such as
decision trees, K-means, Apriori, GSP (sequence mining), and
neural networks, work in the same fashion in each iteration.
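The data-parallel pattern described for KNN can be sketched in a few lines of Python. This is only an illustration of the map-then-reduce structure, not DM-Midware code: the function names `local_knn` and `knn_predict`, the chunking of the training set, and the use of a thread pool are all assumptions made for the example. Each worker computes distances for its own chunk and keeps its local K best candidates; a final reduction merges the candidates into the global K nearest neighbors and takes the majority vote.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_knn(query, chunk, k):
    """Map step (hypothetical helper): k nearest (distance, label) pairs in one chunk."""
    scored = ((math.dist(query, point), label) for point, label in chunk)
    return heapq.nsmallest(k, scored)


def knn_predict(query, training_set, k=3, n_workers=4):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    # Assumption: a simple strided split gives each worker an independent chunk.
    chunks = [training_set[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda c: local_knn(query, c, k), chunks))
    # Reduction step: merge the local candidates into the global k nearest.
    candidates = (pair for part in partials for pair in part)
    global_knn = heapq.nsmallest(k, candidates)
    votes = Counter(label for _, label in global_knn)
    return votes.most_common(1)[0][0]


training = [
    ((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
    ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b"),
]
print(knn_predict((0.2, 0.1), training, k=3))
```

Keeping only K candidates per chunk is sufficient because every global nearest neighbor must also be among the K nearest within its own chunk. Threads are used here purely to show the structure; in a cloud setting the map step would run on separate processes or nodes.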
Based on the above observation, we propose a middleware,
called DM-Midware, to fill the gap between data mining and
the cloud/parallel programming model. It hides the implementation
details of cloud/parallel programming from users, and helps users
leverage heterogeneous hardware resources in a cloud,
consisting of clusters, multi-core processors, GPUs and FPGAs.
Contributions of this paper can be summarized as follows:
1) A set of user programming APIs is proposed to
automatically launch multiple MapReduce mappers with
minimum user input.
2) Two parallelization schemes are proposed: the directive-based
scheme “translates” a serial program into an SMP or
multi-core based parallel program; the API-based scheme invokes
heterogeneous computing resources for parallel execution.
Rather than writing MapReduce code or parallel code,
users only need to mark the parallel sections with comment-like
directives and provide the corresponding parameters.
3) A two-step scheduling algorithm is proposed. It ensures
that various heterogeneous computing resources in a cloud
work in a load-balanced manner. This is a highlight of
DM-Midware.
4) Hadoop Streaming APIs are adopted by DM-Midware
so that C/C++/Python/Perl programs can also be executed in
2013 IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT)
978-1-4799-2902-3/13 $31.00 © 2013 IEEE
DOI 10.1109/WI-IAT.2013.152