DM-Midware: A Solution for Efficient Data Mining in Cloud Computing Environments


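Cloud computing, as a high-performance computing model, is increasingly favored by data mining applications. However, the complexity of traditional parallel/cloud computing techniques is a challenge for ordinary users, who find it hard to express a parallel computing paradigm directly, especially when deploying applications in the cloud. To address this problem, the paper proposes DM-Midware, a middleware designed for efficient data mining in heterogeneous cloud environments.

DM-Midware's core strength is its user-friendliness. By providing a set of user programming APIs, it hides the tedious details of MapReduce programming, so users do not need to understand the underlying parallelization principles, such as parallel processing on SMP (Symmetric Multi-Processing) or multi-core architectures. Even users unfamiliar with parallel programming can therefore add parallel computation to their cloud applications, which greatly lowers the technical barrier.

The middleware's core mechanism is a directive-based parallelization scheme that automatically converts an originally serial program into a parallel version able to use multi-core resources, improving performance. The conversion does not require users to hand-write complex parallel code, which significantly reduces the chance of errors.

To optimize the overall efficiency of the cloud system, the paper proposes a two-step scheduling strategy. It first analyzes the characteristics of the tasks and then dynamically allocates computing resources, so that tasks can exploit the potential of the different hardware platforms and system throughput is maximized. This matters especially for large-scale data mining tasks, because it addresses dynamic resource allocation and load balancing in the cloud.

The experimental section evaluates DM-Midware in a private cloud by executing a representative data mining algorithm, and reports good scalability and adaptability. According to the summary, DM-Midware remains efficient as datasets grow.

In short, DM-Midware is a cloud data mining middleware that simplifies parallel programming and improves the usability and performance of data mining applications in the cloud. It combines automatic mapper launching, API-based invocation of heterogeneous resources, and a flexible parallelization strategy. As cloud computing develops and data volumes grow, DM-Midware may help enterprise users build and optimize data mining workflows in the cloud more quickly.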

DM-Midware: A Middleware to Enable High Performance Data Mining in Heterogeneous Cloud

Guoyu Ou, Ying Liu (1,2)

(1) School of Computer and Control, University of Chinese Academy of Sciences
(2) Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences
80 ZhongGuanCun East Road, Beijing, China 100190
ouguoyu10@mails.ucas.ac.cn, yingliu@ucas.ac.cn

Xinyu Ma, Cheng Wang

Agilent Technologies, Inc.
Beijing, China 100102
{xin-yu_ma, zheng_wang}@agilent.com

Abstract: Cloud computing has become a popular high performance computing model where resources are provided as services over the Web. Users are starting to adopt the cloud model in data mining applications. However, due to the complexity of parallel/cloud computing, it is difficult for average users to express a parallel computing paradigm for their applications in the cloud. In order to isolate users from the complexity of parallel/cloud programming, a middleware to enable high performance data mining, called DM-Midware, is proposed. It hides the details of MapReduce programming from users by automatically launching mappers through a set of user programming APIs. A directive-based parallelization scheme automatically “translates” a serial program into an SMP or multi-core based parallel program. Heterogeneous computing resources can be invoked to conduct parallel execution through an API-based scheme. A two-step scheduling scheme is proposed to maximize the throughput of the cloud system. We evaluate DM-Midware by executing a representative data mining algorithm in a private cloud. Experimental results demonstrate good scalability and adaptability.

Keywords: cloud computing; data mining; parallel computing

I. INTRODUCTION

A typical cloud usually consists of hundreds or thousands of servers interconnected by some type of network connection. Because a cloud is in essence a distributed system, existing serial data mining algorithms cannot make full use of the underlying hardware resources in the cloud. Although many parallel data mining algorithms have been proposed, such as SMP-based (Symmetric Multi-Processing computer) algorithms [1], cluster-based algorithms [2], GPU-based algorithms [3], etc., they are either not compatible with cloud infrastructure or not able to scale to distributed systems at all. Having observed this problem, researchers have worked on efficient implementations of data mining algorithms in the cloud using the MapReduce programming model [4, 5, 6]. MapReduce is a programming model for processing parallelizable problems across huge datasets using a large number of nodes. Parallelizing algorithms with MapReduce is challenging because it requires not only solid knowledge of data structures, algorithms and distributed systems, but also strong programming skills. In addition, it is difficult for users to express a parallel computing paradigm in the cloud. Thus, a specialized computational infrastructure is required to simplify parallel execution in the cloud and isolate users from the complexity.
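To make the MapReduce model mentioned above concrete, the following is a minimal single-process sketch of its map, shuffle and reduce phases, applied to counting item support over a few transactions (the kind of counting an Apriori-style algorithm performs). It is not Hadoop code and is not taken from the paper: the function names (`map_transaction`, `shuffle`, `reduce_counts`, `item_support`) and the sample data are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_transaction(transaction):
    """Map: emit an (item, 1) pair for each distinct item in one transaction."""
    return [(item, 1) for item in set(transaction)]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: sum the values emitted for one key."""
    return key, sum(values)


def item_support(transactions):
    # In a real cluster the map calls would run on many nodes in parallel;
    # here they run sequentially in one process (illustrative only).
    mapped = chain.from_iterable(map_transaction(t) for t in transactions)
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical sample data.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk", "butter", "bread"],
]
support = item_support(transactions)
```

Even in this toy form, the user has to decide what the keys are, what each mapper emits, and how the reducer combines values, which hints at the design burden the paragraph describes.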

Is it possible to enable users with little cloud computing knowledge to perform data mining in the cloud through a middleware? In other words, does a common execution characteristic or pattern exist in data mining applications, such that we can abstract the execution pattern and implement a middleware to drive data mining applications in a cloud? Our answer is yes. After a careful investigation, we find that data parallelism dominates data mining applications. For example, a K-Nearest-Neighbors (KNN) classifier predicts the class label of a novel object A by the majority vote of the class labels of its K nearest neighbors. A massive number of processes/threads can be launched, where each thread independently calculates the distance between A and an object in the training set. Eventually, a reduction-like operation is performed to obtain the global K nearest neighbors of A. Other algorithms, such as decision tree, K-means, Apriori, GSP (sequence mining), and neural networks, work in the same fashion in each iteration.
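The KNN pattern described above (independent distance computations followed by a reduction) can be sketched as follows. This is an illustrative sketch, not DM-Midware's implementation: the function names, the chunking strategy, and the use of a thread pool are assumptions. In CPython, threads do not speed up this CPU-bound work; they serve here only to show the structure of the computation.

```python
import heapq
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def local_top_k(query, chunk, k):
    """Map step: score every object in one chunk and keep the k closest."""
    scored = ((math.dist(query, features), label) for features, label in chunk)
    return heapq.nsmallest(k, scored)


def knn_predict(query, training_set, k=3, workers=4):
    # Split the training set into roughly equal chunks, one per worker.
    size = max(1, math.ceil(len(training_set) / workers))
    chunks = [training_set[i:i + size] for i in range(0, len(training_set), size)]

    # Each worker processes its chunk independently (the data-parallel part).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda c: local_top_k(query, c, k), chunks))

    # Reduction: merge per-chunk candidates into the global k nearest neighbors.
    nearest = heapq.nsmallest(k, (item for part in partial for item in part))

    # Majority vote over the labels of the k nearest neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: (feature vector, class label).
training = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b"),
]
prediction = knn_predict((0.3, 0.3), training, k=3, workers=2)
```

Keeping only the k best candidates per chunk means the reduction merges at most k items from each worker rather than the full list of distances, which is why the final step stays cheap as the training set grows.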

Based on the above observation, we propose a middleware, called DM-Midware, to fill the gap between data mining and the cloud/parallel programming model. It hides the implementation details of cloud/parallel programming from users and helps them leverage heterogeneous hardware resources in a cloud, consisting of clusters, multi-core processors, GPUs and FPGAs.

Contributions of this paper can be summarized as follows:

1) A set of user programming APIs is proposed to automatically launch multiple MapReduce mappers with minimum user input.

2) Two parallelization schemes are proposed: the directive-based scheme “translates” a serial program into an SMP or multi-core based parallel program; the API-based scheme invokes heterogeneous computing resources for parallel execution. Rather than writing MapReduce code or parallel code, users only need to specify the parallel sections in comment-like directives and provide the corresponding parameters.

3) A two-step scheduling algorithm is proposed. It ensures that various heterogeneous computing resources in a cloud work in a load-balanced manner. This is a highlight of DM-Midware.

4) Hadoop Streaming APIs are adopted by DM-Midware so that C/C++/Python/Perl programs can also be executed in

2013 IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT)

DOI 10.1109/WI-IAT.2013.152
