VPCH: A Consistent Hashing Algorithm for Better Load Balancing in a Hadoop
Environment
Qi Liu*, Weidong Cai, Jian Shen, Baowei Wang,
Zhangjie Fu
Department of Computer and Software
Nanjing University of Information Science and
Technology
Nanjing, China
qrankl@163.com
Nigel Linge
The University of Salford
Salford, Greater Manchester, UK
n.linge@salford.ac.uk
Abstract—MapReduce (MR) is a popular programming model for processing large data sets on data clusters or grids, e.g. in a Hadoop environment. Load balancing, as a key factor affecting the performance of task distribution, has recently attracted considerable attention. Current MR implementations distribute tasks to reducers by hashing keys and applying a modulo operation, which can lead to uneven data distribution and skewed loads, thereby degrading the performance of the entire distribution system. In this paper, a virtual partition consistent hashing (VPCH) algorithm is proposed for the reduce stage of MR processes, in order to balance job allocation. According to the results, our method reduces task execution time whether or not the MJR (mapreduce.job.reduce.slowstart.completedmaps) parameter is set.
Keywords- MapReduce; Load Balancing; Consistent Hashing
I. INTRODUCTION
In recent years, with the explosive growth of data and data processing on the Internet, cloud computing has been widely studied in both academia and industry, in order to provide users with a distributed system offering on-demand services, computing power and storage resources. Proposed by Google in 2004, MapReduce (MR) [1] has become the most popular distributed computing model used in cloud environments, where large-scale datasets can be handled and processed transparently by map and reduce procedures on the cloud infrastructure.
Besides map and reduce, other internal processes integrated into MR have also been analyzed and optimized [2]. Taking the partition procedure as an example, Hadoop uses a hash function with a modulo operation to compute partition keys, so that, in principle, each partition holds roughly the same number of records. However, such a method can cause a skewed allocation of tasks to different reducers. In this paper, a new scheme called VPCH is proposed to implement virtual partitioning. This scheme ensures that the load on each reducer during the reduce phase is relatively balanced, and the total execution time is reduced thanks to the even distribution of tasks across reducers.
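For reference, Hadoop's stock hash partitioning amounts to taking the key's hash modulo the number of reduce tasks; the following simplified Java sketch mirrors the behavior of the built-in HashPartitioner. Because nothing in this rule accounts for how many records share a key, a handful of heavy keys or an unlucky residue pattern can overload a single reducer.

    // Simplified sketch of Hadoop's default hash-modulo partitioning.
    // Every record with the same key lands on reducer (hash(key) mod R),
    // so heavily repeated keys pile up on one reducer and skew its load.
    public class HashModuloPartitioner<K, V>
            extends org.apache.hadoop.mapreduce.Partitioner<K, V> {

        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the modulo result is always non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }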
A practical Hadoop environment has been implemented in our laboratory, retaining both the original and the refined partitioning schemes so that users can select and compare either of them for their actual tasks and/or for performance evaluation. According to our results, the VPCH algorithm shortens execution time for both the reduce phase and the overall task completion.
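To make the comparison concrete, the sketch below shows a generic consistent-hash partitioner with virtual nodes, the family of techniques VPCH belongs to. It is an illustration under our own assumptions (the class name VirtualNodePartitioner and the choice of 100 virtual nodes per reducer are made up for the example), not the exact VPCH algorithm, which is detailed in Section III. A job opts into a custom scheme via Job.setPartitionerClass(), which is how the original and refined partitioners can coexist on the same cluster.

    // Illustrative consistent hashing with virtual nodes (an assumption-based
    // sketch, not the exact VPCH scheme of Section III). Each reducer owns
    // several positions on a hash ring; a key is routed to the first virtual
    // node clockwise from its own hash, smoothing the load compared with a
    // plain modulo.
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class VirtualNodePartitioner<K, V>
            extends org.apache.hadoop.mapreduce.Partitioner<K, V> {

        private static final int VIRTUAL_NODES_PER_REDUCER = 100; // assumed value

        private final TreeMap<Integer, Integer> ring = new TreeMap<>();
        private int builtFor = -1;

        private void buildRing(int numReduceTasks) {
            ring.clear();
            for (int r = 0; r < numReduceTasks; r++) {
                for (int v = 0; v < VIRTUAL_NODES_PER_REDUCER; v++) {
                    int pos = ("reducer-" + r + "#" + v).hashCode() & Integer.MAX_VALUE;
                    ring.put(pos, r); // ring position -> reducer id
                }
            }
            builtFor = numReduceTasks;
        }

        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            if (builtFor != numReduceTasks) {
                buildRing(numReduceTasks);
            }
            int h = key.hashCode() & Integer.MAX_VALUE;
            // Walk clockwise to the first virtual node at or after h,
            // wrapping to the start of the ring if none is found.
            SortedMap<Integer, Integer> tail = ring.tailMap(h);
            return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
        }
    }

    // A job selects the scheme explicitly, e.g.:
    //   job.setPartitionerClass(VirtualNodePartitioner.class);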
The rest of the paper is organized as follows. Related work is given in Section II, followed by Section III, where our load balancing approach is detailed. In Section IV, the testing environment and corresponding scenarios are designed for the verification and evaluation of the VPCH approach. Finally, conclusions and future work on load balancing strategies in a cloud platform are discussed in Section V.
II. RELATED WORK
There are various approaches to balancing the workload across the data nodes of an MR system. The authors in [3] repartition tasks from slow workers to faster ones by monitoring Map and Reduce jobs in real time, ensuring that all available nodes finish their jobs at roughly the same time. This method can handle all kinds of load skew, but it alters Hadoop substantially and requires complicated modification and configuration, as well as extra network cost for redistributing tasks.
Some studies address the problem of mapping sets of tasks onto sets of processors according to context information (e.g. execution history [4], sampled data skew [5], etc.), so that the overall execution time can be minimized. However, these methods were not verified on a practical cloud platform, and consequently ignored the corresponding configuration of a cloud computing environment, e.g. the number of reducers.
Neural Network (NN) algorithms have also been employed in the cloud. An adaptive partitioning algorithm called HAP was proposed in [6], where reduce jobs are distributed according to a work threshold estimated with an SVM in a heterogeneous environment. On the other hand, extra execution time is consumed when splitting and merging the input key-value (K-V) chains at the reduce phase. In addition, the training time of these models needs to be taken into account as well.