A Novel Compression Algorithm Decision Method for Spark Shuffle Process
Shanshan Huang¹,², Jungang Xu¹, Renfeng Liu¹, and Husheng Liao²
¹School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
²Faculty of Information Technology, Beijing University of Technology, Beijing, China
Email: huangss118@emails.bjut.edu.cn, xujg@ucas.ac.cn, liurenfeng16@mails.ucas.ac.cn, liaohs@bjut.edu.cn
Abstract—With the wide application of the Spark big data platform, some problems have been exposed in practice, and one of the main problems is performance optimization. The Shuffle module is one of the core modules of Spark, and it is also an important module of several other distributed big data computing frameworks; its design is a key factor that directly determines the performance of a big data computing framework. The main factors affecting the Shuffle process are CPU utilization, I/O read/write rate and network transmission rate, and any one of them can become the bottleneck during application execution. The time spent on network data transmission, the I/O read and write time, and the CPU utilization are all closely related to the size of the data being processed. Spark therefore provides compression configuration options and different compression algorithms for users to select. Different compression algorithms differ in compression speed and compression ratio, but users usually keep the default configuration even though they run different applications, so the optimal configuration is not achieved. In order to obtain the optimal compression configuration for the Shuffle process, a cost optimization model for the Spark Shuffle process is proposed in this paper, which enables users to get the best compression configuration before application execution. The experimental results show that the prediction model for compression configuration has an accuracy of 58.3%, and the proposed cost optimization model can improve performance by 48.9%.
Keywords-Spark; Shuffle process; compression configuration; cost model
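To make the compression configuration options discussed above concrete, the following minimal sketch (not part of the original paper) shows how a Spark 2.x user would set the standard shuffle compression parameters when building an application. The keys spark.shuffle.compress, spark.shuffle.spill.compress and spark.io.compression.codec are Spark's own configuration names, and choosing among the supported codecs (lz4, lzf, snappy) is exactly the decision that the cost model proposed in this paper aims to automate.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: explicitly setting the Shuffle compression options
    // whose choice the proposed cost model is intended to guide.
    val spark = SparkSession.builder()
      .appName("ShuffleCompressionDemo")
      // Compress map output files produced during the Shuffle process (default: true).
      .config("spark.shuffle.compress", "true")
      // Compress data spilled to disk during shuffles (default: true).
      .config("spark.shuffle.spill.compress", "true")
      // Codec used for shuffle (and broadcast/RDD) compression: "lz4", "lzf" or "snappy".
      .config("spark.io.compression.codec", "snappy")
      .getOrCreate()

    // ... the application's jobs run here; the chosen codec trades CPU time
    // against the volume of data written to disk and sent over the network.
    spark.stop()

The same keys can also be supplied at submission time, for example via spark-submit --conf spark.io.compression.codec=snappy, without changing the application code.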
I. INTRODUCTION
In recent years, with the progress of science and technology, data sources ranging from enterprise operations to various kinds of technologies and equipment have been generating valuable data streams all the time. International Data Corporation (IDC) predicted that by 2025 the world's total data volume will rise to 163 ZB [1]. Data has become essential for the normal life and operation of consumers, governments and enterprises. Consumers and enterprises continue to generate, share and access data across different devices and clouds, and the volume of data will grow faster than ever. Therefore, how to improve the efficiency of extracting important information from massive data has become an inevitable research direction.
With the deepening of research, researchers have found that in many areas the requirements for data processing speed and complexity are gradually increasing. For example, in addition to simple queries, many machine learning algorithms and graph analysis algorithms that need multiple iterations are widely used; at the same time, some real-time data streaming analysis algorithms that can ensure timely access to information have proved to be effective. In order to solve these problems, researchers put forward a fast and general data processing platform for large clusters, Spark, which can meet most data processing demands and also has high extensibility [2].
Spark is a big data analysis platform developed by the AMP Lab at the University of California, Berkeley, and it introduced the concept of the Resilient Distributed Dataset (RDD) [3]. Spark can not only handle batch data, but also supports data warehousing, stream processing, graph computation and other paradigms, making it a multifunctional platform in the big data system domain [4]. Due to its excellent data processing capacity and high scalability, many corporations have already adopted Spark in production. For example, Yahoo used Spark in Audience Expansion for more accurate targeting of users through advertising, and Baidu launched a Spark-based big data processing product called Baidu MapReduce (BMR), etc.
With the wide application of the Spark platform, some problems have been exposed, and one of the main problems is performance optimization. The execution environment of a big data platform is extremely complex, and it is difficult to reach the theoretical performance peak due to the combined effects of the underlying hardware, architecture, operating system, Spark itself and the application program written by the user. Therefore, how to optimize the performance of Spark is a problem worthy of research.
Spark offers more than 180 configuration parameters for users to adjust, which is also the simplest and most effective way for users to optimize their applications. The Shuffle module is one of the core modules of Spark, and the Shuffle process involves more than 50 configuration parameters. Therefore, the configuration of the Shuffle process is a key factor that directly determines the performance of Spark. The main configuration parameters of the Shuffle process involve CPU utilization, I/O read/write rate and network transmission rate; among them, the time spent on network data transmission, the I/O read and write time and the CPU share are closely connected with the size of the data. Spark provides compression configuration options and different compression algorithms for users to select. Different compression algorithms have different compression rates and