A Novel Compression Algorithm Decision Method for Spark Shuffle Process
Shanshan Huang¹,², Jungang Xu¹, Renfeng Liu¹, and Husheng Liao²
¹School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China
²Faculty of Information Technology, Beijing University of Technology, Beijing, China
Email: huangss118@emails.bjut.edu.cn, xujg@ucas.ac.cn, liurenfeng16@mails.ucas.ac.cn, liaohs@bjut.edu.cn
Abstract—With the wide application of the Spark big data platform, some problems have been exposed in practice, and one of the main problems is performance optimization. The Shuffle module is one of the core modules of Spark, and it is also an important module of several other distributed big data computing frameworks; its design is a key factor that directly determines the performance of a big data computing framework. The main factors affecting the Shuffle process are CPU utilization, I/O read/write rate and network transmission rate, and any one of them can become the bottleneck during application execution. The time spent on network data transmission, the I/O read and write time, and the CPU utilization are all closely related to the size of the data being processed. Spark therefore provides compression configuration options and different compression algorithms for users to select. Different compression algorithms differ in compression speed and compression ratio, but users usually keep the default configuration even though they run different applications, so the optimal configuration is not achieved. In order to obtain the optimal compression configuration for the Shuffle process, a cost optimization model for the Spark Shuffle process is proposed in this paper, which enables users to get the best compression configuration before application execution. The experimental results show that the prediction model for compression configuration has an accuracy of 58.3%, and the proposed cost optimization model can improve performance by 48.9%.
Keywords-Spark; Shuffle process; compression configuration; cost model
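To make the compression configuration options discussed above concrete, the following minimal sketch (not part of the original paper) shows how a Spark 2.x user would set the standard shuffle compression parameters when building an application. The keys spark.shuffle.compress, spark.shuffle.spill.compress and spark.io.compression.codec are Spark's own configuration names, and choosing among the supported codecs (lz4, lzf, snappy) is exactly the decision that the cost model proposed in this paper aims to automate.

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: explicitly setting the Shuffle compression options
    // whose choice the proposed cost model is intended to guide.
    val spark = SparkSession.builder()
      .appName("ShuffleCompressionDemo")
      // Compress map output files produced during the Shuffle process (default: true).
      .config("spark.shuffle.compress", "true")
      // Compress data spilled to disk during shuffles (default: true).
      .config("spark.shuffle.spill.compress", "true")
      // Codec used for shuffle (and broadcast/RDD) compression: "lz4", "lzf" or "snappy".
      .config("spark.io.compression.codec", "snappy")
      .getOrCreate()

    // ... the application's jobs run here; the chosen codec trades CPU time
    // against the volume of data written to disk and sent over the network.
    spark.stop()

The same keys can also be supplied at submission time, for example via spark-submit --conf spark.io.compression.codec=snappy, without changing the application code.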
I. INTRODUCTION
In recent years, with the progress of science and technology, data sources ranging from enterprise operations to various kinds of technologies and equipment have been generating valuable data streams all the time. International Data Corporation (IDC) predicted that by 2025 the world's total data volume will rise to 163 ZB [1]. Data has become essential for the normal life and operation of consumers, governments and enterprises. Consumers and enterprises continue to generate, share and access data across different devices and clouds, and the volume of data will grow faster than ever. Therefore, how to improve the efficiency of extracting important information from massive data has become an inevitable research direction.
With the deepening of research, researchers have found that in many areas the requirements for data processing speed and complexity are gradually increasing. For example, in addition to simple queries, many machine learning algorithms and graph analysis algorithms that need multiple iterations are widely used; at the same time, some real-time data streaming analysis algorithms that can ensure timely access to information have proved to be effective. In order to solve these problems, researchers put forward a fast and general data processing platform for large clusters, Spark, which can meet most data processing demands and also has high extensibility [2].
Spark is a big data analysis platform developed by the AMP Lab at the University of California, Berkeley, and it introduced the concept of the Resilient Distributed Dataset (RDD) [3]. Spark can not only handle batch data, but also supports data warehousing, stream processing, graph computation and other paradigms, making it a multifunctional platform in the big data system domain [4]. Due to its excellent data processing capacity and high scalability, many corporations have already adopted Spark in production. For example, Yahoo used Spark in Audience Expansion for more accurate targeting of users through advertising, and Baidu launched a Spark-based big data processing product called Baidu MapReduce (BMR), etc.
With the wide application of the Spark platform, some problems have been exposed, and one of the main problems is performance optimization. The execution environment of a big data platform is extremely complex, and it is difficult to reach the theoretical performance peak due to the combined effects of the underlying hardware, architecture, operating system, Spark itself and the application program written by the user. Therefore, how to optimize the performance of Spark is a problem worthy of research.
Spark offers more than 180 configuration parameters for users to adjust, which is also the simplest and most effective way for users to optimize their applications. The Shuffle module is one of the core modules of Spark, and the Shuffle process involves more than 50 configuration parameters. Therefore, the configuration of the Shuffle process is a key factor that directly determines the performance of Spark. The main configuration parameters of the Shuffle process involve CPU utilization, I/O read/write rate and network transmission rate; among them, the time spent on network data transmission, the I/O read and write time and the CPU share are closely connected with the size of the data. Spark provides compression configuration options and different compression algorithms for users to select. Different compression algorithms have different compression rates and