ISSN 2287-4348, Korean Institute of Smart Media (한국스마트미디어학회) & Society for e-Business Studies (한국전자거래학회), Proceedings of the 2017 Spring Conference, p. 263
1. Introduction
Big data systems are designed to handle the huge volumes
of data generated in many areas of our lives, such as
network services, agricultural development, and scientific
research [1]. These systems face challenges
of collecting, storing, and analyzing big data. During the last
decade, Hadoop [2] has been the most popular framework
for big data processing. It provides a parallel computation
model, MapReduce [3], and the Hadoop Distributed File
System (HDFS). However, the MapReduce programming
model requires developers to write custom programs that
are hard to maintain and reuse. To simplify storing
and accessing big data, Hive [4] was proposed to support
queries expressed in a SQL-like declarative language,
HiveQL, which are compiled into MapReduce jobs executed
on Hadoop.
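As a rough illustration of the work a single declarative query implies, the following pure-Python sketch mimics the map, shuffle, and reduce phases for a query such as `SELECT category, COUNT(*) FROM sales GROUP BY category`. The table name `sales`, the column `category`, and all function names are hypothetical. This is not the output of Hive's compiler, only a minimal single-process model of the MapReduce pattern.

```python
from collections import defaultdict


def map_phase(rows):
    # Emit one (key, 1) pair per input row, as a mapper would.
    for row in rows:
        yield row["category"], 1


def shuffle_phase(pairs):
    # Group values by key, standing in for the framework's shuffle step.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the values collected for each key, as a reducer would.
    return {key: sum(values) for key, values in grouped.items()}


def count_by_category(rows):
    # Chain the three phases in the order a MapReduce job runs them.
    return reduce_phase(shuffle_phase(map_phase(rows)))


# Hypothetical sample rows, for illustration only.
rows = [{"category": "books"}, {"category": "music"}, {"category": "books"}]
result = count_by_category(rows)
```

Even this simple aggregation needs three hand-written functions, which hints at why a one-line HiveQL statement is easier to maintain and reuse.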
Recently, another big data framework, Spark [5], has
emerged as a leading distributed computing framework for
real-time analytics, with a memory-oriented architecture and
flexible processing libraries. Within it, Spark SQL is a
component on top of Spark Core for processing structured
data. It reuses the Hive frontend and metastore, giving full
compatibility with existing Hive queries. In this work, we
evaluate the performance of two big data engines:
Hive on MapReduce and Spark SQL.
The characteristics of big data (i.e., volume, velocity,
variety, and veracity) not only make the design and
implementation of big data systems complex but also
make these systems difficult to evaluate. Therefore, big
data benchmarks have been developed to evaluate and
compare the performance of big data systems and engines
[6]. BigBench [7] was proposed as the first end-to-end
benchmark for big data offline analytics. It supports
evaluating both Hive on MapReduce and Spark SQL. Ivanov
et al. [8] evaluated Hive and Spark SQL version 1.4 with
BigBench and found that 8 of the 14 pure HiveQL queries
ran faster on Spark SQL than on Hive on MapReduce,
while the other queries ran slower because of join
issues.
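A per-query comparison of this kind reduces to a ratio of runtimes. The sketch below shows one way to compute it; the helper names and the timing values are hypothetical placeholders, not measurements from [8] or from this paper.

```python
def speedup(hive_seconds, spark_seconds):
    # Ratio above 1.0 means the query ran faster on Spark SQL.
    return hive_seconds / spark_seconds


def faster_on_spark(timings):
    # timings maps a query name to (Hive runtime, Spark SQL runtime) in seconds.
    return sorted(
        query
        for query, (hive_s, spark_s) in timings.items()
        if speedup(hive_s, spark_s) > 1.0
    )


# Placeholder runtimes in seconds, for illustration only.
timings = {"qA": (120.0, 60.0), "qB": (90.0, 135.0), "qC": (200.0, 50.0)}
faster = faster_on_spark(timings)
```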
Currently, the latest stable version of Spark is 2.1, which
includes many improvements to the Catalyst optimizer for
common Spark SQL workloads. Spark 2.X, which applies a new
technique called whole-stage code generation to common
operators in SQL and DataFrames, achieves substantial
speedups of 2 to 10 times compared with
Spark 1.X. In this paper, we conduct performance
Performance Evaluation between Hive on
MapReduce and Spark SQL with BigBench and PAT
Van-Quyet Nguyen, Kyungbaek Kim
Dept. Electronics and Computer Engineering, Chonnam National University
E-Mail: quyetict@utehy.edu.vn, kyungbaekkim@jnu.ac.kr
Abstract
Big data systems have been proposed to address the challenges of big data, such as collecting, storing, and
analyzing data. Hive has been the most popular data warehouse for big data systems,
supporting HiveQL, which is compiled into MapReduce jobs executed on Hadoop; meanwhile, Spark SQL has
emerged as a leading big data framework based on in-memory distributed computing. Several
studies have evaluated these two frameworks and shown that Spark SQL is faster than Hive on MapReduce
in most cases, but slower in some cases that involve joining large tables.
The latest version of Spark SQL includes many improvements, such as the Catalyst optimizer, that can
improve the performance of SQL query handling. In this paper, we present new results
of a performance evaluation between Hive on MapReduce and the recent Spark SQL on our big data system,
using a benchmarking tool called BigBench and the Performance Analysis Tool (PAT). Our experiments show
that the recent Spark SQL outperforms Hive on MapReduce on all 30 BigBench queries. Moreover, we
observed that Spark SQL generates less network traffic and maintains higher memory utilization than
Hive on MapReduce.