A Comparison of Hive and Spark SQL in Big Data Processing

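"Hive on MapReduce and Spark SQL with Big Bench": this article discusses the development of big data systems, in particular the roles of Hadoop, Hive, and Spark SQL in processing large-scale data. Hadoop, the most popular big data processing framework of the past decade, is built around the MapReduce parallel computation model and the HDFS distributed file system. The MapReduce model is powerful, but it requires developers to write custom programs, which are costly to maintain and reuse.

Hive emerged in response. It provides a SQL-like declarative language, HiveQL, that makes storing and accessing big data simpler. HiveQL is compiled into MapReduce jobs that run on Hadoop, which lowers the barrier to data analysis so that non-programmers can also query and analyze data. In recent years the Spark framework has risen quickly, especially in real-time analytics. With its memory-oriented architecture and flexible processing libraries such as Spark SQL, Spark has become a leading distributed computing framework. Spark SQL is a component of the Spark ecosystem that processes structured data directly and works with the DataFrame and Dataset APIs, improving the efficiency and usability of data processing. Compared with Hadoop MapReduce, Spark SQL processes data faster because it supports in-memory computation and reduces disk I/O.

BigBench is a benchmark suite for evaluating big data query processing systems. It provides a set of complex business intelligence queries intended to test a system's performance, scalability, and stability. Running BigBench on both Hive on MapReduce and Spark SQL makes it possible to compare their performance and gives a basis for choosing a data processing tool in practice. The article discusses the strengths and weaknesses of Hadoop, Hive, and Spark SQL in big data processing, and how BigBench can be used to measure the performance of these systems. As big data technology continues to develop, choosing a suitable tool is important for optimizing data analysis workflows and improving business insight.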

ISSN 2287-4348, Korean Institute of Smart Media (한국스마트미디어학회) & Society for e-Business Studies (한국전자거래학회), 2017 Spring Conference Proceedings, page 263

1. Introduction

Big data systems are designed to handle the huge volumes of data generated in many areas of life, such as network services, agricultural development, and scientific research [1]. These systems face the challenges of collecting, storing, and analyzing big data. Over the last decade, Hadoop [2] has been the most popular framework for big data processing. It provides the MapReduce parallel computation model [3] and the Hadoop Distributed File System (HDFS). However, the MapReduce programming model requires developers to write custom programs, which are hard to maintain and reuse. To simplify storing and accessing big data, Hive [4] was proposed; it supports queries expressed in a SQL-like declarative language, HiveQL, which are compiled into MapReduce jobs executed on Hadoop.
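As a rough illustration of what "compiled into MapReduce jobs" means, the sketch below walks through the three phases (map, shuffle, reduce) that a simple aggregate such as `SELECT category, COUNT(*) FROM sales GROUP BY category` conceptually breaks into. It is plain single-process Python, not Hive's actual compiler or the Hadoop API; the table name `sales`, the column `category`, and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical input rows standing in for a table `sales(category, amount)`.
ROWS = [
    ("books", 12.0),
    ("toys", 5.5),
    ("books", 7.25),
    ("games", 30.0),
    ("toys", 1.0),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for category, _amount in rows:
        yield category, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single COUNT(*) result."""
    return {key: sum(values) for key, values in groups.items()}


def group_by_count(rows):
    """Run the three phases in order on an in-memory list of rows."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(group_by_count(ROWS))
```

Writing even this small pipeline by hand for every query is the maintenance burden the text describes; a declarative language lets the engine generate the equivalent phases from one line of SQL.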

Recently, another big data framework, Spark [5], has emerged as a leading distributed computing framework for real-time analytics, with a memory-oriented architecture and flexible processing libraries. Among these libraries, Spark SQL is a component on top of Spark Core for processing structured data. It reuses the Hive frontend and metastore, giving full compatibility with existing Hive queries. In this work, we evaluate the performance of two big data engines: Hive on MapReduce and Spark SQL.

The characteristics of big data (i.e., volume, velocity, variety, and veracity) make big data systems not only complex to design and implement but also difficult to evaluate. Therefore, big data benchmarks have been developed to evaluate and compare the performance of big data systems and engines [6]. BigBench [7] was proposed as the first end-to-end benchmark for offline big data analytics. It supports evaluating both Hive on MapReduce and Spark SQL. Ivanov et al. [8] evaluated Hive and Spark SQL version 1.4 with BigBench and found that 8 of the 14 pure HiveQL queries ran faster on Spark SQL than on Hive on MapReduce, while the other queries ran slower because of join issues.
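A per-query comparison of this kind comes down to computing, for each query, the ratio of the two engines' runtimes and counting which side wins. The sketch below shows that bookkeeping in Python. The query names and timings are placeholders invented for illustration, not measurements from [8] or from this paper, and `compare_engines` is a hypothetical helper rather than part of BigBench.

```python
def compare_engines(hive_secs, spark_secs):
    """Return per-query speedups (Hive time / Spark time) and the queries Spark wins.

    A speedup above 1.0 means the query ran faster on Spark SQL.
    Only queries timed on both engines are compared.
    """
    speedups = {}
    for query in sorted(hive_secs.keys() & spark_secs.keys()):
        speedups[query] = hive_secs[query] / spark_secs[query]
    faster_on_spark = [q for q, s in speedups.items() if s > 1.0]
    return speedups, faster_on_spark


if __name__ == "__main__":
    # Placeholder timings in seconds; NOT real benchmark results.
    hive = {"q01": 120.0, "q02": 300.0, "q03": 90.0}
    spark = {"q01": 60.0, "q02": 400.0, "q03": 45.0}
    speedups, winners = compare_engines(hive, spark)
    for query, ratio in speedups.items():
        print(f"{query}: {ratio:.2f}x")
    print(f"{len(winners)} of {len(speedups)} queries ran faster on Spark SQL")
```

Reporting the ratio per query, rather than a single total, keeps slow outliers such as join-heavy queries visible instead of averaging them away.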

Currently, the latest stable version of Spark is 2.1, which has many improvements in the Catalyst optimizer for common Spark SQL workloads. Spark 2.X, which applies a new technique called whole-stage code generation to common operators in SQL and DataFrames, achieves a substantial 2 to 10 times speedup compared to Spark 1.X. In this paper, we conduct performance

Performance Evaluation between Hive on MapReduce and Spark SQL with BigBench and PAT

Van-Quyet Nguyen, Kyungbaek Kim

Dept. Electronics and Computer Engineering, Chonnam National University

E-Mail: quyetict@utehy.edu.vn, kyungbaekkim@jnu.ac.kr

Abstract

Big data systems have been proposed to address the challenges of big data, such as collecting, storing, and analyzing data. Hive has been the most popular data warehouse for big data systems; it supports HiveQL, which is compiled into MapReduce jobs executed on Hadoop. Meanwhile, Spark SQL has emerged as a leading big data framework based on in-memory distributed computing. Several studies have evaluated these two frameworks and showed that in most cases Spark SQL is faster than Hive on MapReduce, but in some cases involving joins of large tables Spark SQL is slower than Hive on MapReduce. The latest version of Spark SQL has many improvements, such as those in the Catalyst optimizer, that can provide better performance when handling SQL queries. In this paper, we present new results of a performance evaluation between Hive on MapReduce and the recent Spark SQL on our big data system, using a benchmarking tool called BigBench and the Performance Analysis Tool (PAT). Our experiments show that the recent Spark SQL outperforms Hive on MapReduce on all 30 BigBench queries. Moreover, we observed that Spark SQL generates less network traffic and maintains higher memory utilization than Hive on MapReduce.
