consists of multiple streams. For example, integer columns consist
of two streams: a present bit stream, which denotes whether each
value is null, and the actual data stream. The stripe footer contains
a directory of the stream locations of each column. As part of the
index data, min and max values for each column are stored, as well
as the row positions within each column. Using the built-in index,
row groups (default of 10,000 rows) can be skipped if the query
predicate does not fall within the minimum and maximum values
of the row group.
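To make the use of the built-in index concrete, the following HiveQL sketch declares and queries an ORC table; the table name, columns, and property values are illustrative and not taken from our workload scripts.

    -- Illustrative ORC table; the index stride sets the row-group size used for skipping.
    CREATE TABLE lineitem_orc (
      l_orderkey BIGINT,
      l_quantity DOUBLE,
      l_shipdate STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "SNAPPY",
                   "orc.row.index.stride" = "10000");

    -- With predicate push-down enabled, row groups whose [min, max] range for
    -- l_quantity does not overlap the predicate are skipped entirely.
    SET hive.optimize.index.filter=true;
    SELECT count(*) FROM lineitem_orc WHERE l_quantity > 45;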
2.3.2 Parquet File Format
The Parquet columnar format has been designed to exploit ef-
ficient compression and encoding schemes and to support nested
data. It stores data grouped together in logical horizontal parti-
tions called row groups. Every row group contains a column
chunk for each column in the table. A column chunk consists
of multiple pages and is guaranteed to be stored contiguously on
disk. Parquet’s compression and encoding schemes work at a page
level. Metadata is stored at all levels of the hierarchy, i.e., file,
column chunk, and page. Parquet readers first parse the metadata
to determine which column chunks must be accessed for a particular
query and then read each of those column chunks sequentially.
Impala makes heavy use of the Parquet storage format but does
not yet support nested columns. In Impala, all the data for each
row group are kept in a separate Parquet data file stored on HDFS,
to make sure that all the columns of a given row are stored on the
same node. Impala automatically sets the HDFS block size and
the Parquet file size to a maximum of 1 GB. In this way, I/O and
network requests apply to a large chunk of data.
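As a rough illustration (table and column names are ours, and older Impala releases spell the format keyword PARQUETFILE), a Parquet table can be created and its file size capped as follows:

    -- Cap the Parquet file size, and hence the matching HDFS block size, at 1 GB.
    SET PARQUET_FILE_SIZE=1073741824;
    CREATE TABLE lineitem_parquet (
      l_orderkey BIGINT,
      l_quantity DOUBLE,
      l_shipdate STRING)
    STORED AS PARQUET;   -- older releases: STORED AS PARQUETFILE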
3. CLUSTER EXPERIMENTS
In this section, we explore the performance of Hive and Impala
using a TPC-H like workload and two TPC-DS inspired workloads.
3.1 Hardware Configuration
For the experiments presented in this section, we use a 21-node
cluster. One node (the “control” node) hosts the HDFS NameNode
and secondary NameNode, the Impala StateStore and Catalog
processes, and the Hive Metastore. The remaining 20 nodes
are designated as “compute” nodes. Every node in the cluster has
2x Intel Xeon CPUs @ 2.20GHz, with 6x physical cores each (12
physical cores total), 11x SATA disks (2TB, 7k RPM), 1x 10 Giga-
bit Ethernet card, and 96GB of RAM. Out of the 11 SATA disks,
we use one for the operating system, while the remaining 10 disks
are used for HDFS. Each node runs 64-bit Ubuntu Linux 12.04.
3.2 Software Configuration
For our experiments, we use Hive version 0.12 (Hive-MR) and
Impala version 1.2.2 on top of Hadoop 2.0.0-cdh4.5.0. We also
use Hive version 0.13 (Hive-Tez) on top of Tez 0.3.0 and Apache
Hadoop 2.3.0.¹
We configured Hadoop to run 12 containers per
node (1 per core). The HDFS replication factor is set to 3 and the
maximum JVM heap size is set to 7.5 GB per task. We config-
ured Impala to use MySQL as the metastore. We run one impalad
process on each compute node and assign 90 GB of memory to it.
We enable short-circuit reads and block location tracking as per the
Impala instructions. Short-circuit reads allow a client to read local
data directly from the filesystem, bypassing the DataNode; since this
optimization benefits Hive as well, it is also enabled for the Hive
experiments. Runtime code generation in Impala is enabled in all
the experiments.
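For reference, the HDFS settings behind these two optimizations resemble the following hdfs-site.xml fragment; the socket path is site-specific and the values shown are typical rather than our exact configuration.

    <!-- Short-circuit reads: clients read local blocks directly from disk -->
    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/run/hadoop-hdfs/dn._PORT</value>
    </property>
    <!-- Block location tracking: expose block metadata so impalad can schedule local reads -->
    <property>
      <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
      <value>true</value>
    </property>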
¹ In the following sections, when the term Hive is used, this applies to both Hive-MR and Hive-Tez.
3.3 TPC-H like Workload
For our experiments, we use a TPC-H² like workload, which has
been used by both the Hive and Impala engineers to evaluate their
systems. The workload scripts for Hive and Impala have been
published by the developers of these systems³,⁴.
The workload contains the 22 read-only TPC-H queries but not
the refresh streams. We use a TPC-H database with a scale factor
of 1000 (approximately 1 TB of raw data). We were not able to
scale to larger TPC-H datasets because of Impala’s requirement
that the workload’s working set fit in the cluster’s aggregate
memory. However, as we will
show in our analysis, this dataset can provide significant insights
into the strengths and limitations of each system.
Although the workload scripts for both Hive and Impala are available
online, they do not take into account some of the new features
of these systems. Hive now has support for nested sub-queries. In
order to exploit this feature, we re-wrote 11 TPC-H queries (an
example rewrite is sketched below). The
remaining queries could not be further optimized since they already
consist of a single query. Impala was able to execute these re-
written queries and produce correct results, except for Query 18,
which produced incorrect results when nested and had to be split
into two queries.
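As a purely illustrative sketch (not one of the actual rewritten TPC-H queries), a two-step script that materializes an intermediate table can be collapsed into a single statement once nested sub-queries are supported:

    -- Before: two statements with a temporary table
    --   CREATE TABLE tmp AS
    --     SELECT l_orderkey, sum(l_quantity) AS q FROM lineitem GROUP BY l_orderkey;
    --   SELECT count(*) FROM tmp WHERE q > 300;
    -- After: one query with a nested sub-query
    SELECT count(*)
    FROM (SELECT l_orderkey, sum(l_quantity) AS q
          FROM lineitem
          GROUP BY l_orderkey) t
    WHERE t.q > 300;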
In Hive-MR, we enabled various optimizations including: a) op-
timization of correlated queries, b) predicate push-down and in-
dex filtering when using the ORC file format, and c) map-side join
and aggregation. When a query is split into multiple sub-queries,
the intermediate results between sub-queries are materialized into
temporary tables in HDFS using the same file format and compression
codec as the base tables. Hive, when used out of the box, typ-
ically generates a large number of reduce tasks that tend to neg-
atively affect performance. We experimented with various num-
bers of reduce tasks for the workload and found that in our envi-
ronment 120 reducers produce the best performance since all the
reduce tasks can complete in one reduce round. In Hive-Tez, we
additionally enabled the vectorized execution engine.
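For concreteness, the Hive settings below sketch how these optimizations are switched on; they are listed as an illustration of the relevant configuration knobs rather than as our exact scripts.

    SET hive.optimize.correlation=true;          -- correlated query optimization
    SET hive.optimize.index.filter=true;         -- ORC predicate push-down / index filtering
    SET hive.auto.convert.join=true;             -- map-side (broadcast) joins
    SET hive.map.aggr=true;                      -- map-side partial aggregation
    SET mapred.reduce.tasks=120;                 -- 120 reducers: one reduce round in our cluster
    SET hive.vectorized.execution.enabled=true;  -- Hive-Tez only: vectorized execution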
For Impala, we computed table and column statistics for each
table using the COMPUTE STATS statement. These statistics are
used to optimize the join order and choose the join methods.
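For example, statistics for the TPC-H tables are gathered with statements of the following form (table names shown for illustration):

    COMPUTE STATS lineitem;
    COMPUTE STATS orders;
    -- The collected statistics can be inspected with
    --   SHOW TABLE STATS lineitem;  and  SHOW COLUMN STATS lineitem;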
3.4 Data Preparation and Load Times
In this section, we describe the loading process in Hive and Im-
pala. We used the ORC file format in Hive and the Parquet file
format in Impala, which are the popular columnar formats that each
system advertises. We also experimented with the text file format
(TXT) in order to examine the performance of each system when
the data arrives in native uncompressed text format and needs to
be processed right away. We experimented with all the compres-
sion algorithms that are available for each columnar layout, namely
snappy and zlib for the ORC file format and snappy and gzip for the
Parquet format. Our results show that snappy compression provides
slightly better query performance than zlib and gzip, for the TPC-
H workload, on both file formats. For this reason, we only report
results with snappy compression in both Hive and Impala.
We generated the data files using the TPC-H generator at 1TB
scale factor and copied them into HDFS as plain text. We used a
Hive query to convert the data from the text format to the ORC
format. Similarly, we used an Impala query to load the data into
the Parquet tables.
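A hedged sketch of these conversion queries follows; table names are illustrative, the plain-text tables are assumed to already exist over the copied files, and the Impala compression option has been renamed COMPRESSION_CODEC in later releases.

    -- Hive: convert the plain-text table to ORC with snappy compression
    CREATE TABLE lineitem_orc
    STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY")
    AS SELECT * FROM lineitem_text;

    -- Impala: load the same data into a snappy-compressed Parquet table
    SET PARQUET_COMPRESSION_CODEC=snappy;
    INSERT OVERWRITE TABLE lineitem_parquet SELECT * FROM lineitem_text;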
Table 2 presents the time to load all the TPC-H tables in each
system, for each format that we used. It also shows the aggre-
²http://www.tpc.org/tpch/
³https://issues.apache.org/jira/browse/HIVE-600
⁴https://github.com/kj-ki/tpc-h-impala