consists of multiple streams. For example, integer columns consist
of two streams: a present bit stream, which denotes whether each
value is null, and the actual data stream. The stripe footer contains
a directory of the stream locations of each column. As part of the
index data, min and max values for each column are stored, as well
as the row positions within each column. Using the built-in index,
row groups (default of 10,000 rows) can be skipped if the query
predicate does not fall within the minimum and maximum values
of the row group.
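To make the use of the built-in index concrete, the following HiveQL sketch declares and queries an ORC table; the table name, columns, and property values are illustrative and not taken from our workload scripts.

    -- Illustrative ORC table; the index stride sets the row-group size used for skipping.
    CREATE TABLE lineitem_orc (
      l_orderkey BIGINT,
      l_quantity DOUBLE,
      l_shipdate STRING)
    STORED AS ORC
    TBLPROPERTIES ("orc.compress" = "SNAPPY",
                   "orc.row.index.stride" = "10000");

    -- With predicate push-down enabled, row groups whose [min, max] range for
    -- l_quantity does not overlap the predicate are skipped entirely.
    SET hive.optimize.index.filter=true;
    SELECT count(*) FROM lineitem_orc WHERE l_quantity > 45;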
2.3.2 Parquet File Format
The Parquet columnar format has been designed to exploit ef-
ficient compression and encoding schemes and to support nested
data. It stores data grouped together in logical horizontal parti-
tions called row groups. Every row group contains a column
chunk for each column in the table. A column chunk consists
of multiple pages and is guaranteed to be stored contiguously on
disk. Parquet’s compression and encoding schemes work at a page
level. Metadata is stored at all levels of the hierarchy, i.e., file,
column chunk, and page. Parquet readers first parse the metadata
to determine which column chunks must be accessed for a particular
query and then read each of those column chunks sequentially.
Impala makes heavy use of the Parquet storage format but does
not yet support nested columns. In Impala, all the data for each
row group are kept in a separate Parquet data file stored on HDFS,
to make sure that all the columns of a given row are stored on the
same node. Impala automatically sets the HDFS block size and
the Parquet file size to a maximum of 1 GB. In this way, I/O and
network requests apply to a large chunk of data.
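As a rough illustration (table and column names are ours, and older Impala releases spell the format keyword PARQUETFILE), a Parquet table can be created and its file size capped as follows:

    -- Cap the Parquet file size, and hence the matching HDFS block size, at 1 GB.
    SET PARQUET_FILE_SIZE=1073741824;
    CREATE TABLE lineitem_parquet (
      l_orderkey BIGINT,
      l_quantity DOUBLE,
      l_shipdate STRING)
    STORED AS PARQUET;   -- older releases: STORED AS PARQUETFILE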
3. CLUSTER EXPERIMENTS
In this section, we explore the performance of Hive and Impala
using a TPC-H like workload and two TPC-DS inspired workloads.
3.1 Hardware Configuration
For the experiments presented in this section, we use a 21-node
cluster. One node (the “control” node) hosts the HDFS NameNode
and secondary NameNode, the Impala StateStore and Catalog
processes, and the Hive Metastore. The remaining 20 nodes
are designated as “compute” nodes. Every node in the cluster has
2x Intel Xeon CPUs @ 2.20GHz, with 6x physical cores each (12
physical cores total), 11x SATA disks (2TB, 7k RPM), 1x 10 Giga-
bit Ethernet card, and 96GB of RAM. Out of the 11 SATA disks,
we use one for the operating system, while the remaining 10 disks
are used for HDFS. Each node runs 64-bit Ubuntu Linux 12.04.
3.2 Software Configuration
For our experiments, we use Hive version 0.12 (Hive-MR) and
Impala version 1.2.2 on top of Hadoop 2.0.0-cdh4.5.0. We also
use Hive version 0.13 (Hive-Tez) on top of Tez 0.3.0 and Apache
Hadoop 2.3.0.¹
We configured Hadoop to run 12 containers per
node (1 per core). The HDFS replication factor is set to 3 and the
maximum JVM heap size is set to 7.5 GB per task. We config-
ured Impala to use MySQL as the metastore. We run one impalad
process on each compute node and assign 90 GB of memory to it.
We enable short-circuit reads and block location tracking as per the
Impala instructions. Short-circuit reads allow a client to read local
data directly from the filesystem, bypassing the DataNode; since this
optimization benefits Hive as well, it is also enabled for the Hive
experiments. Runtime code generation in Impala is enabled in all
the experiments.
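For reference, the HDFS settings behind these two optimizations resemble the following hdfs-site.xml fragment; the socket path is site-specific and the values shown are typical rather than our exact configuration.

    <!-- Short-circuit reads: clients read local blocks directly from disk -->
    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/run/hadoop-hdfs/dn._PORT</value>
    </property>
    <!-- Block location tracking: expose block metadata so impalad can schedule local reads -->
    <property>
      <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
      <value>true</value>
    </property>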
¹ In the following sections, when the term Hive is used, this applies to both Hive-MR and Hive-Tez.
3.3 TPC-H like Workload
For our experiments, we use a TPC-H² like workload, which has
been used by both the Hive and Impala engineers to evaluate their
systems. The workload scripts for Hive and Impala have been
published by the developers of these systems³,⁴.
The workload contains the 22 read-only TPC-H queries but not
the refresh streams. We use a TPC-H database with a scale factor
of 1000 (approximately 1 TB of raw data). We were not able to
scale to larger TPC-H datasets because of Impala’s requirement
that the workload’s working set fit in the cluster’s aggregate
memory. However, as we will
show in our analysis, this dataset can provide significant insights
into the strengths and limitations of each system.
Although the workload scripts for both Hive and Impala are available
online, they do not take into account some of the new features
of these systems. Hive now has support for nested sub-queries. In
order to exploit this feature, we re-wrote 11 TPC-H queries (an
example rewrite is sketched below). The
remaining queries could not be further optimized since they already
consist of a single query. Impala was able to execute these re-
written queries and produce correct results, except for Query 18,
which produced incorrect results when nested and had to be split
into two queries.
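As a purely illustrative sketch (not one of the actual rewritten TPC-H queries), a two-step script that materializes an intermediate table can be collapsed into a single statement once nested sub-queries are supported:

    -- Before: two statements with a temporary table
    --   CREATE TABLE tmp AS
    --     SELECT l_orderkey, sum(l_quantity) AS q FROM lineitem GROUP BY l_orderkey;
    --   SELECT count(*) FROM tmp WHERE q > 300;
    -- After: one query with a nested sub-query
    SELECT count(*)
    FROM (SELECT l_orderkey, sum(l_quantity) AS q
          FROM lineitem
          GROUP BY l_orderkey) t
    WHERE t.q > 300;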
In Hive-MR, we enabled various optimizations including: a) op-
timization of correlated queries, b) predicate push-down and in-
dex filtering when using the ORC file format, and c) map-side join
and aggregation. When a query is split into multiple sub-queries,
the intermediate results between sub-queries are materialized into
temporary tables in HDFS using the same file format and compression
codec as the base tables. Hive, when used out of the box, typ-
ically generates a large number of reduce tasks that tend to neg-
atively affect performance. We experimented with various num-
bers of reduce tasks for the workload and found that in our envi-
ronment 120 reducers produce the best performance since all the
reduce tasks can complete in one reduce round. In Hive-Tez, we
additionally enabled the vectorized execution engine.
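For concreteness, the Hive settings below sketch how these optimizations are switched on; they are listed as an illustration of the relevant configuration knobs rather than as our exact scripts.

    SET hive.optimize.correlation=true;          -- correlated query optimization
    SET hive.optimize.index.filter=true;         -- ORC predicate push-down / index filtering
    SET hive.auto.convert.join=true;             -- map-side (broadcast) joins
    SET hive.map.aggr=true;                      -- map-side partial aggregation
    SET mapred.reduce.tasks=120;                 -- 120 reducers: one reduce round in our cluster
    SET hive.vectorized.execution.enabled=true;  -- Hive-Tez only: vectorized execution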
For Impala, we computed table and column statistics for each
table using the COMPUTE STATS statement. These statistics are
used to optimize the join order and choose the join methods.
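For example, statistics for the TPC-H tables are gathered with statements of the following form (table names shown for illustration):

    COMPUTE STATS lineitem;
    COMPUTE STATS orders;
    -- The collected statistics can be inspected with
    --   SHOW TABLE STATS lineitem;  and  SHOW COLUMN STATS lineitem;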
3.4 Data Preparation and Load Times
In this section, we describe the loading process in Hive and Im-
pala. We used the ORC file format in Hive and the Parquet file
format in Impala, which are the popular columnar formats that each
system advertises. We also experimented with the text file format
(TXT) in order to examine the performance of each system when
the data arrives in native uncompressed text format and needs to
be processed right away. We experimented with all the compres-
sion algorithms that are available for each columnar layout, namely
snappy and zlib for the ORC file format and snappy and gzip for the
Parquet format. Our results show that snappy compression provides
slightly better query performance than zlib and gzip, for the TPC-
H workload, on both file formats. For this reason, we only report
results with snappy compression in both Hive and Impala.
We generated the data files using the TPC-H generator at 1TB
scale factor and copied them into HDFS as plain text. We used a
Hive query to convert the data from the text format to the ORC
format. Similarly, we used an Impala query to load the data into
the Parquet tables.
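A hedged sketch of these conversion queries follows; table names are illustrative, the plain-text tables are assumed to already exist over the copied files, and the Impala compression option has been renamed COMPRESSION_CODEC in later releases.

    -- Hive: convert the plain-text table to ORC with snappy compression
    CREATE TABLE lineitem_orc
    STORED AS ORC TBLPROPERTIES ("orc.compress" = "SNAPPY")
    AS SELECT * FROM lineitem_text;

    -- Impala: load the same data into a snappy-compressed Parquet table
    SET PARQUET_COMPRESSION_CODEC=snappy;
    INSERT OVERWRITE TABLE lineitem_parquet SELECT * FROM lineitem_text;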
Table 2 presents the time to load all the TPC-H tables in each
system, for each format that we used. It also shows the aggre-
²http://www.tpc.org/tpch/
³https://issues.apache.org/jira/browse/HIVE-600
⁴https://github.com/kj-ki/tpc-h-impala