Impala: A Modern, Open-Source SQL Engine for Hadoop
Marcel Kornacker Alexander Behm Victor Bittorf Taras Bobrovytsky
Casey Ching Alan Choi Justin Erickson Martin Grund Daniel Hecht
Matthew Jacobs Ishaan Joshi Lenni Kuff Dileep Kumar Alex Leblang
Nong Li Ippokratis Pandis Henry Robinson David Rorke Silvius Rus
John Russell Dimitris Tsirogiannis Skye Wanderman-Milne Michael Yoder
Cloudera
http://impala.io/
ABSTRACT
Cloudera Impala is a modern, open-source MPP SQL en-
gine architected from the ground up for the Hadoop data
processing environment. Impala provides low latency and
high concurrency for BI/analytic read-mostly queries on
Hadoop, not delivered by batch frameworks such as Apache
Hive. This paper presents Impala from a user’s perspective,
gives an overview of its architecture and main components
and briefly demonstrates its superior performance compared
against other popular SQL-on-Hadoop systems.
1. INTRODUCTION
Impala is an open-source¹, fully-integrated, state-of-the-art
MPP SQL query engine designed specifically to leverage
the flexibility and scalability of Hadoop. Impala’s goal is
to combine the familiar SQL support and multi-user perfor-
mance of a traditional analytic database with the scalability
and flexibility of Apache Hadoop and the production-grade
security and management extensions of Cloudera Enterprise.
Impala’s beta release was in October 2012 and it became generally
available (GA) in May 2013. The most recent version, Impala 2.0, was released
in October 2014. Impala’s ecosystem momentum continues
to accelerate, with nearly one million downloads since its
GA.
Unlike other systems (often forks of Postgres), Impala is a
brand-new engine, written from the ground up in C++ and
Java. It maintains Hadoop’s flexibility by utilizing standard
components (HDFS, HBase, Metastore, YARN, Sentry) and
is able to read the majority of the widely-used file formats
(e.g. Parquet, Avro, RCFile). To reduce latency, such as
that incurred from utilizing MapReduce or by reading data
remotely, Impala implements a distributed architecture based
on daemon processes that are responsible for all aspects of
query execution and that run on the same machines as the
rest of the Hadoop infrastructure. The result is performance
that is on par with or exceeds that of commercial MPP analytic
DBMSs, depending on the particular workload.
¹ https://github.com/cloudera/impala
This article is published under a Creative Commons Attribution
License (http://creativecommons.org/licenses/by/3.0/), which permits
distribution and reproduction in any medium as well as allowing derivative
works, provided that you attribute the original work to the author(s) and
CIDR 2015.
7th Biennial Conference on Innovative Data Systems Research (CIDR’15)
January 4-7, 2015, Asilomar, California, USA.
This paper discusses the services Impala provides to the
user and then presents an overview of its architecture and
main components. The highest performance that is achiev-
able today requires using HDFS as the underlying storage
manager, and therefore that is the focus of this paper; when
there are notable differences in terms of how certain technical
aspects are handled in conjunction with HBase, we note that
in the text without going into detail.
Impala is the highest performing SQL-on-Hadoop system,
especially under multi-user workloads. As Section 7 shows,
for single-user queries, Impala is up to 13x faster than alter-
natives, and 6.7x faster on average. For multi-user queries,
the gap widens: Impala is up to 27.4x faster than alternatives,
and 18x faster on average – or nearly three times faster on
average for multi-user queries than for single-user ones.
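The closing comparison follows from the two average speedups quoted above. As a quick check (the symbols s_single and s_multi are introduced here only for illustration and do not appear in the paper):

```latex
\[
\frac{s_{\text{multi}}}{s_{\text{single}}} = \frac{18}{6.7} \approx 2.7
\]
```

In other words, Impala's average advantage over the alternatives is roughly 2.7 times larger under multi-user load than under single-user load, which is the sense in which the gap is "nearly three times" wider.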
The remainder of this paper is structured as follows: the
next section gives an overview of Impala from the user’s
perspective and points out how it differs from a traditional
RDBMS. Section 3 presents the overall architecture of the
system. Section 4 presents the frontend component, which
includes a cost-based distributed query optimizer; Section 5
presents the backend component, which is responsible for
query execution and employs runtime code generation; and
Section 6 presents the resource/workload management
component. Section 7 briefly evaluates the performance of
Impala. Section 8 discusses the roadmap ahead and Section 9
concludes.
2. USER VIEW OF IMPALA
Impala is a query engine which is integrated into the
Hadoop environment and utilizes a number of standard
Hadoop components (Metastore, HDFS, HBase, YARN, Sen-
try) in order to deliver an RDBMS-like experience. However,
there are some important differences that will be brought up
in the remainder of this section.
Impala was specifically targeted for integration with stan-
dard business intelligence environments, and to that end
supports most relevant industry standards: clients can con-
nect via ODBC or JDBC; authentication is accomplished
with Kerberos or LDAP; authorization follows the standard
SQL roles and privileges².
² This is provided by another standard Hadoop component
called Sentry [4], which also makes role-based authorization
available to Hive and other components.
In order to query HDFS-resident