Impala Concepts and Architecture
The following sections provide background information to help you become productive using Impala and its features.
Where appropriate, the explanations include context to help understand how aspects of Impala relate to other
technologies you might already be familiar with, such as relational database management systems and data warehouses,
or other Hadoop components such as Hive, HDFS, and HBase.
Components of the Impala Server
The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of different daemon
processes that run on specific hosts within your CDH cluster.
The Impala Daemon
The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented
by the impalad process. It reads and writes to data files; accepts queries transmitted from the impala-shell
command, Hue, JDBC, or ODBC; parallelizes the queries and distributes work across the cluster; and transmits
intermediate query results back to the central coordinator node.
You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as
the coordinator node for that query. The other nodes transmit partial results back to the coordinator, which constructs
the final result set for a query. When running experiments with functionality through the impala-shell command,
you might always connect to the same Impala daemon for convenience. For clusters running production workloads,
you might load-balance by submitting each query to a different Impala daemon in round-robin style, using the JDBC
or ODBC interfaces.
The Impala daemons are in constant communication with the statestore, to confirm which nodes are healthy and can
accept new work.
They also receive broadcast messages from the catalogd daemon (introduced in Impala 1.2) whenever any Impala
node in the cluster creates, alters, or drops any type of object, or when an INSERT or LOAD DATA statement is processed
through Impala. This background communication minimizes the need for REFRESH or INVALIDATE METADATA
statements that were needed to coordinate metadata across nodes prior to Impala 1.2.
Related information: Modifying Impala Startup Options on page 49, Starting Impala on page 48, Setting the Idle Query
and Idle Session Timeouts for impalad on page 96, Ports Used by Impala on page 638, Using Impala through a Proxy
for High Availability on page 97
The Impala Statestore
The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a
cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process
named statestored; you only need such a process on one host in the cluster. If an Impala daemon goes offline due
to hardware failure, network error, software issue, or other reason, the statestore informs all the other Impala daemons
so that future queries can avoid making requests to the unreachable node.
Because the statestore's purpose is to help when things go wrong, it is not critical to the normal operation of an Impala
cluster. If the statestore is not running or becomes unreachable, the Impala daemons continue running and distributing
work among themselves as usual; the cluster just becomes less robust if other Impala daemons fail while the statestore
is offline. When the statestore comes back online, it re-establishes communication with the Impala daemons and
resumes its monitoring function.
Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and
catalogd daemons do not have special requirements for high availability, because problems with those daemons do
not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the
Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and
restart the Impala service.
Impala Guide | 19
Impala Concepts and Architecture