can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a
different host, and restart the Impala service.
Note:
In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table is created
in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata.
Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive,
especially during Impala startup. See New Features in Impala 1.2.4 on page 739 for details.
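For example, assuming a table named new_hive_table has just been created through Hive, the following statements issued through impala-shell make that single table visible and query it; the table name is a placeholder for illustration:
  -- Load metadata for one new table instead of reloading the whole catalog.
  INVALIDATE METADATA new_hive_table;
  SELECT COUNT(*) FROM new_hive_table;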
Related information: Modifying Impala Startup Options on page 32, Starting Impala on page 31, Ports
Used by Impala on page 715
Developing Impala Applications
The core development language for Impala is SQL. You can also use Java or other languages to interact with Impala
through the standard JDBC and ODBC interfaces used by many business intelligence tools. For specialized kinds of
analysis, you can supplement the SQL built-in functions by writing user-defined functions (UDFs) in C++ or Java.
Overview of the Impala SQL Dialect
The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL).
As such, it should be familiar to users who already run SQL queries on Hadoop infrastructure.
Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in functions. Impala also
includes additional built-in functions for common industry features, to simplify porting SQL from non-Hadoop
systems.
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the
SQL dialect might seem familiar:
• The SELECT statement includes familiar clauses such as WHERE, GROUP BY, ORDER BY, and WITH. You
will find familiar notions such as joins; built-in functions for processing strings, numbers, and dates; aggregate
functions; subqueries; and comparison operators such as IN() and BETWEEN. The SELECT statement is the
place where SQL standards compliance is most important.
• From the data warehousing world, you will recognize the notion of partitioned tables. One or more columns
serve as partition keys, and the data is physically arranged so that queries that refer to the partition key columns
in the WHERE clause can skip partitions that do not match the filter conditions. For example, if you have 10 years'
worth of data and use a clause such as WHERE year = 2015, WHERE year > 2010, or WHERE year IN
(2014, 2015), Impala skips all the data for non-matching years, greatly reducing the amount of I/O for the
query. (See the sketch after this list.)
• In Impala 1.2 and higher, UDFs let you perform custom comparisons and transformation logic during SELECT
and INSERT...SELECT statements; a declaration sketch follows this list.
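The following sketch illustrates partition pruning; the table, column names, and data layout are hypothetical:
  -- The year column serves as the partition key; data files are physically
  -- arranged into one directory per year.
  CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT);

  -- Only the data files for the year=2015 partition are read; all other
  -- partitions are skipped, reducing I/O.
  SELECT SUM(amount) FROM sales WHERE year = 2015;
Likewise, a C++ UDF compiled into a shared library can be declared for use in SQL statements; the function name, library path, and symbol below are placeholders:
  CREATE FUNCTION normalize_name(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libudfs.so' SYMBOL='NormalizeName';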
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the
SQL dialect might require some learning and practice for you to become proficient in the Hadoop environment:
• Impala SQL is focused on queries and includes relatively little DML. There is no UPDATE or DELETE statement.
Stale data is typically discarded (by DROP TABLE or ALTER TABLE ... DROP PARTITION statements) or
replaced (by INSERT OVERWRITE statements).
• All data creation is done by INSERT statements, which typically insert data in bulk by querying from other tables.
There are two variations: INSERT INTO, which appends to the existing data, and INSERT OVERWRITE, which
replaces the entire contents of a table or partition (similar to TRUNCATE TABLE followed by a new INSERT).
Although there is an INSERT ... VALUES syntax to create a small number of values in a single statement, it is
far more efficient to use INSERT ... SELECT to copy and transform large amounts of data from one table
to another in a single operation.
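A sketch of the idioms from this bullet and the previous one, with hypothetical table, column, and partition names:
  -- Discard stale data by dropping an entire partition.
  ALTER TABLE logs DROP PARTITION (year = 2010);

  -- Append new data in bulk by querying another table.
  INSERT INTO logs PARTITION (year = 2015)
    SELECT id, msg FROM staged_logs;

  -- Replace the entire contents of a partition.
  INSERT OVERWRITE logs PARTITION (year = 2015)
    SELECT id, msg FROM corrected_logs;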
• You often construct Impala table definitions and data files in some other environment, and then attach Impala so
that it can run real-time queries. The same data files and table metadata are shared with other components of the
Hadoop ecosystem. In particular, Impala can access tables created by Hive or data inserted by Hive, and Hive can