can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a
different host, and restart the Impala service.
Note:
In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table is created
in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata.
Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive,
especially during Impala startup. See New Features in Impala 1.2.4 on page 739 for details.
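For example, assuming a table named new_hive_table has just been created through Hive, the following statements issued through impala-shell make that single table visible and query it; the table name is a placeholder for illustration:
  -- Load metadata for one new table instead of reloading the whole catalog.
  INVALIDATE METADATA new_hive_table;
  SELECT COUNT(*) FROM new_hive_table;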
Related information: Modifying Impala Startup Options on page 32, Starting Impala on page 31, Ports
Used by Impala on page 715
Developing Impala Applications
The core development language for Impala is SQL. You can also use Java or other languages to interact with Impala
through the standard JDBC and ODBC interfaces used by many business intelligence tools. For specialized kinds of
analysis, you can supplement the SQL built-in functions by writing user-defined functions (UDFs) in C++ or Java.
Overview of the Impala SQL Dialect
The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL).
As such, it should be familiar to users who already run SQL queries on Hadoop infrastructure.
Currently, Impala SQL supports a subset of HiveQL statements, data types, and built-in functions. Impala also
includes additional built-in functions for common industry features, to simplify porting SQL from non-Hadoop
systems.
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the
SQL dialect might seem familiar:
• The SELECT statement includes familiar clauses such as WHERE, GROUP BY, ORDER BY, and WITH. You
will find familiar notions such as joins; built-in functions for processing strings, numbers, and dates; aggregate
functions; subqueries; and comparison operators such as IN() and BETWEEN. The SELECT statement is the
place where SQL standards compliance is most important.
• From the data warehousing world, you will recognize the notion of partitioned tables. One or more columns
serve as partition keys, and the data is physically arranged so that queries that refer to the partition key columns
in the WHERE clause can skip partitions that do not match the filter conditions. For example, if you have 10 years'
worth of data and use a clause such as WHERE year = 2015, WHERE year > 2010, or WHERE year IN
(2014, 2015), Impala skips all the data for non-matching years, greatly reducing the amount of I/O for the
query. (See the sketch after this list.)
• In Impala 1.2 and higher, UDFs let you perform custom comparisons and transformation logic during SELECT
and INSERT...SELECT statements; a declaration sketch follows this list.
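The following sketch illustrates partition pruning; the table, column names, and data layout are hypothetical:
  -- The year column serves as the partition key; data files are physically
  -- arranged into one directory per year.
  CREATE TABLE sales (id BIGINT, amount DOUBLE)
  PARTITIONED BY (year INT);

  -- Only the data files for the year=2015 partition are read; all other
  -- partitions are skipped, reducing I/O.
  SELECT SUM(amount) FROM sales WHERE year = 2015;
Likewise, a C++ UDF compiled into a shared library can be declared for use in SQL statements; the function name, library path, and symbol below are placeholders:
  CREATE FUNCTION normalize_name(STRING) RETURNS STRING
  LOCATION '/user/impala/udfs/libudfs.so' SYMBOL='NormalizeName';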
For users coming to Impala from traditional database or data warehousing backgrounds, the following aspects of the
SQL dialect might require some learning and practice for you to become proficient in the Hadoop environment:
• Impala SQL is focused on queries and includes relatively little DML. There is no UPDATE or DELETE statement.
Stale data is typically discarded (by DROP TABLE or ALTER TABLE ... DROP PARTITION statements) or
replaced (by INSERT OVERWRITE statements).
• All data creation is done by INSERT statements, which typically insert data in bulk by querying from other tables.
There are two variations: INSERT INTO, which appends to the existing data, and INSERT OVERWRITE, which
replaces the entire contents of a table or partition (similar to TRUNCATE TABLE followed by a new INSERT).
Although there is an INSERT ... VALUES syntax to create a small number of values in a single statement, it is
far more efficient to use INSERT ... SELECT to copy and transform large amounts of data from one table
to another in a single operation.
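A sketch of the idioms from this bullet and the previous one, with hypothetical table, column, and partition names:
  -- Discard stale data by dropping an entire partition.
  ALTER TABLE logs DROP PARTITION (year = 2010);

  -- Append new data in bulk by querying another table.
  INSERT INTO logs PARTITION (year = 2015)
    SELECT id, msg FROM staged_logs;

  -- Replace the entire contents of a partition.
  INSERT OVERWRITE logs PARTITION (year = 2015)
    SELECT id, msg FROM corrected_logs;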
• You often construct Impala table definitions and data files in some other environment, and then attach Impala so
that it can run real-time queries. The same data files and table metadata are shared with other components of the
Hadoop ecosystem. In particular, Impala can access tables created by Hive or data inserted by Hive, and Hive can