Cloudera Impala Guide: Fast Hadoop Data Analysis

Updated 2024-07-20 · PDF, 6.97 MB

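"Impala Guide"

Cloudera Impala is a fast, interactive SQL query system for Apache Hadoop data. It queries data directly where it is stored: in HDFS (the Hadoop Distributed File System), HBase, or Amazon S3. Besides sharing Hadoop's storage platform, Impala uses the same metadata, SQL syntax (based on Hive SQL), ODBC driver, and user interface (the Impala query UI in Hue) as Hive, so users can move between real-time and batch-oriented queries on one familiar platform.

Impala was designed to address the slow analytic queries of traditional Hadoop systems. Its query execution engine and memory management are optimized for low-latency queries over large data sets, which makes it suitable for big data analytics and business intelligence workloads. Because Impala is highly compatible with Hive, organizations that have already invested in Hive can get faster queries without large-scale rework. Users write SQL queries that run across a distributed cluster and take advantage of multi-node parallelism. In Impala's architecture, the Impala daemon that receives a query acts as its coordinator: it parses and plans the query and distributes work to the Impala daemons running on the cluster's DataNodes, which perform the actual data processing.

Impala supports several file formats, including Parquet, Avro, text, and SequenceFile. Parquet, with its columnar layout and compression, usually gives the best query performance. Impala also supports complex query operations such as JOIN, GROUP BY, and window functions, which matter for complex analysis tasks. For security, Impala integrates with Kerberos for authentication. It also works alongside other Hadoop components such as Hive, HBase, and Sentry (which provides fine-grained access control) to form a complete data processing and analysis ecosystem.

Although Impala and Hive are similar in many respects, they have different design goals: Impala focuses on interactive queries, while Hive is better suited to long-running batch jobs. The choice between them usually depends on the use case and its performance requirements.

As a member of the Hadoop ecosystem, Impala gives big data analytics a fast-responding SQL query capability, letting organizations process and analyze very large data sets more flexibly while staying compatible with existing Hadoop tools.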

Issues Fixed in the 2.1.4 Release / CDH 5.3.4, page 724
Issues Fixed in the 2.1.3 Release / CDH 5.3.3, page 725
Issues Fixed in the 2.1.2 Release / CDH 5.3.2, page 726
Issues Fixed in the 2.1.1 Release / CDH 5.3.1, page 727
Issues Fixed in the 2.1.0 Release / CDH 5.3.0, page 727
Issues Fixed in the 2.0.5 Release / CDH 5.2.6, page 728
Issues Fixed in the 2.0.4 Release / CDH 5.2.5, page 728
Issues Fixed in the 2.0.3 Release / CDH 5.2.4, page 728
Issues Fixed in the 2.0.2 Release / CDH 5.2.3, page 729
Issues Fixed in the 2.0.1 Release / CDH 5.2.1, page 729
Issues Fixed in the 2.0.0 Release / CDH 5.2.0, page 730
Issues Fixed in the 1.4.4 Release / CDH 5.1.5, page 731
Issues Fixed in the 1.4.3 Release / CDH 5.1.4, page 731
Issues Fixed in the 1.4.2 Release / CDH 5.1.3, page 732
Issues Fixed in the 1.4.1 Release / CDH 5.1.2, page 732
Issues Fixed in the 1.4.0 Release / CDH 5.1.0, page 733
Issues Fixed in the 1.3.3 Release / CDH 5.0.5, page 734
Issues Fixed in the 1.3.2 Release / CDH 5.0.4, page 734
Issues Fixed in the 1.3.1 Release / CDH 5.0.3, page 735
Issues Fixed in the 1.3.0 Release / CDH 5.0.0, page 736
Issues Fixed in the 1.2.4 Release, page 738
Issues Fixed in the 1.2.3 Release, page 739
Issues Fixed in the 1.2.2 Release, page 739
Issues Fixed in the 1.2.1 Release, page 740
Issues Fixed in the 1.2.0 Beta Release, page 741
Issues Fixed in the 1.1.1 Release, page 741
Issues Fixed in the 1.1.0 Release, page 742
Issues Fixed in the 1.0.1 Release, page 742
Issues Fixed in the 1.0 GA Release, page 744
Issues Fixed in Version 0.7 of the Beta Release, page 746
Issues Fixed in Version 0.6 of the Beta Release, page 747
Issues Fixed in Version 0.5 of the Beta Release, page 748
Issues Fixed in Version 0.4 of the Beta Release, page 748
Issues Fixed in Version 0.3 of the Beta Release, page 749
Issues Fixed in Version 0.2 of the Beta Release, page 749

Introducing Apache Impala (incubating)

Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Amazon Simple Storage Service (S3). In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive. This provides a familiar and unified platform for real-time or batch-oriented queries.

Impala is an addition to the tools available for querying big data. Impala does not replace the batch processing frameworks built on MapReduce, such as Hive. Hive and other frameworks built on MapReduce are best suited for long-running batch jobs, such as Extract, Transform, and Load (ETL) jobs.

Note: Impala was accepted into the Apache incubator on December 2, 2015. In places where the documentation formerly referred to “Cloudera Impala”, now the official name is “Apache Impala (incubating)”.

Impala Benefits

Impala provides:

• Familiar SQL interface that data scientists and analysts already know.

• Ability to query high volumes of data (“big data”) in Apache Hadoop.

• Distributed queries in a cluster environment, for convenient scaling and to make use of cost-effective commodity hardware.

• Ability to share data files between different components with no copy or export/import step; for example, to write with Pig, transform with Hive and query with Impala. Impala can read from and write to Hive tables, enabling simple data interchange using Impala for analytics on Hive-produced data.

• Single system for big data processing and analytics, so customers can avoid costly modeling and ETL just for analytics.

How Impala Works with CDH

The following graphic illustrates how Impala is positioned in the broader Cloudera environment:

The Impala solution is composed of the following components:

• Clients - Entities including Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala. These interfaces are typically used to issue queries or complete administrative tasks such as connecting to Impala.

• Hive Metastore - Stores information about the data available to Impala. For example, the metastore lets Impala know what databases are available and what the structure of those databases is. As you create, drop, and alter schema objects, load data into tables, and so on through Impala SQL statements, the relevant metadata changes are automatically broadcast to all Impala nodes by the dedicated catalog service introduced in Impala 1.2.

• Impala - This process, which runs on DataNodes, coordinates and executes queries. Each instance of Impala can receive, plan, and coordinate queries from Impala clients. Queries are distributed among Impala nodes, and these nodes then act as workers, executing parallel query fragments.

• HBase and HDFS - Storage for data to be queried.

Queries executed using Impala are handled as follows:

1. User applications send SQL queries to Impala through ODBC or JDBC, which provide standardized querying interfaces. The user application may connect to any impalad in the cluster. This impalad becomes the coordinator for the query.

2. Impala parses the query and analyzes it to determine what tasks need to be performed by impalad instances across the cluster. Execution is planned for optimal efficiency.

3. Services such as HDFS and HBase are accessed by local impalad instances to provide data.

4. Each impalad returns data to the coordinating impalad, which sends these results to the client.
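
The four steps above follow a scatter-gather pattern, which can be sketched in a few lines of Python. This is a toy model, not Impala code: the host names, the `LOCAL_BLOCKS` data, and the functions `execute_fragment` and `coordinate` are all hypothetical, and the "query" is hard-wired to the equivalent of a `COUNT(*) ... GROUP BY` over single-letter values.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the data blocks local to each impalad host.
LOCAL_BLOCKS = {
    "impalad-1": ["a", "b", "a"],
    "impalad-2": ["b", "c"],
    "impalad-3": ["a", "c", "c"],
}


def execute_fragment(host):
    """Worker step: scan only the blocks local to `host` and return a partial count."""
    return Counter(LOCAL_BLOCKS[host])


def coordinate(coordinator, hosts):
    """Coordinator step: fan the fragment out to every host, then merge the partials."""
    if coordinator not in hosts:
        raise ValueError("the coordinator must be one of the impalad hosts")
    total = Counter()
    for host in hosts:  # a real engine would run these fragments in parallel
        total += execute_fragment(host)
    return dict(sorted(total.items()))


# Any host can be the coordinator; here the client happened to connect to impalad-2.
result = coordinate("impalad-2", list(LOCAL_BLOCKS))
print(result)  # {'a': 3, 'b': 2, 'c': 3}
```

The point of the sketch is the division of labor: each worker touches only its local data, and only the small partial results travel back to the coordinator.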

Primary Impala Features

Impala provides support for:

• Most common SQL-92 features of Hive Query Language (HiveQL) including SELECT, joins, and aggregate functions.

• HDFS, HBase, and Amazon Simple Storage Service (S3) storage, including:

– HDFS file formats: Text file, SequenceFile, RCFile, Avro file, and Parquet.

– Compression codecs: Snappy, GZIP, Deflate, BZIP.

• Common data access interfaces including:

– JDBC driver.

– ODBC driver.

– Hue Beeswax and the Impala Query UI.

• Impala command-line interface.

• Kerberos authentication.

Impala Concepts and Architecture

The following sections provide background information to help you become productive using Impala and its features. Where appropriate, the explanations include context to help understand how aspects of Impala relate to other technologies you might already be familiar with, such as relational database management systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase.

Components of the Impala Server

The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of different daemon processes that run on specific hosts within your CDH cluster.

The Impala Daemon

The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented by the impalad process. It reads and writes data files; accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC; parallelizes the queries and distributes work across the cluster; and transmits intermediate query results back to the central coordinator node.

You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the coordinator node for that query. The other nodes transmit partial results back to the coordinator, which constructs the final result set for a query. When running experiments with functionality through the impala-shell command, you might always connect to the same Impala daemon for convenience. For clusters running production workloads, you might load-balance by submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
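
As a minimal sketch of the round-robin idea, the following Python class hands out impalad hosts in rotation. The class name `RoundRobinPicker` and the host names are hypothetical; the sketch only chooses a host and leaves the actual JDBC or ODBC connection to whatever client library you use.

```python
from itertools import cycle


class RoundRobinPicker:
    """Hand out impalad hosts in rotation so each takes turns as coordinator."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("at least one impalad host is required")
        self._hosts = cycle(list(hosts))

    def next_host(self):
        """Return the host that should coordinate the next query."""
        return next(self._hosts)


# Hypothetical host names; substitute the impalad hosts of your own cluster.
picker = RoundRobinPicker(["impalad-1", "impalad-2", "impalad-3"])
chosen = [picker.next_host() for _ in range(5)]
print(chosen)
# ['impalad-1', 'impalad-2', 'impalad-3', 'impalad-1', 'impalad-2']
```

Client-side rotation like this is only one option. The guide also covers running Impala behind a proxy for high availability, which moves the same decision out of the application.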

The Impala daemons are in constant communication with the statestore, to confirm which nodes are healthy and can accept new work.

They also receive broadcast messages from the catalogd daemon (introduced in Impala 1.2) whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an INSERT or LOAD DATA statement is processed through Impala. This background communication minimizes the need for REFRESH or INVALIDATE METADATA statements that were needed to coordinate metadata across nodes prior to Impala 1.2.

Related information: Modifying Impala Startup Options on page 49, Starting Impala on page 48, Setting the Idle Query and Idle Session Timeouts for impalad on page 96, Ports Used by Impala on page 638, Using Impala through a Proxy for High Availability on page 97

The Impala Statestore

The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored; you only need such a process on one host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable node.

Because the statestore's purpose is to help when things go wrong, it is not critical to the normal operation of an Impala cluster. If the statestore is not running or becomes unreachable, the Impala daemons continue running and distributing work among themselves as usual; the cluster just becomes less robust if other Impala daemons fail while the statestore is offline. When the statestore comes back online, it re-establishes communication with the Impala daemons and resumes its monitoring function.
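
The behavior described above can be illustrated with a toy membership tracker in Python. Everything here is an assumption made for illustration: the `Statestore` class, its heartbeat-and-timeout mechanism, and the host names are hypothetical and are not taken from the statestored implementation.

```python
class Statestore:
    """Toy model: tracks which daemons are healthy and tells the healthy ones."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}    # host -> time of its last heartbeat
        self.subscribers = {}  # host -> that daemon's current view of healthy hosts

    def register(self, host, now):
        """A daemon joins the cluster and subscribes to membership updates."""
        self.last_seen[host] = now
        self.subscribers[host] = set()
        self._broadcast(now)

    def heartbeat(self, host, now):
        """Record that `host` was reachable at time `now`."""
        self.last_seen[host] = now

    def check(self, now):
        """Recompute membership and relay it to every healthy daemon."""
        self._broadcast(now)

    def _broadcast(self, now):
        healthy = {h for h, t in self.last_seen.items() if now - t <= self.timeout_s}
        for host in healthy:
            self.subscribers[host] = set(healthy)


store = Statestore(timeout_s=10)
for host in ("impalad-1", "impalad-2", "impalad-3"):
    store.register(host, now=0)
store.heartbeat("impalad-1", now=8)
store.heartbeat("impalad-2", now=8)
store.check(now=15)  # impalad-3 has been silent for 15 s and drops out
print(sorted(store.subscribers["impalad-1"]))  # ['impalad-1', 'impalad-2']
```

If `check` is never called again, each daemon simply keeps its last known view. That mirrors the point in the text: the daemons keep working while the statestore is offline, but they stop learning about new failures.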

Most considerations for load balancing and high availability apply to the impalad daemon. The statestored and catalogd daemons do not have special requirements for high availability, because problems with those daemons do not result in data loss. If those daemons become unavailable due to an outage on a particular host, you can stop the Impala service, delete the Impala StateStore and Impala Catalog Server roles, add the roles on a different host, and restart the Impala service.

Related information: Scalability Considerations for the Impala Statestore on page 560, Modifying Impala Startup Options on page 49, Starting Impala on page 48, Increasing the Statestore Timeout on page 96, Ports Used by Impala on page 638

The Impala Catalog Service

The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the DataNodes in a cluster. It is physically represented by a daemon process named catalogd; you only need such a process on one host in the cluster. Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host.

The catalog service avoids the need to issue REFRESH and INVALIDATE METADATA statements when the metadata changes are performed by statements issued through Impala. When you create a table, load data, and so on through Hive, you do need to issue REFRESH or INVALIDATE METADATA on an Impala node before executing a query there.

This feature touches a number of aspects of Impala:

• See Installing Impala on page 30, Upgrading Impala on page 44 and Starting Impala on page 48, for usage information for the catalogd daemon.

• The REFRESH and INVALIDATE METADATA statements are not needed when the CREATE TABLE, INSERT, or other table-changing or data-changing operation is performed through Impala. These statements are still needed if such operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the statements only need to be issued on one Impala node rather than on all nodes. See REFRESH Statement on page 302 and INVALIDATE METADATA Statement on page 296 for the latest usage information for those statements.
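
The rule in the list above can be written down as a small decision helper. This Python sketch is illustrative only: the function `metadata_statement`, its parameters, and the table name `sales.orders` are hypothetical. It also assumes the usual division of labor between the two statements (INVALIDATE METADATA for a table Impala has not seen yet, REFRESH for new data in a table Impala already knows), and naming a table in INVALIDATE METADATA requires Impala 1.2.4 or higher.

```python
def metadata_statement(changed_via, table, table_is_new=False):
    """Return the statement to run on one Impala node, or None if none is needed."""
    if changed_via == "impala":
        # catalogd broadcasts changes made through Impala to the other nodes.
        return None
    if changed_via in ("hive", "hdfs"):
        if table_is_new:
            # Assumption: a table created outside Impala is not yet in its catalog.
            return f"INVALIDATE METADATA {table}"
        # Assumption: the table is known and only its data files changed.
        return f"REFRESH {table}"
    raise ValueError(f"unknown change origin: {changed_via!r}")


print(metadata_statement("impala", "sales.orders"))
# None
print(metadata_statement("hive", "sales.orders", table_is_new=True))
# INVALIDATE METADATA sales.orders
print(metadata_statement("hdfs", "sales.orders"))
# REFRESH sales.orders
```

The returned string would be issued once, on a single Impala node, before querying the table there.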

By default, the metadata loading and caching on startup happens asynchronously, so Impala can begin accepting requests promptly. To enable the original behavior, where Impala waited until all metadata was loaded before accepting any requests, set the catalogd configuration option --load_catalog_in_background=false.

Note: In Impala 1.2.4 and higher, you can specify a table name with INVALIDATE METADATA after the table is created in Hive, allowing you to make individual tables visible to Impala without doing a full reload of the catalog metadata. Impala 1.2.4 also includes other changes to make the metadata broadcast mechanism faster and more responsive, especially during Impala startup. See New Features in Impala 1.2.4 on page 678 for details.

Related information: Modifying Impala Startup Options on page 49, Starting Impala on page 48, Ports Used by Impala on page 638

Developing Impala Applications

The core development language with Impala is SQL. You can also use Java or other languages to interact with Impala through the standard JDBC and ODBC interfaces used by many business intelligence tools. For specialized kinds of analysis, you can supplement the SQL built-in functions by writing user-defined functions (UDFs) in C++ or Java.

Overview of the Impala SQL Dialect

The Impala SQL dialect is highly compatible with the SQL syntax used in the Apache Hive component (HiveQL). As such, it is familiar to users who already run SQL queries on the Hadoop infrastructure. Currently, Impala
