Impala: A Modern, Open-Source SQL Engine for Hadoop


"Impala：一个现代、开源的Hadoop SQL引擎" Impala是由Cloudera开发的一款开源的、大规模并行处理（MPP）SQL查询引擎，专门为利用Hadoop数据处理环境的灵活性和可扩展性而设计。它旨在提供低延迟和高并发性，主要用于Hadoop上的商业智能（BI）和分析查询，这些是传统的批处理框架如Apache Hive无法提供的。 1. **介绍** Impala的创新之处在于其从一开始就针对Hadoop进行了优化，它的设计目标是将实时查询性能与Hadoop的大数据存储能力相结合。这使得用户可以通过SQL直接对Hadoop集群进行快速查询，而无需将数据导出到其他系统。 2. **架构和主要组件** - **查询处理**：Impala支持标准的SQL语法，允许用户通过简单的SQL查询来访问Hadoop中的数据。它有一个解析器来解析查询，然后生成执行计划。 - **分布式执行**：Impala采用MPP架构，这意味着查询任务被分解为多个子任务，由分布在不同节点上的进程并行执行，提高了查询速度。 - **内存优化**：Impala在内存中缓存数据，减少了磁盘I/O，从而实现更快的响应时间。 - **元数据管理**：Impala有自己的元数据服务，用于跟踪数据的分区和位置，以优化查询路由。 - **无须预编译**：与Hive不同，Impala不需要预先编译查询，这进一步减少了查询延迟。 3. **性能优势** 文档中提到了Impala相对于其他SQL-on-Hadoop系统的优越性能，这可能体现在以下几个方面： - **查询速度**：Impala通常比Hive快几个数量级，因为它避免了MapReduce的开销。 - **并发性**：Impala可以同时处理大量并发查询，适合多用户环境。 - **交互式查询**：Impala适合于交互式数据分析，用户可以快速获取结果，无需等待长时间的批处理完成。 4. **应用场景** - **实时分析**：Impala适用于需要实时或近实时分析的业务场景，如广告定向、用户行为分析等。 - **大数据BI**：它能够与BI工具（如Tableau、Looker等）无缝集成，提供快速的数据洞察。 - **数据探索**：数据科学家和分析师可以使用Impala快速探索和验证假设，无需等待数据提取和转换。 5. **生态系统集成** Impala不仅与Hadoop生态系统紧密集成，如HDFS和HBase，还与其他Cloudera产品如Hue、Sentry和Hadoop的YARN资源管理器协同工作。 6. **结论** Impala作为一个现代的SQL引擎，提供了对Hadoop数据的高效访问，是实时分析和大数据BI的重要工具。其设计和实现充分考虑了性能、可扩展性和易用性，使得用户能够在Hadoop环境中享受到与传统关系数据库类似的交互体验。

Impala: A Modern, Open-Source SQL Engine for Hadoop

Marcel Kornacker Alexander Behm Victor Bittorf Taras Bobrovytsky

Casey Ching Alan Choi Justin Erickson Martin Grund Daniel Hecht

Matthew Jacobs Ishaan Joshi Lenni Kuff Dileep Kumar Alex Leblang

Nong Li Ippokratis Pandis Henry Robinson David Rorke Silvius Rus

John Russell Dimitris Tsirogiannis Skye Wanderman-Milne Michael Yoder

Cloudera

http://impala.io/

ABSTRACT

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and briefly demonstrates its superior performance compared against other popular SQL-on-Hadoop systems.

1. INTRODUCTION

Impala (https://github.com/cloudera/impala) is an open-source, fully-integrated, state-of-the-art MPP SQL query engine designed specifically to leverage the flexibility and scalability of Hadoop. Impala’s goal is to combine the familiar SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala’s beta release was in October 2012 and it GA’ed in May 2013. The most recent version, Impala 2.0, was released in October 2014. Impala’s ecosystem momentum continues to accelerate, with nearly one million downloads since its GA.

This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well as allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2015. 7th Biennial Conference on Innovative Data Systems Research (CIDR’15), January 4-7, 2015, Asilomar, California, USA.

Unlike other systems (often forks of Postgres), Impala is a brand-new engine, written from the ground up in C++ and Java. It maintains Hadoop’s flexibility by utilizing standard components (HDFS, HBase, Metastore, YARN, Sentry) and is able to read the majority of the widely-used file formats (e.g. Parquet, Avro, RCFile). To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. The result is performance that is on par with or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload.
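To make the point about avoiding remote reads concrete, here is a minimal sketch of locality-aware work assignment: each data block is scanned by a daemon on a host that already stores a replica of it. It is a simplification under stated assumptions, not Impala's actual scheduler; the host names, `BLOCK_REPLICAS`, and `assign_local_scans` are hypothetical.

```python
from collections import defaultdict

# Hypothetical block-to-replica map: the hosts holding a copy of each data block.
BLOCK_REPLICAS = {
    "block-0": ["host-a", "host-b"],
    "block-1": ["host-b", "host-c"],
    "block-2": ["host-a", "host-c"],
    "block-3": ["host-c", "host-a"],
}


def assign_local_scans(block_replicas):
    """Assign each block to the least-loaded host that stores a replica of it."""
    load = defaultdict(int)  # number of blocks assigned to each host so far
    assignment = {}
    for block in sorted(block_replicas):
        replicas = block_replicas[block]
        # Choose among replica holders only, so every read stays local.
        # Ties on load are broken by host name to keep the result deterministic.
        host = min(replicas, key=lambda h: (load[h], h))
        assignment[block] = host
        load[host] += 1
    return assignment


assignment = assign_local_scans(BLOCK_REPLICAS)
for block, host in assignment.items():
    print(block, "->", host)
```

Because a daemon runs on every machine that stores data, a local replica holder is always available to do the scan, and no block has to cross the network before it is read.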

This paper discusses the services Impala provides to the user and then presents an overview of its architecture and main components. The highest performance that is achievable today requires using HDFS as the underlying storage manager, and therefore that is the focus of this paper; when there are notable differences in terms of how certain technical aspects are handled in conjunction with HBase, we note that in the text without going into detail.

Impala is the highest performing SQL-on-Hadoop system, especially under multi-user workloads. As Section 7 shows, for single-user queries, Impala is up to 13x faster than alternatives, and 6.7x faster on average. For multi-user queries, the gap widens: Impala is up to 27.4x faster than alternatives, and 18x faster on average, or nearly three times faster on average for multi-user queries than for single-user ones.
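The "nearly three times" figure follows from the two average speedups quoted above. A short check of the arithmetic, using only those two numbers (the variable names are illustrative):

```python
# Average speedups of Impala over the alternatives, as quoted in the text.
single_user_avg = 6.7
multi_user_avg = 18.0

# How much larger the average advantage is under multi-user load.
ratio = multi_user_avg / single_user_avg
print(f"{ratio:.2f}")  # 2.69
```

A ratio of about 2.69 is what the text rounds to "nearly three times".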

The remainder of this paper is structured as follows: the next section gives an overview of Impala from the user’s perspective and points out how it differs from a traditional RDBMS. Section 3 presents the overall architecture of the system. Section 4 presents the frontend component, which includes a cost-based distributed query optimizer; Section 5 presents the backend component, which is responsible for the query execution and employs runtime code generation; and Section 6 presents the resource/workload management component. Section 7 briefly evaluates the performance of Impala. Section 8 discusses the roadmap ahead and Section 9 concludes.

2. USER VIEW OF IMPALA

Impala is a query engine which is integrated into the Hadoop environment and utilizes a number of standard Hadoop components (Metastore, HDFS, HBase, YARN, Sentry) in order to deliver an RDBMS-like experience. However, there are some important differences that will be brought up in the remainder of this section.

Impala was specifically targeted for integration with standard business intelligence environments, and to that end supports most relevant industry standards: clients can connect via ODBC or JDBC; authentication is accomplished with Kerberos or LDAP; authorization follows the standard SQL roles and privileges (this is provided by another standard Hadoop component called Sentry [4], which also makes role-based authorization available to Hive and other components). In order to query HDFS-resident ...
