Facebook Presto: An Adaptable Distributed SQL Query Engine

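Presto-SQL-on-Everything.pdf is an in-depth research paper on Presto, an open-source distributed query engine. Presto is used at Facebook to support SQL analytics workloads and is designed to be efficient, flexible, and scalable. Its use cases range from user-facing reporting applications that require sub-second latency to long-running ETL (extract, transform, load) jobs that aggregate and join very large volumes of data. The authors, Raghav Sethi et al., emphasize Presto's broad applicability: it can connect to many kinds of data sources, including Hadoop data warehouses, relational database management systems (RDBMSs), NoSQL stores, and stream processing systems. This is made possible by a plugin architecture in which developers provide high-performance I/O interfaces through the Connector API. The paper describes several ways Facebook uses Presto in production, then examines the architecture and implementation, focusing on the design decisions that matter most for performance, such as query planning, distributed execution, and memory management. It closes with performance evaluation results that show how those design decisions affect efficiency and responsiveness across a range of workloads. The paper is a useful reference for understanding how distributed query engines are built and optimized, and how to choose and use such a tool in a production environment.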

by evaluating queue policies, parsing and analyzing the SQL text, and creating and optimizing a distributed execution plan. The coordinator distributes this plan to workers, starts execution of tasks, and then begins to enumerate splits, which are opaque handles to an addressable chunk of data in an external storage system. Splits are assigned to the tasks responsible for reading this data.

Worker nodes running these tasks process these splits by fetching data from external systems, or process intermediate results produced by other workers. Workers use co-operative multi-tasking to process tasks from many queries concurrently. Execution is pipelined as much as possible, and data flows between tasks as it becomes available. For certain query shapes, Presto is capable of returning results before all the data is processed. Intermediate data and state are stored in-memory whenever possible. When shuffling data between nodes, buffering is tuned for minimal latency.
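
To make the co-operative multi-tasking idea concrete, here is a minimal Python sketch of a round-robin scheduler in which each task voluntarily yields after processing one split. The names `task` and `run_cooperatively` are illustrative assumptions, not Presto's scheduler, and the real engine's yielding policy is not described in this excerpt.

```python
from collections import deque
from typing import Deque, Iterator, List


def task(name: str, splits: List[str]) -> Iterator[str]:
    """Hypothetical task: yields control after processing each split."""
    for split in splits:
        yield f"{name} processed {split}"


def run_cooperatively(tasks: List[Iterator[str]]) -> List[str]:
    """Round-robin: run one step of a task, then move it to the back of the queue."""
    queue: Deque[Iterator[str]] = deque(tasks)
    log: List[str] = []
    while queue:
        current = queue.popleft()
        try:
            log.append(next(current))
        except StopIteration:
            continue  # task finished, so it is not requeued
        queue.append(current)
    return log


# Tasks from two different queries interleave on the same worker.
log = run_cooperatively([task("q1", ["s1", "s2"]), task("q2", ["s3"])])
print(log)
```

Because every task gives up control at a split boundary, a long-running query cannot monopolize the worker while shorter queries wait.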

Presto is designed to be extensible, and provides a versatile plugin interface. Plugins can provide custom data types, functions, access control implementations, event consumers, queuing policies, and configuration properties. More importantly, plugins also provide connectors, which enable Presto to communicate with external data stores through the Connector API, which is composed of four parts: the Metadata API, Data Location API, Data Source API, and Data Sink API. These APIs are designed to allow performant implementations of connectors within the environment of a physically distributed execution engine. Developers have contributed over a dozen connectors to the main Presto repository, and we are aware of several proprietary connectors.
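
The four-part split of the Connector API can be pictured with a toy Python sketch. Every class and method name below is a hypothetical stand-in chosen for illustration; none of them is the actual Presto SPI, which is written in Java and has a much richer surface.

```python
from typing import Dict, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]


class Metadata(Protocol):          # hypothetical: what tables and columns exist
    def list_tables(self) -> List[str]: ...
    def columns(self, table: str) -> List[str]: ...


class DataLocation(Protocol):      # hypothetical: where the data lives (splits)
    def splits(self, table: str) -> List[str]: ...


class DataSource(Protocol):        # hypothetical: read the rows behind one split
    def read(self, split: str) -> Iterator[Row]: ...


class DataSink(Protocol):          # hypothetical: write rows back to the store
    def write(self, table: str, rows: List[Row]) -> None: ...


class InMemoryConnector:
    """Toy connector filling all four roles; each write becomes one split."""

    def __init__(self) -> None:
        self._schemas: Dict[str, List[str]] = {}
        self._chunks: Dict[str, List[Row]] = {}  # split id -> rows

    def create_table(self, table: str, columns: List[str]) -> None:
        self._schemas[table] = list(columns)

    def list_tables(self) -> List[str]:
        return sorted(self._schemas)

    def columns(self, table: str) -> List[str]:
        return self._schemas[table]

    def splits(self, table: str) -> List[str]:
        return sorted(s for s in self._chunks if s.startswith(table + "#"))

    def read(self, split: str) -> Iterator[Row]:
        return iter(self._chunks[split])

    def write(self, table: str, rows: List[Row]) -> None:
        split_id = f"{table}#{len(self.splits(table))}"
        self._chunks[split_id] = list(rows)


conn = InMemoryConnector()
conn.create_table("orders", ["orderkey", "tax"])
conn.write("orders", [(1, 0.5), (2, 0.1)])
conn.write("orders", [(3, 0.2)])
rows = [row for split in conn.splits("orders") for row in conn.read(split)]
print(rows)
```

The point of the separation is that the engine can enumerate splits through one interface and hand each split to a different worker, which then reads it through another.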

IV. SYSTEM DESIGN

In this section we describe some of the key design decisions and features of the Presto engine. We describe the SQL dialect that Presto supports, then follow the query lifecycle all the way from client to distributed execution. We also describe some of the resource management mechanisms that enable multi-tenancy in Presto. Finally, we briefly discuss fault tolerance.

A. SQL Dialect

Presto closely follows the ANSI SQL specification [2]. While the engine does not implement every feature described, implemented features conform to the specification as far as possible. We have made a few carefully chosen extensions to the language to improve usability. For example, it is difficult to operate on complex data types, such as maps and arrays, in ANSI SQL. To simplify operating on these common data types, Presto syntax supports anonymous functions (lambda expressions) and built-in higher-order functions (e.g., transform, filter, reduce).
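
As a rough illustration of what these higher-order functions compute, the Python sketch below mirrors their semantics on a plain list. The Presto SQL form of each call is given in a comment; the `prices` array is made-up example data, and the Python code is an analogue, not how Presto evaluates lambdas.

```python
from functools import reduce

prices = [10, 25, 40]  # stand-in for a SQL ARRAY column value

# Presto SQL: transform(prices, x -> x * 2)
doubled = list(map(lambda x: x * 2, prices))

# Presto SQL: filter(prices, x -> x > 20)
expensive = list(filter(lambda x: x > 20, prices))

# Presto SQL: reduce(prices, 0, (s, x) -> s + x, s -> s)
total = reduce(lambda s, x: s + x, prices, 0)

print(doubled, expensive, total)
```

In SQL without lambdas, each of these would typically require unnesting the array into rows and re-aggregating, which is the usability problem the extension addresses.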

B. Client Interfaces, Parsing, and Planning

1) Client Interfaces: The Presto coordinator primarily exposes a RESTful HTTP interface to clients, and ships with a first-class command line interface. Presto also ships with a JDBC client, which enables compatibility with a wide variety of BI tools, including Tableau and MicroStrategy.
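
A minimal Python sketch of a client for the HTTP interface follows. It assumes the commonly documented protocol shape: the statement is POSTed to `/v1/statement` with an `X-Presto-User` header, and the client follows `nextUri` links until none is returned. The function names, the injectable `http` parameter, and the error handling are assumptions made for illustration; consult the client protocol documentation for the authoritative details.

```python
import json
import urllib.request
from typing import Callable, Dict, List, Optional

# Transport signature (an assumption for testability): method, url, headers, body -> JSON
Http = Callable[[str, str, Dict[str, str], Optional[bytes]], dict]


def urllib_http(method: str, url: str, headers: Dict[str, str],
                body: Optional[bytes]) -> dict:
    """Default transport built on the standard library."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_query(coordinator: str, sql: str, user: str,
              http: Http = urllib_http) -> List[list]:
    """Submit a statement and collect rows by following nextUri links."""
    headers = {"X-Presto-User": user}
    page = http("POST", f"{coordinator}/v1/statement", headers, sql.encode("utf-8"))
    rows: List[list] = []
    while True:
        if "error" in page:
            raise RuntimeError(page["error"].get("message", "query failed"))
        rows.extend(page.get("data", []))  # a page may carry no data yet
        next_uri = page.get("nextUri")
        if next_uri is None:               # no further pages: query is finished
            return rows
        page = http("GET", next_uri, headers, None)
```

Usage against a real cluster would look like `run_query("http://coordinator:8080", "SELECT 1", "alice")`, where the host and port are placeholders. Passing a fake `http` callable makes the paging loop testable without a network.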

2) Parsing: Presto uses an ANTLR-based parser to convert SQL statements into a syntax tree. The analyzer uses this tree to determine types and coercions, resolve functions and scopes, and extract logical components, such as subqueries, aggregations, and window functions.

3) Logical Planning: The logical planner uses the syntax tree and analysis information to generate an intermediate representation (IR) encoded in the form of a tree of plan nodes. Each node represents a physical or logical operation, and the children of a plan node are its inputs. The planner produces nodes that are purely logical, i.e. they do not contain any information about how the plan should be executed. Consider a simple query:

    SELECT
        orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem
        ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey

The logical plan for this query is outlined in Figure 2.

    Aggregate [SUM(tax)]
      LeftJoin [ON orderkey]
        Scan [orders]
        Filter [discount=0]
          Scan [lineitem]

Fig. 2. Logical Plan
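
The tree-of-plan-nodes representation can be sketched in a few lines of Python. The `PlanNode` class and `render` helper are hypothetical names used only to rebuild the plan of Figure 2; they do not reflect Presto's internal classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Hypothetical plan node: an operation plus its input sub-plans."""
    op: str
    detail: str = ""
    children: List["PlanNode"] = field(default_factory=list)


def render(node: PlanNode, depth: int = 0) -> List[str]:
    """Return one line per node, indented two spaces per tree level."""
    lines = ["  " * depth + f"{node.op} [{node.detail}]"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The logical plan of Figure 2: children are the inputs of their parent.
plan = PlanNode("Aggregate", "SUM(tax)", [
    PlanNode("LeftJoin", "ON orderkey", [
        PlanNode("Scan", "orders"),
        PlanNode("Filter", "discount=0", [PlanNode("Scan", "lineitem")]),
    ]),
])
print("\n".join(render(plan)))
```

Nothing in the nodes says where or how each operation runs, which matches the description of the plan as purely logical at this stage.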

C. Query Optimization

The plan optimizer transforms the logical plan into a more physical structure that represents an efficient execution strategy for the query. The process works by evaluating a set of transformation rules greedily until a fixed point is reached. Each rule has a pattern that can match a sub-tree of the query plan and determines whether the transformation should be applied. The result is a logically equivalent sub-plan that replaces the target of the match. Presto contains several rules, including well-known optimizations such as predicate and limit pushdown, column pruning, and decorrelation.
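
The greedy, rule-driven rewrite loop can be sketched as follows. This is a simplified Python illustration under stated assumptions: the `Node` type, the two limit rules, and the traversal order are invented for the example and are not Presto's rule set or matching machinery.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical immutable plan node."""
    op: str
    arg: object = None
    children: Tuple["Node", ...] = ()


# A rule returns an equivalent replacement sub-plan, or None if it does not match.
Rule = Callable[[Node], Optional[Node]]


def merge_limits(node: Node) -> Optional[Node]:
    """Limit(n, Limit(m, x)) -> Limit(min(n, m), x)."""
    if node.op == "Limit" and node.children[0].op == "Limit":
        inner = node.children[0]
        return Node("Limit", min(node.arg, inner.arg), inner.children)
    return None


def push_limit_through_project(node: Node) -> Optional[Node]:
    """Limit(n, Project(x)) -> Project(Limit(n, x)); projection keeps row count."""
    if node.op == "Limit" and node.children[0].op == "Project":
        project = node.children[0]
        return replace(project, children=(Node("Limit", node.arg, project.children),))
    return None


def apply_once(node: Node, rules: List[Rule]) -> Tuple[Node, bool]:
    """Try the rules at this node, otherwise recurse into its children."""
    for rule in rules:
        rewritten = rule(node)
        if rewritten is not None:
            return rewritten, True
    results = [apply_once(child, rules) for child in node.children]
    if any(changed for _, changed in results):
        return replace(node, children=tuple(n for n, _ in results)), True
    return node, False


def optimize(plan: Node, rules: List[Rule]) -> Node:
    """Apply rules greedily until no rule matches anywhere (a fixed point)."""
    changed = True
    while changed:
        plan, changed = apply_once(plan, rules)
    return plan


RULES = [merge_limits, push_limit_through_project]
plan = Node("Limit", 10, (Node("Project", "a", (Node("Limit", 5, (Node("Scan", "t"),)),)),))
optimized = optimize(plan, RULES)
print(optimized)
```

Here the outer limit is first pushed below the projection, which exposes two adjacent limits that the second pass merges; a third pass finds nothing to rewrite, so the loop stops.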

We are in the process of enhancing the optimizer to perform a more comprehensive exploration of the search space using a cost-based evaluation of plans based on the techniques introduced by the Cascades framework [13]. However, Presto already supports two cost-based optimizations that take table and column statistics into account: join strategy selection and join re-ordering. We will discuss only a few features of the optimizer; a detailed treatment is out of the scope of this paper.

1) Data Layouts: The optimizer can take advantage of the physical layout of the data when it is provided by the connector Data Layout API. Connectors report locations and other data properties such as partitioning, sorting, grouping,
