Orca Query Optimizer: The Core of Greenplum and Big Data Analytics

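Updated 2024-08-05 · PDF, 1.29 MB

Resource: "greenplum--orca查询优化器详解1"

In big data analytics, the query optimizer is a key factor in the performance of a data management system. As data volumes grew and demand for complex analytical queries rose, Pivotal developed a new query optimizer, Orca. Orca sits at the core of Pivotal's data management products, including Pivotal Greenplum Database and Pivotal HAWQ.

Orca's architecture is modular and combines state-of-the-art query optimization techniques with Pivotal's own research. The modular design makes Orca highly portable, so it can be applied in different data processing environments. Its central goal is to improve query performance so that large-scale data analysis stays efficient and accurate.

Orca's workflow consists of the following steps:

1. **Parsing and rewriting**: The input SQL query is first parsed into an abstract syntax tree (AST). Semantic analysis and query rewriting follow, for example eliminating redundant operations and merging joins, to improve the query structure.
2. **Statistics collection**: Orca uses statistics to estimate the cost of candidate execution plans. These include table sizes, column value distributions, and index usage frequency, and they guide the choice of execution strategy.
3. **Query planning**: Using a cost model, Orca generates candidate execution plans and compares their costs to select the best one. This covers the order of operations, whether to use parallel processing, and which join algorithm to use.
4. **Execution optimization**: During execution, the plan can be adjusted dynamically, for example by changing the scan strategy according to statistics observed at run time, or by reordering subplans while the query runs.

Orca's modular architecture allows its components to be upgraded and improved independently, for example by adding new optimization rules or refining the cost estimation method. Orca also supports several optimization techniques, including multi-stage optimization, rule-based transformations in the style of the Cascades framework, and cost-based optimization. For big data workloads, Orca pays particular attention to parallel processing and distributed computation. Because Pivotal Greenplum and HAWQ are both distributed database systems, Orca can spread the workload across multiple nodes and use all available resources to speed up queries.

In short, Orca is Pivotal's main tool for meeting query performance challenges at big data scale. Its modular design and optimization strategies let it handle large, complex queries well, which improves the effectiveness of the data management system as a whole.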

Figure 2: Interaction of Orca with database system
[Diagram labels: Database System (Parser, Catalog, Executor), Query2DXL, DXL2Plan, MD Provider, Orca; messages exchanged: DXL Query, DXL MD, DXL Plan; Query in, Results out.]

Figure 3: Orca architecture
[Diagram labels: Orca (Optimizer Tools: Search, Job Scheduler, Property Enforcement, Operators, Transformations, Card. Estimation, Cost Model; Memo; MD Cache; DXL Query; DXL Plan), GPOS (Memory Manager, Concurrency Control, Exception Handling, File I/O), OS.]

[...] for communication, such as input queries, output plans and metadata. Overlaid on DXL is a simple communication protocol to send the initial query structure and retrieve the optimized plan. A major benefit of DXL is that it allows Orca to be packaged as a stand-alone product.

Figure 2 shows the interaction between Orca and an external database system. The input to Orca is a DXL query. The output of Orca is a DXL plan. During optimization, the database system can be queried for metadata (e.g., table definitions). Orca abstracts metadata access details by allowing the database system to register a metadata provider (MD Provider) that is responsible for serializing metadata into DXL before it is sent to Orca. Metadata can also be consumed from regular files containing metadata objects serialized in DXL format.

The database system needs to include translators that consume/emit data in DXL format. The Query2DXL translator converts a query parse tree into a DXL query, while the DXL2Plan translator converts a DXL plan into an executable plan. These translators are implemented completely outside Orca, which allows multiple systems to use Orca by providing the appropriate translators.
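The round trip described above can be sketched in a few lines of Python. This is a toy model only: `DXLMessage`, `ToyOptimizer`, `register_md_provider`, the `CATALOG` dictionary and the colon-separated payload strings are all hypothetical names invented for illustration and do not reflect the real DXL format or Orca's API. The point is that the optimizer only ever sees DXL messages, while both translators and the metadata provider live on the database side.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DXLMessage:
    kind: str      # "query", "plan", or "metadata"
    payload: str   # toy string payload, NOT the real DXL encoding


def query2dxl(parse_tree: dict) -> DXLMessage:
    """Host-side translator: parse tree -> DXL query (hypothetical encoding)."""
    return DXLMessage("query", f"scan:{parse_tree['table']}")


def dxl2plan(plan: DXLMessage) -> dict:
    """Host-side translator: DXL plan -> executable plan (toy dictionary)."""
    op, table, rows = plan.payload.split(":")
    return {"op": op, "table": table, "est_rows": int(rows)}


class ToyOptimizer:
    """Stand-in for the optimizer: it consumes and emits DXL messages only."""

    def __init__(self) -> None:
        self._md_provider: Optional[Callable[[str], DXLMessage]] = None

    def register_md_provider(self, provider: Callable[[str], DXLMessage]) -> None:
        self._md_provider = provider

    def optimize(self, query: DXLMessage) -> DXLMessage:
        if self._md_provider is None:
            raise RuntimeError("no metadata provider registered")
        _, table = query.payload.split(":")
        md = self._md_provider(table)        # metadata also arrives as DXL
        rows = md.payload.split("=")[1]
        return DXLMessage("plan", f"SeqScan:{table}:{rows}")


CATALOG = {"t1": 1000}  # hypothetical catalog: table name -> row count


def md_provider(table: str) -> DXLMessage:
    """Host-side provider: serializes catalog metadata into a DXL message."""
    return DXLMessage("metadata", f"rows={CATALOG[table]}")


optimizer = ToyOptimizer()
optimizer.register_md_provider(md_provider)
executable = dxl2plan(optimizer.optimize(query2dxl({"table": "t1"})))
print(executable)
```

Because `ToyOptimizer` never touches `CATALOG` or the parse tree directly, a second host system could reuse it by supplying its own pair of translators and its own provider, which is the portability argument the text makes.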

The architecture of Orca is highly extensible; all components can be replaced individually and configured separately. Figure 3 shows the different components of Orca. We briefly describe these components as follows.

Memo. The space of plan alternatives generated by the optimizer is encoded in a compact in-memory data structure called the Memo [13]. The Memo structure consists of a set of containers called groups, where each group contains logically equivalent expressions. Memo groups capture the different sub-goals of a query (e.g., a filter on a table, or a join of two tables). Group members, called group expressions, achieve the group goal in different logical ways (e.g., different join orders). Each group expression is an operator that has other groups as its children. This recursive structure of the Memo allows compact encoding of a huge space of possible plans, as we illustrate in Section 4.1.
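A minimal sketch of the group and group-expression structure may help. The class names `Memo`, `Group` and `GroupExpression`, and the `count_plans` helper, are illustrative assumptions rather than Orca's actual classes. The example stores a three-table join and counts how many distinct expression trees the stored entries encode.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class GroupExpression:
    op: str
    children: Tuple[int, ...] = ()   # ids of child GROUPS, not expressions


@dataclass
class Group:
    expressions: List[GroupExpression] = field(default_factory=list)


class Memo:
    """Toy Memo: a list of groups of logically equivalent expressions."""

    def __init__(self) -> None:
        self.groups: List[Group] = []

    def new_group(self, *exprs: GroupExpression) -> int:
        self.groups.append(Group(list(exprs)))
        return len(self.groups) - 1

    def add(self, gid: int, expr: GroupExpression) -> None:
        # Duplicate detection keeps each alternative stored once per group.
        if expr not in self.groups[gid].expressions:
            self.groups[gid].expressions.append(expr)

    def count_plans(self, gid: int) -> int:
        """Number of distinct expression trees rooted at group `gid`."""
        total = 0
        for expr in self.groups[gid].expressions:
            combos = 1
            for child in expr.children:
                combos *= self.count_plans(child)
            total += combos
        return total


memo = Memo()
a = memo.new_group(GroupExpression("Get(A)"))
b = memo.new_group(GroupExpression("Get(B)"))
c = memo.new_group(GroupExpression("Get(C)"))
ab = memo.new_group(GroupExpression("InnerJoin", (a, b)))
memo.add(ab, GroupExpression("InnerJoin", (b, a)))      # other join order
abc = memo.new_group(GroupExpression("InnerJoin", (ab, c)))
memo.add(abc, GroupExpression("InnerJoin", (c, ab)))

stored = sum(len(g.expressions) for g in memo.groups)
print(stored, memo.count_plans(abc))
```

Here the root expressions point at the group `ab` rather than at one particular join order, so each alternative added to `ab` is shared by every parent. The number of encoded trees grows multiplicatively while storage grows additively.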

Search and Job Scheduler. Orca uses a search mechanism to navigate through the space of possible plan alternatives and identify the plan with the least estimated cost. The search mechanism is enabled by a specialized Job Scheduler that creates dependent or parallel work units to perform query optimization in three main steps: exploration, where equivalent logical expressions are generated; implementation, where physical plans are generated; and optimization, where required physical properties (e.g., sort order) are enforced and plan alternatives are costed. We discuss the details of optimization job scheduling in Section 4.2.
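One way to picture dependent work units is a small dependency-driven scheduler. The sketch below is an assumption-laden toy: the `JobScheduler` class, the job names, and the specific dependencies (each group runs explore, then implement, then optimize, and the parent's optimize job waits for the child's) are invented for illustration and are not taken from Orca's scheduler. It runs jobs sequentially; jobs that are ready at the same time are exactly the ones a real scheduler could run in parallel.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class JobScheduler:
    """Toy scheduler: a job runs once all jobs it depends on have finished."""

    def __init__(self) -> None:
        self._deps: Dict[str, Set[str]] = {}

    def add_job(self, name: str, depends_on: Iterable[str] = ()) -> None:
        self._deps[name] = set(depends_on)

    def run(self) -> List[str]:
        pending = {job: set(deps) for job, deps in self._deps.items()}
        ready = deque(sorted(j for j, deps in pending.items() if not deps))
        order: List[str] = []
        while ready:
            job = ready.popleft()
            order.append(job)            # "execute" the job
            del pending[job]
            for other in sorted(pending):
                deps = pending[other]
                if job in deps:
                    deps.discard(job)
                    if not deps:         # last dependency just finished
                        ready.append(other)
        if pending:
            raise ValueError("cyclic or unsatisfiable job dependencies")
        return order


scheduler = JobScheduler()
for group in ("scan", "join"):
    scheduler.add_job(f"explore:{group}")
    scheduler.add_job(f"implement:{group}", [f"explore:{group}"])
# Assumption: costing the join needs the best plan of its child group first.
scheduler.add_job("optimize:scan", ["implement:scan"])
scheduler.add_job("optimize:join", ["implement:join", "optimize:scan"])

order = scheduler.run()
print(order)
```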

Transformations. [13] Plan alternatives are generated by applying transformation rules that can produce either equivalent logical expressions (e.g., InnerJoin(A,B) → InnerJoin(B,A)), or physical implementations of existing expressions (e.g., Join(A,B) → HashJoin(A,B)). The results of applying transformation rules are copied into the Memo, which may result in creating new groups and/or adding new group expressions to existing groups. Each transformation rule is a self-contained component that can be explicitly activated/deactivated in Orca configurations.
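The two kinds of rules, and the ability to switch them on and off, can be sketched as follows. The rule names `JoinCommutativity` and `InnerJoin2HashJoin`, the tuple encoding of expressions, and the `apply_rules` driver are hypothetical and chosen only to mirror the two examples in the text. A flat set stands in for the Memo here.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

Expr = Tuple[str, str, str]   # (operator, left input, right input)


def commute_inner_join(expr: Expr) -> Optional[Expr]:
    """Logical rule: InnerJoin(A,B) -> InnerJoin(B,A)."""
    op, left, right = expr
    return ("InnerJoin", right, left) if op == "InnerJoin" else None


def implement_hash_join(expr: Expr) -> Optional[Expr]:
    """Implementation rule: InnerJoin(A,B) -> HashJoin(A,B)."""
    op, left, right = expr
    return ("HashJoin", left, right) if op == "InnerJoin" else None


# Each rule is self-contained and addressed by name, so a configuration
# can enable or disable it without touching the others.
RULES: Dict[str, Callable[[Expr], Optional[Expr]]] = {
    "JoinCommutativity": commute_inner_join,
    "InnerJoin2HashJoin": implement_hash_join,
}


def apply_rules(start: Expr, enabled: Iterable[str]) -> Set[Expr]:
    """Apply the enabled rules until no new expression is produced."""
    active = sorted(enabled)
    seen: Set[Expr] = {start}
    frontier = [start]
    while frontier:
        expr = frontier.pop()
        for name in active:
            result = RULES[name](expr)
            if result is not None and result not in seen:
                seen.add(result)          # "copy in" the new alternative
                frontier.append(result)
    return seen


all_rules = apply_rules(("InnerJoin", "A", "B"), RULES.keys())
logical_only = apply_rules(("InnerJoin", "A", "B"), ["JoinCommutativity"])
print(len(all_rules), len(logical_only))
```

With both rules active the driver reaches two logical and two physical alternatives. Disabling the implementation rule leaves only the two join orders, which shows how deactivating a rule shrinks the search space.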

Property Enforcement. Orca includes an extensible framework for describing query requirements and plan characteristics based on formal property specifications. Properties have different types including logical properties (e.g., output columns), physical properties (e.g., sort order and data distribution), and scalar properties (e.g., columns used in join conditions). During query optimization, each operator may request specific properties from its children. An optimized child plan may either satisfy the required properties on its own (e.g., an IndexScan plan delivers sorted data), or an enforcer (e.g., a Sort operator) needs to be plugged into the plan to deliver the required property. The framework allows each operator to control enforcer placement based on the child plans' properties and the operator's local behavior. We describe this framework in more detail in Section 4.1.
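The choice between a child that already delivers a property and one that needs an enforcer can be illustrated with sort order alone. Everything in this sketch is assumed for illustration: the `Plan` record, the prefix-based `satisfies` check, the fixed `sort_cost`, and the cost numbers are not Orca's property framework or cost model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

SortOrder = Tuple[str, ...]


@dataclass(frozen=True)
class Plan:
    op: str
    sort_order: Optional[SortOrder]   # delivered order; None means unsorted
    cost: float
    child: Optional["Plan"] = None


def satisfies(delivered: Optional[SortOrder], required: SortOrder) -> bool:
    """A delivered order satisfies a request if the request is a prefix of it."""
    return delivered is not None and delivered[: len(required)] == required


def enforce_sort(plan: Plan, required: SortOrder, sort_cost: float = 5.0) -> Plan:
    """Return the plan unchanged if it is already sorted, else add a Sort."""
    if satisfies(plan.sort_order, required):
        return plan
    return Plan("Sort", required, plan.cost + sort_cost, child=plan)


def best_plan(alternatives: Iterable[Plan], required: SortOrder) -> Plan:
    """Cheapest alternative after any needed enforcer has been added."""
    return min((enforce_sort(p, required) for p in alternatives),
               key=lambda p: p.cost)


index_scan = Plan("IndexScan", ("a",), cost=12.0)   # delivers sorted data
table_scan = Plan("TableScan", None, cost=4.0)      # cheaper, but unsorted

best = best_plan([index_scan, table_scan], required=("a",))
print(best.op, best.cost, best.child.op if best.child else None)
```

With these made-up costs, the sorted table scan (4.0 + 5.0) beats the index scan (12.0). Raising `sort_cost` above 8.0 would flip the decision, so enforcer placement has to be costed rather than decided by a fixed rule.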

Metadata Cache. Since metadata (e.g., table definitions) changes infrequently, shipping it with every query incurs an overhead. Orca caches metadata on the optimizer side and only retrieves pieces of it from the catalog if something is unavailable in the cache, or has changed since the last time it was loaded in the cache. The metadata cache also abstracts the database system details from the optimizer, which is particularly useful during testing and debugging.
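The caching policy in this paragraph (fetch on a miss, refetch when the object has changed) can be sketched with a version check. The `MDCache` class, the per-object version numbers, and the two catalog callbacks are assumptions made for this example. The excerpt does not say how Orca detects that metadata has changed.

```python
from typing import Any, Callable, Dict, Tuple


class MDCache:
    """Toy metadata cache keyed by object name, invalidated by version."""

    def __init__(self,
                 fetch: Callable[[str], Any],
                 current_version: Callable[[str], int]) -> None:
        self._fetch = fetch                      # expensive catalog lookup
        self._current_version = current_version  # assumed to be cheap
        self._entries: Dict[str, Tuple[int, Any]] = {}
        self.fetches = 0                         # counts full retrievals

    def get(self, name: str) -> Any:
        version = self._current_version(name)
        cached = self._entries.get(name)
        if cached is None or cached[0] != version:
            # Miss, or the object changed since it was loaded: refetch.
            self.fetches += 1
            cached = (version, self._fetch(name))
            self._entries[name] = cached
        return cached[1]


# Hypothetical catalog: name -> (version, table definition).
catalog = {"t1": (1, {"columns": ["a", "b"]})}

cache = MDCache(fetch=lambda n: catalog[n][1],
                current_version=lambda n: catalog[n][0])

first = cache.get("t1")
second = cache.get("t1")                 # served from the cache
catalog["t1"] = (2, {"columns": ["a", "b", "c"]})
third = cache.get("t1")                  # version changed, so refetched
print(cache.fetches, third["columns"])
```

Swapping the two callbacks for functions that read serialized metadata from a file would let the same cache run without a live database, in line with the testing and debugging use mentioned above.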

GPOS. In order to interact with operating systems with possibly different APIs, Orca uses an OS abstraction layer called GPOS. The GPOS layer provides Orca with an extensive infrastructure including a memory manager, primitives for concurrency control, exception handling, file I/O and synchronized data structures.

4. QUERY OPTIMIZATION

We describe Orca's optimization workflow in Section 4.1. We then show how the optimization process can be conducted in parallel in Section 4.2.
