Apache Calcite: A Core Framework for Optimizing Query Processing Across Heterogeneous Data Sources


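Apache Calcite is a foundational software framework designed to optimize query processing over heterogeneous data sources. It is presented in a paper by Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, and co-authors, and provides unified query processing, optimization, and query language support to popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. The paper's stated goal is to introduce Calcite to the broader research community by presenting its background, history, architecture, and adoption patterns, so that developers can understand and use it.

The key to Calcite's architecture is its modular, extensible design, which lets it fit a range of data processing scenarios, including big-data processing, real-time analytics, and interactive querying. Its optimizer ships with hundreds of built-in optimization rules for planning queries over heterogeneous data. Calcite also supports multiple query languages, SQL among them, so it can be integrated into an existing technology stack.

The paper first outlines Calcite's history, from the initial concept to its current wide adoption. It then examines the architecture in detail, including core components such as the parser, the optimizer, the execution engine, and data model management, and how they work together. It also discusses features such as dynamic programming, type inference, and metadata management, which matter for query optimization. For prospective adopters, it covers how to integrate and customize the framework, with best practices and case studies. Finally, it considers Calcite's relationship to the mainstream big-data ecosystem and possible future directions, such as support for new data storage formats and optimization strategies for cloud environments.

As a piece of data infrastructure, Apache Calcite matters to researchers in data processing and gives developers on real projects considerable functionality and flexibility. Readers of the paper can expect both an account of how Calcite works and practical guidance on applying it in their own projects.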

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Edmon Begoli

Oak Ridge National Laboratory

(ORNL)

Oak Ridge, Tennessee, USA

begolie@ornl.gov

Jesús Camacho-Rodríguez

Hortonworks Inc.

Santa Clara, California, USA

jcamacho@hortonworks.com

Julian Hyde

Hortonworks Inc.

Santa Clara, California, USA

jhyde@hortonworks.com

Michael J. Mior

David R. Cheriton School of

Computer Science

University of Waterloo

Waterloo, Ontario, Canada

mmior@uwaterloo.ca

Daniel Lemire

University of Quebec (TELUQ)

Montreal, Quebec, Canada

lemire@gmail.com

ABSTRACT

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. The goal of this paper is to formally introduce Calcite to the broader research community, briefly present its history, and describe its architecture, features, functionality, and patterns for adoption. Calcite’s architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for the new types of data sources, query languages, and approaches to query processing and optimization.

CCS CONCEPTS

• Information systems → DBMS engine architectures;

KEYWORDS

Apache Calcite, Relational Semantics, Data Management, Query Algebra, Modular Query Optimization, Storage Adapters

ACM Reference Format:

Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. In SIGMOD’18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3183713.3190662

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

SIGMOD’18, June 10–15, 2018, Houston, TX, USA

© 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.

ACM ISBN 978-1-4503-4703-7/18/06... $15.00

https://doi.org/10.1145/3183713.3190662

1 INTRODUCTION

Following the seminal System R, conventional relational database engines dominated the data processing landscape. Yet, as far back as 2005, Stonebraker and Çetintemel [ ] predicted that we would see the rise of a collection of specialized engines such as column stores, stream processing engines, text search engines, and so forth. They argued that specialized engines can offer more cost-effective performance and that they would bring the end of the “one size fits all” paradigm. Their vision seems today more relevant than ever. Indeed, many specialized open-source data systems have since become popular, such as Storm [ ] and Flink [ ] (stream processing), Elasticsearch [ ] (text search), Apache Spark [ ], Druid [ ], etc.

As organizations have invested in data processing systems tailored towards their specific needs, two overarching problems have arisen:

• The developers of such specialized systems have encountered related problems, such as query optimization [ ] or the need to support query languages such as SQL and related extensions (e.g., streaming queries [ ]) as well as language-integrated queries inspired by LINQ [ ]. Without a unifying framework, having multiple engineers independently develop similar optimization logic and language support wastes engineering effort.

• Programmers using these specialized systems often have to integrate several of them together. An organization might rely on Elasticsearch, Apache Spark, and Druid. We need to build systems capable of supporting optimized queries across heterogeneous data sources [55].

Apache Calcite was developed to solve these problems. It is a complete query processing system that provides much of the common functionality (query execution, optimization, and query languages) required by any database management system, except for data storage and management, which are left to specialized

Industry 1: Adaptive Query Processing

