Apache Calcite: A Foundational Framework for Optimized Query Processing


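Apache Calcite is a foundational software framework designed to optimize query processing over heterogeneous data sources. It is made up of several modules, is highly extensible and flexible, and supports popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Its main components and characteristics are:

1. **Modular and extensible query optimizer**: The query optimizer is the core of Calcite's architecture and ships with hundreds of built-in optimization rules. These rules can be selected and combined to suit particular data structures, query types, and workloads, producing efficient query execution plans. Because the design is modular, users can add their own optimization rules as requirements change.
2. **Multi-language query processing**: Calcite can process a variety of query languages, so developers are not tied to one particular syntax and can write SQL queries that work across different systems, which improves compatibility and ease of use.
3. **Adapter architecture**: To support a wide range of data sources and storage engines, Calcite uses a flexible adapter architecture. It lets developers integrate a new data source or storage technology into the framework so that queries can access and operate on different data stores.
4. **Support for heterogeneous data**: Because it works with many data sources, Calcite is designed to handle data that differs in format, structure, and velocity. This makes it a good fit for large data sets, particularly in real-time stream processing and batch analytics scenarios.
5. **Author background**: The paper's authors include experts from organizations such as Hortonworks and the University of Waterloo, who have deep expertise in the development and optimization of Apache Calcite.

In summary, Apache Calcite is a low-level framework that gives modern data processing systems a broad set of query processing capabilities. Its optimizer, multi-language support, and adapters make it well suited to complex data queries and a useful tool for data engineers and analysts.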

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Edmon Begoli

Oak Ridge National Laboratory (ORNL)

Oak Ridge, Tennessee, USA

begolie@ornl.gov

Jesús Camacho-Rodríguez

Hortonworks Inc.

Santa Clara, California, USA

jcamacho@hortonworks.com

Julian Hyde

Hortonworks Inc.

Santa Clara, California, USA

jhyde@hortonworks.com

Michael J. Mior

David R. Cheriton School of Computer Science

University of Waterloo

Waterloo, Ontario, Canada

mmior@uwaterloo.ca

Daniel Lemire

University of Quebec (TELUQ)

Montreal, Quebec, Canada

lemire@gmail.com

ABSTRACT

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite’s architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new types of data sources, query languages, and approaches to query processing and optimization.

CCS CONCEPTS

• Information systems → DBMS engine architectures;

KEYWORDS

Apache Calcite, Relational Semantics, Data Management, Query Algebra, Modular Query Optimization, Storage Adapters

1 INTRODUCTION

Following the seminal System R, conventional relational database engines dominated the data processing landscape. Yet, as far back as 2005, Stonebraker and Çetintemel [ ] predicted that we would see the rise of a collection of specialized engines such as column stores, stream processing engines, text search engines, and so forth. They argued that specialized engines can offer more cost-effective performance and that they would bring the end of the “one size fits all” paradigm. Their vision seems today more relevant than ever. Indeed, many specialized open-source data systems have since become popular, such as Storm [ ] and Flink [ ] (stream processing), Elasticsearch [ ] (text search), Apache Spark [ ], Druid [ ], etc.

Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

SIGMOD’18, June 10–15, 2018, Houston, TX, USA

© 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.

ACM ISBN 978-1-4503-4703-7/18/06... $15.00

https://doi.org/10.1145/3183713.3190662

As organizations have invested in data processing systems tailored towards their specific needs, two overarching problems have arisen:

• The developers of such specialized systems have encountered related problems, such as query optimization [ ] or the need to support query languages such as SQL and related extensions (e.g., streaming queries [ ]) as well as language-integrated queries inspired by LINQ [ ]. Without a unifying framework, having multiple engineers independently develop similar optimization logic and language support wastes engineering effort.

• Programmers using these specialized systems often have to integrate several of them together. An organization might rely on Elasticsearch, Apache Spark, and Druid. We need to build systems capable of supporting optimized queries across heterogeneous data sources [55].

Apache Calcite was developed to solve these problems. It is a complete query processing system that provides much of the common functionality (query execution, optimization, and query languages) required by any database management system, except for data storage and management, which are left to specialized engines. Calcite was quickly adopted by Hive, Drill [ ], Storm, and many other data processing engines, providing them with advanced query optimizations and query languages. For example, Hive [ ] is a popular data warehouse project built on top of Apache Hadoop. As Hive moved from its batch processing roots towards an interactive SQL query answering platform, it became clear that the project needed a powerful optimizer at its core. Thus, Hive adopted Calcite as its optimizer and their integration has been growing since. Many other projects and products have followed suit, including Flink, MapD [12], etc.

http://calcite.apache.org/docs/powered_by
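As a rough illustration of what an optimizer built from pluggable rules looks like, the toy sketch below represents a query plan as a small tree and applies independent rewrite rules until none of them fires. It is not Calcite's API: the `Scan` and `Filter` classes, the two rules, and the `optimize` driver are hypothetical names invented for this example, and filter conditions are kept as plain strings for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass(frozen=True)
class Scan:
    """Leaf node: read every row of a table (hypothetical plan node)."""
    table: str


@dataclass(frozen=True)
class Filter:
    """Keep only the rows of `child` that satisfy `condition`."""
    condition: str
    child: "Node"


Node = Union[Scan, Filter]
# A rule inspects one node and returns a rewritten node, or None if it does not apply.
Rule = Callable[[Node], Optional[Node]]


def drop_true_filter(node: Node) -> Optional[Node]:
    """Filter(TRUE, x) keeps every row, so it can be replaced by x."""
    if isinstance(node, Filter) and node.condition == "TRUE":
        return node.child
    return None


def merge_filters(node: Node) -> Optional[Node]:
    """Two stacked filters are equivalent to one filter on the conjunction."""
    if isinstance(node, Filter) and isinstance(node.child, Filter):
        inner = node.child
        return Filter(f"({inner.condition}) AND ({node.condition})", inner.child)
    return None


def optimize(node: Node, rules: List[Rule]) -> Node:
    """Rewrite the children first, then apply the rules here until none fires."""
    if isinstance(node, Filter):
        node = Filter(node.condition, optimize(node.child, rules))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            rewritten = rule(node)
            if rewritten is not None:
                node = rewritten
                changed = True
    return node


# The rule set is just a list, so extending the optimizer means appending a function.
RULES: List[Rule] = [drop_true_filter, merge_filters]

plan = Filter("TRUE", Filter("age > 30", Filter("country = 'CA'", Scan("users"))))
optimized = optimize(plan, RULES)
print(optimized)
```

Adding another rewrite only requires writing one more function and appending it to `RULES`, which is the kind of extensibility the abstract attributes to a modular optimizer. In this naive driver the order of the rules can change the final plan, so it should be read as a sketch of the idea rather than of any production planning strategy.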

arXiv:1802.10233v1 [cs.DB] 28 Feb 2018
