[Figure 1: Interfaces to Spark SQL, and interaction with Spark. Diagram labels: JDBC Console; User Programs (Java, Scala, Python); DataFrame API; Catalyst Optimizer; Spark; Resilient Distributed Datasets.]
3.1 DataFrame API
The main abstraction in Spark SQL’s API is a DataFrame, a dis-
tributed collection of rows with a homogeneous schema. A DataFrame
is equivalent to a table in a relational database, and can also be
manipulated in similar ways to the “native” distributed collections
in Spark (RDDs). (We chose the name DataFrame because it is similar to structured data libraries in R and Python, and designed our API to resemble those.)
Unlike RDDs, DataFrames keep track of their
schema and support various relational operations that lead to more
optimized execution.
DataFrames can be constructed from tables in a system cata-
log (based on external data sources) or from existing RDDs of
native Java/Python objects (Section 3.5). Once constructed, they
can be manipulated with various relational operators, such as where
and groupBy, which take expressions in a domain-specific language
(DSL) similar to data frames in R and Python [32, 30]. Each
DataFrame can also be viewed as an RDD of Row objects, allowing
users to call procedural Spark APIs such as map. (These Row objects are constructed on the fly and do not necessarily represent the internal storage format of the data, which is typically columnar.)
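As a minimal sketch (not from the paper; it assumes a DataFrame df whose first column is a string), the Row view can be processed with ordinary procedural operations:
import org.apache.spark.sql.Row

// Drop down to the RDD[Row] view and apply an arbitrary Scala function.
// Fields are accessed positionally; the getter must match the column's type.
val upper = df.rdd.map((row: Row) => row.getString(0).toUpperCase)
upper.take(5).foreach(println)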
Finally, unlike traditional data frame APIs, Spark DataFrames
are lazy, in that each DataFrame object represents a logical plan to
compute a dataset, but no execution occurs until the user calls a spe-
cial “output operation” such as save. This enables rich optimization
across all operations that were used to build the DataFrame.
To illustrate, the Scala code below defines a DataFrame from a
table in Hive, derives another based on it, and prints a result:
ctx = new HiveContext()
users = ctx.table("users")
young = users.where(users("age") < 21)
println(young.count())
In this code, users and young are DataFrames. The snippet
users("age") < 21 is an expression in the data frame DSL, which
is captured as an abstract syntax tree rather than representing a
Scala function as in the traditional Spark API. Finally, each DataFrame
simply represents a logical plan (i.e., read the users table and filter
for age < 21). When the user calls count, which is an output opera-
tion, Spark SQL builds a physical plan to compute the final result.
This might include optimizations such as only scanning the “age”
column of the data if its storage format is columnar, or even using
an index in the data source to count the matching rows.
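Before triggering execution, the plan Spark SQL has built can also be printed (a usage sketch, not part of the paper's example; explain is the standard DataFrame method for this):
// Prints the logical plans and the chosen physical plan for the lazy DataFrame.
// Nothing is computed yet; execution starts only at an output operation such as count.
young.explain(true)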
We next cover the details of the DataFrame API.
3.2 Data Model
Spark SQL uses a nested data model based on Hive [19] for ta-
bles and DataFrames. It supports all major SQL data types, includ-
ing boolean, integer, double, decimal, string, date, and timestamp,
as well as complex (i.e., non-atomic) data types: structs, arrays,
maps and unions. Complex data types can also be nested together
to create more powerful types. Unlike many traditional DBMSes,
Spark SQL provides first-class support for complex data types in
the query language and the API. In addition, Spark SQL also sup-
ports user-defined types, as described in Section 4.4.2.
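To make the nesting concrete, the sketch below (illustrative only; the column names are invented, and it uses the public org.apache.spark.sql.types classes) builds a schema combining structs, arrays, and maps:
import org.apache.spark.sql.types._

// A nested schema: a name, an array of tags, a map of numeric metrics,
// and a nested address struct.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("tags", ArrayType(StringType)),
  StructField("metrics", MapType(StringType, DoubleType)),
  StructField("address", StructType(Seq(
    StructField("city", StringType),
    StructField("zip", IntegerType))))))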
Using this type system, we have been able to accurately model
data from a variety of sources and formats, including Hive, rela-
tional databases, JSON, and native objects in Java/Scala/Python.
3.3 DataFrame Operations
Users can perform relational operations on DataFrames using a
domain-specific language (DSL) similar to R data frames [32] and
Python Pandas [30]. DataFrames support all common relational
operators, including projection (select), filter (where), join, and
aggregations (groupBy). These operators all take expression ob-
jects in a limited DSL that lets Spark capture the structure of the
expression. For example, the following code computes the number
of female employees in each department.
employees
.join(dept , employees("deptId") === dept("id"))
.where(employees("gender") === "female")
.groupBy(dept("id"), dept("name"))
.agg(count("name"))
Here, employees is a DataFrame, and employees("deptId") is
an expression representing the deptId column. Expression ob-
jects have many operators that return new expressions, including
the usual comparison operators (e.g., === for equality test, > for
greater than) and arithmetic ones (+, -, etc). They also support ag-
gregates, such as count("name"). All of these operators build up an
abstract syntax tree (AST) of the expression, which is then passed
to Catalyst for optimization. This is unlike the native Spark API
that takes functions containing arbitrary Scala/Java/Python code,
which are then opaque to the runtime engine. For a detailed listing
of the API, we refer readers to Spark’s official documentation [6].
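The difference is easiest to see side by side (an illustrative sketch, not from the paper; it assumes employees also has an age column, and employeesRDD is an RDD of a hypothetical Employee case class holding the same records):
// Hypothetical element type for the plain-RDD version of the same data.
case class Employee(name: String, gender: String, deptId: Int, age: Int)

// DataFrame version: employees("age") > 30 is captured as an expression AST
// that Catalyst can analyze, reorder, and push down to the data source.
val adultsDF = employees.where(employees("age") > 30)

// Plain RDD version: the predicate is an opaque Scala closure, so the engine
// can only run it as-is on every record.
val adultsRDD = employeesRDD.filter((e: Employee) => e.age > 30)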
Apart from the relational DSL, DataFrames can be registered as
temporary tables in the system catalog and queried using SQL. The
code below shows an example:
users.where(users("age") < 21)
  .registerTempTable("young")
ctx.sql("SELECT count(*), avg(age) FROM young")
SQL is sometimes convenient for computing multiple aggregates
concisely, and also allows programs to expose datasets through JD-
BC/ODBC. The DataFrames registered in the catalog are still un-
materialized views, so that optimizations can happen across SQL
and the original DataFrame expressions. However, DataFrames can
also be materialized, as we discuss in Section 3.6.
3.4 DataFrames versus Relational Query Languages
While on the surface, DataFrames provide the same operations as
relational query languages like SQL and Pig [29], we found that
they can be significantly easier for users to work with thanks to
their integration in a full programming language. For example,
users can break up their code into Scala, Java or Python functions
that pass DataFrames between them to build a logical plan, and
will still benefit from optimizations across the whole plan when
they run an output operation. Likewise, developers can use control
structures like if statements and loops to structure their work. One
user said that the DataFrame API is “concise and declarative like
SQL, except I can name intermediate results,” referring to how it is
easier to structure computations and debug intermediate steps.
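A short sketch of this style (hypothetical function and column names, not from the paper): each helper takes and returns a DataFrame, and ordinary control flow decides which operators to apply, yet the whole pipeline is still optimized as one plan when the output operation runs.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.count

// Each step only extends the logical plan; nothing executes here.
def onlyAdults(df: DataFrame): DataFrame = df.where(df("age") >= 21)

def maybeByCountry(df: DataFrame, byCountry: Boolean): DataFrame =
  if (byCountry) df.groupBy(df("country")).agg(count("id")) else df

// The plan spans both function calls and the if statement, and is
// optimized as a whole when count() (an output operation) runs.
val result = maybeByCountry(onlyAdults(users), byCountry = true)
println(result.count())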
To simplify programming in DataFrames, we also made the API analyze
logical plans eagerly (i.e., to identify whether the column