Exploring "Apache Spark: The Definitive Guide": A Practical, Simple Approach to Big Data

Spark

Updated 2024-07-19 · PDF, 3.96 MB


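"Apache Spark: The Definitive Guide" is a forthcoming book by Bill Chambers and Matei Zaharia that offers in-depth guidance on handling the complexity of big data with Apache Spark. Since the Spark project began, its influence and pace of innovation have kept growing, as the scale of Spark Summit 2017 made clear. Databricks, as a partner, is offering preview versions of chapters 2, 3, 4, and 5 as a free download. These chapters range from basic architecture to more advanced usage.

Chapter 2, a gentle introduction to Spark, walks readers step by step through Spark's core components. It covers basic cluster concepts and how Spark applications work with Spark's structured APIs (such as DataFrames and SQL). The authors explain the core terms and concepts in detail so that readers can start using Spark right away. The chapter opens with some background, for example:

1. **Spark's basic architecture**: A "computer" usually means a single machine at home or at work. In Spark, the more important concept is the cluster, a distributed computing environment made up of many machines. Spark applications run on the cluster and use those machines together to process large-scale data.
2. **Spark applications**: These comprise the driver program, executors, tasks, and datasets (RDDs, Resilient Distributed Datasets). Together these components form Spark's workflow, so that data can be read, transformed, and analyzed efficiently in a distributed environment.
3. **Structured APIs**: Spark's DataFrame and SQL interfaces are its main highlights. DataFrames offer data operations similar to those of a relational database, while SQL simplifies querying and processing, so users do not need to write complex MapReduce code.
4. **Terms and concepts**: In-memory computation, lazy execution, fault tolerance, and similar ideas are essential for understanding and using Spark.

Chapter 3 may go deeper into Spark's computation model, for example data partitioning, task scheduling, and optimization strategies. Chapters 4 and 5 may explore more specific topics, such as the Spark ecosystem (the roles of modules like Spark SQL, Spark Streaming, and MLlib), performance tuning, and integration with other technologies (such as Hadoop and Kafka).

By reading these chapters, readers can pick up basic Spark skills and also learn how to get the most out of Spark in real-world scenarios. Subscribing to the Databricks blog provides updates on later chapters and on developments in Spark. "Spark: The Definitive Guide" is a strong resource for learning Spark and deepening one's understanding of it, and a valuable reference for IT professionals who work on big data processing and analysis.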


Any DataFrame can be made into a table or view with one simple method call.

%scala
flightData2015.createOrReplaceTempView("flight_data_2015")

%python
flightData2015.createOrReplaceTempView("flight_data_2015")

Now we can query our data in SQL. To execute a SQL query, we'll use the spark.sql function (remember, spark is our SparkSession variable), which conveniently returns a new DataFrame. While this may seem a bit circular in logic (a SQL query against a DataFrame returns another DataFrame), it's actually quite powerful. As a user, you can specify transformations in the manner most convenient to you at any given point in time and not have to trade any efficiency to do so! To see that this is the case, let's take a look at two explain plans.

%scala
val sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

val dataFrameWay = flightData2015
  .groupBy('DEST_COUNTRY_NAME)
  .count()

sqlWay.explain
dataFrameWay.explain

%python
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

the data.

Therefore, the third step is to specify the aggregation. Let's use the sum aggregation method. This takes as input a column expression or, simply, a column name. The result of the sum method call is a new DataFrame. You'll see that it has a new schema but that it does know the type of each column. It's important to reinforce (again!) that no computation has been performed. This is simply another transformation that we've expressed, and Spark is simply able to trace the type information we have supplied.

The fourth step is a simple renaming. We use the withColumnRenamed method, which takes two arguments: the original column name and the new column name. Of course, this doesn't perform computation; this is just another transformation!
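To make the "no computation yet" point concrete, here is a minimal plain-Python analogy. It is not Spark code and not how Spark is implemented: the `LazyFrame` class, its method names, and the sample rows are all hypothetical, invented for illustration. The sketch only records each transformation and applies the recorded steps when an action (`collect`) is called.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame (hypothetical): records steps, runs them on collect()."""

    def __init__(self, rows, steps=()):
        self._rows = rows            # source data: a list of dicts
        self._steps = tuple(steps)   # recorded transformations, not yet executed

    def transform(self, fn):
        # A transformation returns a new LazyFrame; nothing is computed here.
        return LazyFrame(self._rows, self._steps + (fn,))

    def with_column_renamed(self, old, new):
        def rename(rows):
            return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
        return self.transform(rename)

    def limit(self, n):
        return self.transform(lambda rows: rows[:n])

    def collect(self):
        # The action: only now are the recorded steps applied, in order.
        rows = self._rows
        for step in self._steps:
            rows = step(rows)
        return rows


calls = []  # records when work actually happens


def traced(rows):
    calls.append("ran")
    return rows


frame = LazyFrame([{"sum(count)": 3}, {"sum(count)": 7}])  # made-up rows
planned = (
    frame.transform(traced)
    .with_column_renamed("sum(count)", "destination_total")
    .limit(1)
)
assert calls == []           # building the chain executed nothing
result = planned.collect()   # the action triggers execution
```

After `collect()` runs, `calls` holds a single entry and `result` holds the one renamed row. Spark goes much further than this toy, since it can inspect and optimize the recorded plan before running it.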

The fih step sorts the data such that if we were to take results o of the top of the DataFrame, they would be the

largest values found in the destination_total column.

You likely noticed that we had to import a function to do this, the desc function. You might also notice that desc does not return a string but a Column. In general, many DataFrame methods will accept strings (as column names) or Column types or expressions. Columns and expressions are actually the exact same thing.

The final step is a limit. This specifies that we only want five values. It is like a filter except that it filters by position (lazily) instead of by value. It's safe to say that it basically just specifies a DataFrame of a certain size.

The last step is our action! Now we actually begin the process of collecting the results of our DataFrame above, and Spark will give us back a list or array in the language that we're executing. To reinforce all of this, let's look at the explain plan for the above query.

%scala
import org.apache.spark.sql.functions.desc

flightData2015
  .groupBy("DEST_COUNTRY_NAME")
  .sum("count")
  .withColumnRenamed("sum(count)", "destination_total")
  .sort(desc("destination_total"))
  .limit(5)
  .explain()

%python
flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
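As a rough illustration of what the chain of steps described above computes, the following plain-Python sketch reproduces the same logic on a handful of made-up rows. It assumes nothing about the real flight data set, uses no Spark APIs, and runs eagerly, unlike Spark's lazy transformations. The column names come from the excerpt; every value is invented.

```python
from collections import defaultdict

# Made-up sample rows; the real flight data set is not reproduced here.
rows = [
    {"DEST_COUNTRY_NAME": "United States", "count": 15},
    {"DEST_COUNTRY_NAME": "Canada", "count": 4},
    {"DEST_COUNTRY_NAME": "United States", "count": 5},
    {"DEST_COUNTRY_NAME": "Mexico", "count": 7},
    {"DEST_COUNTRY_NAME": "Canada", "count": 2},
]

# Group by destination and sum the "count" column (the aggregation step).
totals = defaultdict(int)
for row in rows:
    totals[row["DEST_COUNTRY_NAME"]] += row["count"]

# Name the aggregated column "destination_total" (the renaming step).
renamed = [
    {"DEST_COUNTRY_NAME": name, "destination_total": total}
    for name, total in totals.items()
]

# Sort by destination_total, largest first, then keep at most five rows.
top_five = sorted(renamed, key=lambda r: r["destination_total"], reverse=True)[:5]

for r in top_five:
    print(r["DEST_COUNTRY_NAME"], r["destination_total"])
```

With these sample rows only three destinations exist, so the limit of five returns all three, ordered from the largest total to the smallest.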
