Spark Architecture and Parallel Computing Principles

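
"Spark Basic Architecture and Principles Tutorial"

Apache Spark is a parallel computing framework that was developed and open-sourced by the UC Berkeley AMP Lab to address speed and efficiency in large-scale data processing. Its design goal is a platform that is more efficient and easier to use than Hadoop MapReduce, especially for workloads that need iterative computation, such as data mining and machine learning.

Spark's central feature is its in-memory computing model. Unlike Hadoop MapReduce, a Spark job can keep intermediate results in memory, which avoids the I/O cost of repeatedly writing them to and reading them from HDFS (Hadoop Distributed File System). This gives Spark a marked performance advantage on iterative computations, because each iteration can reuse data that is already in memory instead of reloading it from disk.

The Spark architecture consists mainly of the following components:

1. **Driver Program**: the entry point of a Spark application. It creates the SparkContext and defines the RDDs (resilient distributed datasets) and the computations on them.

2. **SparkContext**: the main control object of a Spark application. It connects to the Spark cluster and manages the application's life cycle, including creating RDDs and scheduling tasks.

3. **Cluster Manager**: a resource manager such as YARN or Mesos that coordinates resource allocation and provides the runtime environment for the application.

4. **Worker Nodes**: the nodes that perform the actual computation. Their executor processes are allocated through the cluster manager, receive tasks from the driver, and hold data in local memory or on disk.

5. **RDD (Resilient Distributed Dataset)**: the core abstraction in Spark, an immutable, partitioned collection of data that can be operated on in parallel across the nodes of a cluster.

6. **Transformation** and **Action**: the two kinds of Spark operations. A transformation defines processing logic but is not executed immediately; an action triggers the actual computation and returns the result to the driver or writes it to external storage.

Spark offers programming interfaces in Scala, Java, Python, and R, so developers from different backgrounds can use it. It also provides Spark SQL for structured data, Spark Streaming for stream processing, MLlib for machine learning, and GraphX for graph computation. Together these libraries make Spark a general-purpose big data processing platform.

When deploying and using Spark, users should respect the trademarks of Cloudera, the Apache Software Foundation, and other trademark holders. Complying with applicable copyright law is each user's responsibility, and no part of the document may be reproduced, stored, or transmitted without permission.

With its in-memory computing model, its libraries, and its multi-language interfaces, Spark is particularly well suited to workloads that need fast iterative computation. Understanding its basic architecture and principles is a practical foundation for large-scale data processing.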

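The difference between transformations and actions, and the benefit of keeping intermediate results in memory, can be tried out without a cluster. The following is a minimal sketch that uses a lazy `view` from the Scala standard library as a stand-in for an RDD. It is an analogy only, not Spark's implementation, and the names `LazyEvalSketch`, `Run`, `expensiveSquare`, `runUncached`, and `runCached` are hypothetical. The RDD operations named in the comments (`map`, `reduce`, `cache`) are the Spark calls that play the corresponding role.

```scala
object LazyEvalSketch {
  // Counts how often the "expensive" function is actually executed.
  private var evaluations = 0

  // afterDefinition: evaluations once the pipeline has been defined
  // afterActions:    evaluations after two results have been computed
  final case class Run(afterDefinition: Int, total: Int, largest: Int, afterActions: Int)

  private def expensiveSquare(x: Int): Int = {
    evaluations += 1
    x * x
  }

  /** Lazy pipeline: every "action" recomputes it, like an uncached RDD. */
  def runUncached(data: Seq[Int]): Run = {
    evaluations = 0
    // "Transformation": only describes the work (Spark analog: rdd.map(...)).
    val squares = data.view.map(expensiveSquare)
    val afterDefinition = evaluations
    // "Actions": each one forces a full pass (Spark analog: rdd.reduce(...)).
    val total = squares.sum
    val largest = squares.max
    Run(afterDefinition, total, largest, evaluations)
  }

  /** Materialized pipeline: computed once, then reused (Spark analog: rdd.cache()). */
  def runCached(data: Seq[Int]): Run = {
    evaluations = 0
    // Unlike cache(), toVector computes immediately; the point is the reuse below.
    val squares = data.view.map(expensiveSquare).toVector
    val afterDefinition = evaluations
    val total = squares.sum
    val largest = squares.max
    Run(afterDefinition, total, largest, evaluations)
  }

  def main(args: Array[String]): Unit = {
    val data = Vector(1, 2, 3, 4, 5)
    println(runUncached(data))
    println(runCached(data))
  }
}
```

With five input elements, the uncached run performs no evaluations until the first result is requested and ten in total for the two results, while the materialized run performs five and reuses them. An iterative algorithm that reads the same dataset on every pass benefits from this reuse in the same way.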
Developing Spark Applications

When you are ready to move beyond running core Spark applications in an interactive shell, you need best practices for building, packaging, and configuring applications and using the more advanced APIs. This section describes:

• How to develop, package, and run Spark applications.

• Aspects of using Spark APIs beyond core Spark.

• How to access data stored in various file formats, such as Parquet and Avro.

• How to access data stored in cloud storage systems, such as Amazon S3.

• Best practices in building and configuring Spark applications.

Developing and Running a Spark WordCount Application

This tutorial describes how to write, compile, and run a simple Spark word count application in three of the languages supported by Spark: Scala, Python, and Java. The Scala and Java code was originally developed for a Cloudera tutorial written by Sandy Ryza.

Writing the Application

The example application is an enhanced version of WordCount, the canonical MapReduce example. In this version of WordCount, the goal is to learn the distribution of letters in the most popular words in a corpus. The application:

1. Creates a SparkConf and SparkContext. A Spark application corresponds to an instance of the SparkContext class. When running a shell, the SparkContext is created for you.

2. Gets a word frequency threshold.

3. Reads an input set of text documents.

4. Counts the number of times each word appears.

5. Filters out all words that appear fewer times than the threshold.

6. For the remaining words, counts the number of times each letter occurs.

In MapReduce, this requires two MapReduce applications, as well as persisting the intermediate data to HDFS between them. In Spark, this application requires about 90 percent fewer lines of code than one developed using the MapReduce API.
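Before looking at the Spark programs, the data flow of steps 3 to 6 can be traced on a small in-memory collection. The following is a minimal sketch using plain Scala collections rather than RDDs, so the result can be checked by hand. The names `LocalLetterCount` and `letterCounts` are hypothetical, and the sketch assumes that step 6 counts each letter once per distinct remaining word, which is one reading of the step.

```scala
object LocalLetterCount {
  /** Applies steps 3 to 6 to documents held in memory. */
  def letterCounts(documents: Seq[String], threshold: Int): Map[Char, Int] = {
    // Step 3: split each document into words on single spaces.
    val words = documents.flatMap(_.split(" "))

    // Step 4: count how many times each word appears
    // (a local analog of map((_, 1)).reduceByKey(_ + _) on an RDD).
    val wordCounts: Map[String, Int] =
      words.groupBy(identity).map { case (word, hits) => (word, hits.size) }

    // Step 5: drop words that appear fewer times than the threshold.
    val frequent = wordCounts.filter { case (_, count) => count >= threshold }

    // Step 6: count letters across the remaining distinct words
    // (assumption: each word contributes its letters once).
    frequent.keys.toSeq
      .flatMap(_.toSeq)
      .groupBy(identity)
      .map { case (letter, hits) => (letter, hits.size) }
  }

  def main(args: Array[String]): Unit = {
    val docs = Seq("the cat and the hat", "the cat sat")
    println(letterCounts(docs, threshold = 2))
  }
}
```

With a threshold of 2, only "the" (three occurrences) and "cat" (two occurrences) survive the filter, so the letters counted are t, h, e, c, a, t. In the Spark programs the same steps run in a single application, with the intermediate word counts passed between steps instead of being written to HDFS.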

Here are three versions of the program:

• Figure 1: Scala WordCount on page 9

• Figure 2: Python WordCount on page 10

• Figure 3: Java 7 WordCount on page 10

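import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkWordCount {
  def main(args: Array[String]): Unit = {
    // create Spark context with Spark configuration
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))

    // get threshold
    val threshold = args(1).toInt

    // read in text file and split each document into words
    val tokenized = sc.textFile(args(0)).flatMap(_.split(" "))

    // count the occurrence of each word
    val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)

    // The listing breaks off here at the page boundary. The lines below are
    // a reconstruction from steps 5 and 6 above, not the original text.

    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count the occurrences of each letter in the remaining words
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    println(charCounts.collect().mkString(", "))
  }
}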
