Spark: The Definitive Guide (an ideal choice for data scientists and engineers)
"《Spark:权威指南》是一本专为数据科学家和数据工程师设计的深入指南,由Bill Chambers和Matei Zaharia共同编写,版权属于2018年的Databricks公司。这本书旨在帮助读者轻松理解并利用Apache Spark进行大数据处理。数据科学家和数据工程师虽然职责有所不同,但实际工作中这两者的界限常常模糊。数据科学家主要负责通过Spark进行交互式查询,以解答问题和构建统计模型,而数据工程师则关注编写可维护、可重复的生产应用程序,可能是为了在实践中应用数据科学家的模型,或者仅为后续分析(如构建数据导入管道)做准备。 书中详细介绍了Spark的基本概念、架构、API以及最佳实践。Spark以其高效的大数据处理能力、内存计算模型和分布式计算框架而闻名,特别适合实时分析和机器学习任务。作者们确保了内容不仅涵盖了基础操作,还深入探讨了高级特性,如DataFrame API、Spark SQL、Spark Streaming、MLlib(机器学习库)等,以及如何将Spark与其他技术(如Hadoop、Kafka等)无缝集成。 《Spark:权威指南》不仅仅是一本技术手册,它也强调了代码组织、性能优化和故障恢复等方面的重要性和策略。书中包含了大量的示例和实战项目,让读者能在实践中快速上手和提升技能。此外,该书还提供了及时的在线修订历史和错误报告链接,确保读者获取到最新、最准确的信息。 无论你是初入Spark的世界,还是希望深化理解或扩展你的技能,这本书都是不可或缺的资源。通过阅读《Spark:权威指南》,你将能全面掌握这个强大工具,推动你的数据分析和工程工作达到新的高度。"
Many applications require multiple passes over the data, and in MapReduce, each pass had to be
written as a separate MapReduce job, which had to be launched separately on the cluster and load the data from scratch.
To address this problem, the Spark team first designed an API based on functional programming
that could succinctly express multistep applications. The team then implemented this API over a
new engine that could perform efficient, in-memory data sharing across computation steps. The
team also began testing this system with both Berkeley and external users.
The first version of Spark supported only batch applications, but soon enough another
compelling use case became clear: interactive data science and ad hoc queries. By simply
plugging the Scala interpreter into Spark, the project could provide a highly usable interactive
system for running queries on hundreds of machines. The AMPlab also quickly built on this idea
to develop Shark, an engine that could run SQL queries over Spark and enable interactive use by
analysts as well as data scientists. Shark was first released in 2011.
After these initial releases, it quickly became clear that the most powerful additions to Spark
would be new libraries, and so the project began to follow the “standard library” approach it has
today. In particular, different AMPlab groups started MLlib, Spark Streaming, and GraphX.
They also ensured that these APIs would be highly interoperable, enabling writing end-to-end
big data applications in the same engine for the first time.
In 2013, the project had grown to widespread use, with more than 100 contributors from more
than 30 organizations outside UC Berkeley. The AMPlab contributed Spark to the Apache
Software Foundation as a long-term, vendor-independent home for the project. The early
AMPlab team also launched a company, Databricks, to harden the project, joining the
community of other companies and organizations contributing to Spark. Since that time, the
Apache Spark community released Spark 1.0 in 2014 and Spark 2.0 in 2016, and continues to
make regular releases, bringing new features into the project.
Finally, Spark’s core idea of composable APIs has also been refined over time. Early versions of
Spark (before 1.0) largely defined this API in terms of functional operations—parallel operations
such as maps and reduces over collections of Java objects. Beginning with 1.0, the project added
Spark SQL, a new API for working with structured data—tables with a fixed data format that is
not tied to Java’s in-memory representation. Spark SQL enabled powerful new optimizations
across libraries and APIs by understanding both the data format and the user code that runs on it
in more detail. Over time, the project added a plethora of new APIs that build on this more
powerful structured foundation, including DataFrames, machine learning pipelines, and
Structured Streaming, a high-level, automatically optimized streaming API. In this book, we will
spend a significant amount of time explaining these next-generation APIs, most of which are
marked as production-ready.
The Present and Future of Spark
Spark has been around for a number of years but continues to gain in popularity and use cases.
Many new projects within the Spark ecosystem continue to push the boundaries of what’s
possible with the system. For example, a new high-level streaming engine, Structured Streaming,
was introduced in 2016. This technology is a huge part of companies solving massive-scale data
challenges, from technology companies like Uber and Netflix using Spark’s streaming and
machine learning tools, to institutions like NASA, CERN, and the Broad Institute of MIT and
Harvard applying Spark to scientific data analysis.
Spark will continue to be a cornerstone of companies doing big data analysis for the foreseeable
future, especially given that the project is still developing quickly. Any data scientist or engineer
who needs to solve big data problems probably needs a copy of Spark on their machine—and
hopefully, a copy of this book on their bookshelf!
Running Spark
This book contains an abundance of Spark-related code, and it’s essential that you’re prepared to
run it as you learn. For the most part, you’ll want to run the code interactively so that you can
experiment with it. Let’s go over some of your options before we begin working with the coding
parts of the book.
You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in Scala and runs
on the Java Virtual Machine (JVM), so to run Spark either on your laptop or a cluster,
all you need is an installation of Java. If you want to use the Python API, you will also need a
Python interpreter (version 2.7 or later). If you want to use R, you will need a version of R on
your machine.
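If you are not sure what is already installed, a quick check from a terminal will tell you. The following is only a sketch of that kind of check, assuming java (and optionally python and R) are on your PATH:

java -version     # any recent JVM is enough to run Spark itself
python --version  # only needed if you plan to use the Python API
R --version       # only needed if you plan to use R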
There are two options we recommend for getting started with Spark: downloading and installing
Apache Spark on your laptop, or running a web-based version in Databricks Community Edition,
a free cloud environment for learning Spark that includes the code in this book. We explain both
of those options next.
Downloading Spark Locally
If you want to download and run Spark locally, the first step is to make sure that you have Java
installed on your machine (available as java), as well as a Python version if you would like to
use Python. Next, visit the project’s official download page, select the package type of “Pre-built
for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR
file, or tarball, that you will then need to extract. The majority of this book was written using
Spark 2.2, so downloading version 2.2 or later should be a good starting point.
Downloading Spark for a Hadoop cluster
Spark can run locally without any distributed storage system, such as Apache Hadoop. However,
if you would like to connect the Spark version on your laptop to a Hadoop cluster, make sure you
download the right Spark version for that Hadoop version, which can be chosen at
http://spark.apache.org/downloads.html by selecting a different package type. We discuss how
Spark runs on clusters and the Hadoop file system in later chapters, but at this point we
recommend just running Spark on your laptop to start out.
NOTE
In Spark 2.2, the developers also added the ability to install Spark for Python via pip install
pyspark. This functionality came out as this book was being written, so we weren’t able to include all
of the relevant instructions.
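If you do go the pip route, the installation itself is a single command. The following is only a sketch, assuming you have a working pip for your Python interpreter:

pip install pyspark   # installs Spark's Python package along with the pyspark launcher
pyspark               # opens the same interactive Python console described below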
Building Spark from source
We won’t cover this in the book, but you can also build and configure Spark from source. You
can select a source package on the Apache download page to get just the source and follow the
instructions in the README file for building.
After you’ve downloaded Spark, you’ll want to open a command-line prompt and extract the
package. In our case, we’re installing Spark 2.2. The following is a code snippet that you can run
on any Unix-style command line to extract the tarball you downloaded and move into the resulting
directory:
cd ~/Downloads
tar -xf spark-2.2.0-bin-hadoop2.7.tgz
cd spark-2.2.0-bin-hadoop2.7   # enter the extracted directory, not the .tgz file
Note that Spark has a large number of directories and files within the project. Don’t be
intimidated! Most of these directories are relevant only if you’re reading source code. The next
section will cover the most important directories—the ones that let us launch Spark’s different
consoles for interactive use.
Launching Spark’s Interactive Consoles
You can start an interactive shell in Spark for several different programming languages. The
majority of this book is written with Python, Scala, and SQL in mind; thus, those are our
recommended starting points.
Launching the Python console
You’ll need Python 2 or 3 installed in order to launch the Python console. From Spark’s home
directory, run the following code:
./bin/pyspark
After you’ve done that, type “spark” and press Enter. You’ll see the SparkSession object printed,
which we cover in Chapter 2.
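To check that the console is working end to end, you can run a tiny computation against that SparkSession. The snippet below is only an illustrative sketch; the myRange name is our own choice, and spark is the SparkSession object the console creates for you:

# create a DataFrame with a single column named "number" containing 1,000 rows
myRange = spark.range(1000).toDF("number")
myRange.count()  # should print 1000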
Launching the Scala console
To launch the Scala console, you will need to run the following command:
./bin/spark-shell
After you’ve done that, type “spark” and press Enter. As in Python, you’ll see the SparkSession
object, which we cover in Chapter 2.
Launching the SQL console
Parts of this book will cover a large amount of Spark SQL. For those, you might want to start the
SQL console. We’ll revisit some of the more relevant details after we actually cover these topics
in the book.
./bin/spark-sql
Running Spark in the Cloud
If you would like to have a simple, interactive notebook experience for learning Spark, you
might prefer using Databricks Community Edition. Databricks, as we mentioned earlier, is a
company founded by the Berkeley team that started Spark, and offers a free community edition
of its cloud service as a learning environment. The Databricks Community Edition includes a
copy of all the data and code examples for this book, making it easy to quickly run any of them.
To use the Databricks Community Edition, follow the instructions at
https://github.com/databricks/Spark-The-Definitive-Guide. You will be able to use Scala,
Python, SQL, or R from a web browser–based interface to run and visualize results.
Data Used in This Book
We’ll use a number of data sources in this book for our examples. If you want to run the code
locally, you can download them from the official code repository for this book as described at
https://github.com/databricks/Spark-The-Definitive-Guide. In short, you will download the data,
put it in a folder, and then run the code snippets in this book!
Chapter 2. A Gentle Introduction to Spark
Now that our history lesson on Apache Spark is complete, it's time to begin using and applying
it! This chapter presents a gentle introduction to Spark, in which we will walk through the core
architecture of a cluster, Spark Application, and Spark’s structured APIs using DataFrames and
SQL. Along the way we will touch on Spark’s core terminology and concepts so that you can
begin using Spark right away. Let’s get started with some basic background information.
Spark’s Basic Architecture
Typically, when you think of a “computer,” you think about one machine sitting on your desk at
home or at work. This machine works perfectly well for watching movies or working with
spreadsheet software. However, as many users likely experience at some point, there are some
things that your computer is not powerful enough to perform. One particularly challenging area
is data processing. Single machines do not have enough power and resources to perform
computations on huge amounts of information (or the user probably does not have the time to
wait for the computation to finish). A cluster, or group, of computers pools the resources of
many machines together, giving us the ability to use all of those cumulative resources as if they were
a single computer. A group of machines alone is not powerful enough, though; you need a framework to
coordinate work across them. Spark does just that, managing and coordinating the execution of
tasks on data across a cluster of computers.
The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like
Spark’s standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to
these cluster managers, which will grant resources to our application so that we can complete our
work.
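To make "submitting an application" a bit more concrete, the bin directory of your Spark download also contains spark-submit, which accepts a --master argument naming the cluster manager to use. The line below is only a sketch that runs one of the Python examples shipped with the Spark distribution on your local machine rather than on a real cluster:

./bin/spark-submit --master "local[*]" ./examples/src/main/python/pi.py 10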
Spark Applications
Spark Applications consist of a driver process and a set of executor processes. The driver process
runs your main() function, sits on a node in the cluster, and is responsible for three things:
maintaining information about the Spark Application; responding to a user’s program or input;
and analyzing, distributing, and scheduling work across the executors (discussed momentarily).
The driver process is absolutely essential—it’s the heart of a Spark Application and maintains all
relevant information during the lifetime of the application.
The executors are responsible for actually carrying out the work that the driver assigns them.
This means that each executor is responsible for only two things: executing code assigned to it
by the driver, and reporting the state of the computation on that executor back to the driver node.
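To make that division of labor concrete, here is a minimal sketch of a standalone PySpark program (the application name and variable names are our own). The code outside the Spark operations runs in the driver process, while the actual scan and aggregation are carried out by executors; with a local master, the driver and executors simply share your machine:

from pyspark.sql import SparkSession

# The driver starts here: it builds the SparkSession and plans the work.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("driver-and-executors") \
    .getOrCreate()

# The driver only describes this computation; executors run it partition by partition
# and report their results back to the driver, which assembles the final answer.
total = spark.range(1000000).selectExpr("sum(id)").collect()[0][0]
print(total)

spark.stop()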