Apache Spark Cheat Sheet

DZONE.COM/REFCARDZ

Apache Spark

UPDATED BY TIM SPANN BIG DATA SOLUTIONS ENGINEER, HORTONWORKS

WRITTEN BY ASHWINI KUNTAMUKKALA SOFTWARE ARCHITECT, SCISPIKE

WHY APACHE SPARK?

Apache Spark has become the engine that enhances many of the capabilities of the ever-present Apache Hadoop environment. For Big Data, Apache Spark meets a lot of needs and runs natively on Apache Hadoop’s YARN. By running Apache Spark in your Apache Hadoop environment, you gain all the security, governance, and scalability inherent to that platform. Apache Spark is also extremely well integrated with Apache Hive and gains access to all your Apache Hadoop tables utilizing integrated security.

Apache Spark has begun to really shine in the areas of streaming data processing and machine learning. With first-class support for Python as a development language, PySpark allows data scientists, engineers, and developers to develop and scale machine learning with ease. One of the features that has expanded this is the support for Apache Zeppelin notebooks to run Apache Spark jobs for exploration, data cleanup, and machine learning. Apache Spark also integrates with other important streaming tools in the Apache Hadoop space, namely Apache NiFi and Apache Kafka. I like to think of Apache Spark + Apache NiFi + Apache Kafka as the three amigos of Apache Big Data ingest and streaming. At the time of writing, the latest version of Apache Spark is 2.2.

ABOUT APACHE SPARK

Apache Spark is an open source, Hadoop-compatible, fast and expressive cluster-computing data processing engine. It was created at the AMPLab at UC Berkeley as part of the Berkeley Data Analytics Stack (BDAS). It is a top-level Apache project. The figure below shows the various components of the current Apache Spark stack.

It has six major benefits:

1. Lightning speed of computation because data are loaded in distributed memory (RAM) over a cluster of machines. Data can be quickly transformed iteratively and cached on demand for subsequent usage.

2. Highly accessible through standard APIs built in Java, Scala, Python, R, and SQL (for interactive queries) and has a rich set of machine learning libraries available out of the box.

3. Compatibility with existing Hadoop 2.x (YARN) ecosystems so companies can leverage their existing infrastructure.

4. Convenient download and installation processes. Convenient shell (REPL: Read-Eval-Print-Loop) to interactively learn the APIs.

5. Enhanced productivity due to high-level constructs that keep the focus on content of computation.

6. Multiple user notebook environments supported by Apache Zeppelin.

Spark is also implemented in Scala, so its code is succinct and fast, and it requires a JVM to run.

HOW TO INSTALL APACHE SPARK

The following table lists a few important links and prerequisites:

Current Release                 2.2.0 @ apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Downloads Page                  spark.apache.org/downloads.html
JDK Version (Required)          1.8 or higher
Scala Version (Required)        2.11 or higher
Python (Optional)               [2.7, 3.5)
Simple Build Tool (Required)    scala-sbt.org
Development Version             github.com/apache/spark
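
The prerequisites in the table can be checked with a few lines of code. The following is only a minimal sketch, assuming that the Python entry "[2.7, 3.5)" is a half-open interval (2.7 included, 3.5 excluded) and that the archive name follows the pattern shown for the current release. The helper names python_supported and release_artifact are hypothetical and are not part of Apache Spark.

```python
import sys

# Assumed reading of "[2.7, 3.5)" from the table: 2.7 <= version < 3.5.
MIN_PYTHON = (2, 7)
MAX_PYTHON_EXCLUSIVE = (3, 5)


def python_supported(version):
    """Return True if a (major, minor, ...) tuple lies in [2.7, 3.5)."""
    major_minor = tuple(version[:2])
    return MIN_PYTHON <= major_minor < MAX_PYTHON_EXCLUSIVE


def release_artifact(spark_version="2.2.0", hadoop_profile="2.7"):
    """Build the binary archive name using the pattern listed in the table."""
    return "spark-{0}-bin-hadoop{1}.tgz".format(spark_version, hadoop_profile)


if __name__ == "__main__":
    current = sys.version_info[:2]
    print("Python %d.%d in listed range: %s"
          % (current[0], current[1], python_supported(current)))
    print("Archive name: %s" % release_artifact())
```

Running the script reports whether the local interpreter falls inside the range listed for this release and prints the archive name to look for on the downloads page.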

CONTENTS

∠ WHY APACHE SPARK?
∠ ABOUT APACHE SPARK
∠ HOW TO INSTALL APACHE SPARK
∠ HOW APACHE SPARK WORKS
∠ RESILIENT DISTRIBUTED DATASET
∠ DATAFRAMES
∠ RDD PERSISTENCE
∠ SPARK SQL
∠ SPARK STREAMING
