Deep Integration of Spark and Cloudera: A Practical Guide with Apache-Licensed Code Examples

Updated 2024-07-18, 1.74 MB PDF

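Spark and Cloudera are a common pairing in big data processing. This document provides complete, clear code examples and is a useful resource for studying Apache Spark in depth. Spark was originally developed at UC Berkeley's AMPLab and open-sourced, and later became an Apache Software Foundation project. Cloudera is a company focused on the Apache Hadoop ecosystem that provides commercial support and services for Hadoop.

The Spark Guide covers Spark's core components, such as Spark SQL, Spark Streaming, MLlib (the machine learning library), and GraphX, which are used for data processing, real-time analytics, and large-scale data mining. It emphasizes the advantages of Spark's in-memory computing model, which can run faster than Hadoop MapReduce, particularly for iterative workloads.

The document's important notice states that all content is protected by the copyright of Cloudera and its suppliers or licensors and may not be copied, imitated, or used in part without prior written permission. All code examples are licensed under the Apache License 2.0, an open source license that lets users use and distribute the code freely subject to its terms. The document also notes that Hadoop and the Hadoop elephant logo are trademarks of the Apache Software Foundation, and that other brand, product, or service names mentioned are the property of their respective owners.

From this document, readers can learn how to deploy and manage a Spark cluster on the Cloudera platform and how to use the Spark APIs for data processing and analysis. Both beginners and more advanced developers can find a suitable learning path and practical examples. The document may also cover Spark performance tuning, troubleshooting, and cluster management, helping readers understand and apply Spark best practices. It is a useful reference for Spark developers and data scientists who want to process data efficiently with Spark.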

    // filter out words with fewer than threshold occurrences
    val filtered = wordCounts.filter(_._2 >= threshold)

    // count characters
    val charCounts = filtered.flatMap(_._1.toCharArray).map((_, 1)).reduceByKey(_ + _)

    System.out.println(charCounts.collect().mkString(", "))
  }

Figure 1: Scala WordCount

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Spark Count")
    sc = SparkContext(conf=conf)

    # get threshold
    threshold = int(sys.argv[2])

    # read in text file and split each document into words
    tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))

    # count the occurrence of each word
    wordCounts = tokenized.map(lambda word: (word, 1)).reduceByKey(lambda v1, v2: v1 + v2)

    # filter out words with fewer than threshold occurrences
    filtered = wordCounts.filter(lambda pair: pair[1] >= threshold)

    # count characters (flatMap over a word yields its individual characters)
    charCounts = (filtered.flatMap(lambda pair: pair[0])
                  .map(lambda c: (c, 1))
                  .reduceByKey(lambda v1, v2: v1 + v2))

    # collect the (character, count) pairs to the driver and print them
    results = charCounts.collect()
    print(repr(results)[1:-1])

Figure 2: Python WordCount
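To see what each stage of the WordCount pipeline produces without a cluster, the same steps can be traced in plain Python on an in-memory list. This is only an illustrative sketch, not part of the guide: the function name `char_counts` is hypothetical, and `collections.Counter` stands in for the `map` plus `reduceByKey` pair.

```python
from collections import Counter


def char_counts(lines, threshold):
    """Trace the WordCount stages on an in-memory list of lines (no Spark)."""
    # flatMap(line.split(" ")): flatten all lines into one list of words
    tokenized = [word for line in lines for word in line.split(" ")]

    # map to (word, 1) followed by reduceByKey: occurrences per word
    word_counts = Counter(tokenized)

    # filter: keep only words seen at least `threshold` times
    filtered = [word for word, n in word_counts.items() if n >= threshold]

    # flatMap over each surviving word, then count per character.
    # Each word contributes its characters once, however often the word appeared.
    return dict(Counter(ch for word in filtered for ch in word))


if __name__ == "__main__":
    sample = ["a rose is a rose", "is it"]
    print(char_counts(sample, 2))
```

With this sample and a threshold of 2, the words "a", "rose", and "is" survive the filter, so "s" is counted twice (once from "rose", once from "is") and every other character once. The character counts are taken over the distinct surviving words, not over every occurrence in the input.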

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
import org.apache.spark.SparkConf;
import scala.Tuple2;

public class JavaWordCount {
  public static void main(String[] args) {

    // create Spark context with Spark configuration
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Count"));

    // get threshold
    final int threshold = Integer.parseInt(args[1]);

    // read in text file and split each document into words
    // (Spark 1.x API: FlatMapFunction.call returns an Iterable; Spark 2.x expects an Iterator)
    JavaRDD<String> tokenized = sc.textFile(args[0]).flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String s) {
          return Arrays.asList(s.split(" "));
        }
      }
    );

    // count the occurrence of each word
    JavaPairRDD<String, Integer> counts = tokenized.mapToPair(
      new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) {
          return new Tuple2<String, Integer>(s, 1);
        }
      }
    ).reduceByKey(
      new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      }
    );

    // filter out words with fewer than threshold occurrences
    JavaPairRDD<String, Integer> filtered = counts.filter(
      new Function<Tuple2<String, Integer>, Boolean>() {
        public Boolean call(Tuple2<String, Integer> tup) {
          return tup._2() >= threshold;
        }
      }
    );

    // count characters
    JavaPairRDD<Character, Integer> charCounts = filtered.flatMap(
      new FlatMapFunction<Tuple2<String, Integer>, Character>() {
        @Override
        public Iterable<Character> call(Tuple2<String, Integer> s) {
          Collection<Character> chars = new ArrayList<Character>(s._1().length());
          for (char c : s._1().toCharArray()) {
            chars.add(c);
          }
          return chars;
        }
      }
    ).mapToPair(
      new PairFunction<Character, Character, Integer>() {
        @Override
        public Tuple2<Character, Integer> call(Character c) {
          return new Tuple2<Character, Integer>(c, 1);
        }
      }
    ).reduceByKey(
      new Function2<Integer, Integer, Integer>() {
        @Override
        public Integer call(Integer i1, Integer i2) {
          return i1 + i2;
        }
      }
    );

    System.out.println(charCounts.collect());
  }
}

Figure 3: Java 7 WordCount

Because Java 7 does not support anonymous functions, this Java program is considerably more verbose than Scala and Python, but still requires a fraction of the code needed in an equivalent MapReduce program. Java 8 supports anonymous functions and their use can further streamline the Java application.

Compiling and Packaging the Scala and Java Applications

The tutorial uses Maven to compile and package the Scala and Java programs. Excerpts of the tutorial pom.xml are included below. For best practices using Maven to build Spark applications, see Building Spark Applications on page 29.

Spark Guide | 11

Developing Spark Applications
