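Mastering Advanced Analytics with Apache Spark

Apache Spark is a powerful open-source big data processing framework, valued for its efficiency, ease of use, and suitability for large-scale data processing. This resource, “Mastering Advanced Analytics with Apache Spark”, focuses on the advanced analytics features of Spark 1.4 and aims to help readers understand each Spark component and what it does in different application scenarios.

Spark's core components are Spark Core, Spark SQL, Spark Streaming, MLlib (the machine learning library), and GraphX (the graph computation library). In Spark 1.4 these components were further optimized and enhanced:

1. Spark Core: As the foundation, Spark Core provides distributed task scheduling, memory management, fault recovery, and interfaces to storage systems. Version 1.4 may include performance optimizations and API improvements that raise overall processing efficiency.
2. Spark SQL: Spark's module for structured data processing, which lets users work with data through SQL queries or the DataFrame API. In 1.4 it may have strengthened support for multiple data sources, improved query performance, and broadened SQL compatibility.
3. Spark Streaming: This component processes real-time data streams using a micro-batch model. Version 1.4 may have improved stability and throughput and offered more options for connecting to data sources.
4. MLlib: Spark's machine learning library includes algorithms for classification, regression, clustering, collaborative filtering, and more. In 1.4 it may have updated algorithm implementations, improved prediction accuracy and training speed, and added tools for model tuning and evaluation.
5. GraphX: For graph data, GraphX provides efficient graph operations and algorithms. In 1.4 it may have improved graph construction, querying, and analysis, and added support for more complex graph algorithms.

In addition, the highlights from the Databricks blog may also cover the following topics:

- Best practices for data science and data analysis, including how to use Spark effectively for data exploration, cleaning, and preprocessing.
- Large-scale machine learning workflows, including data preparation, model training, validation, and deployment.
- Cluster management and resource scheduling optimizations that improve the performance of Spark applications in distributed environments.
- Containerization and cloud integration, which make Spark easier to run on platforms such as Docker or Kubernetes.
- Security and privacy protection, including how to implement data encryption and access control in Spark.

With this resource, readers can learn the key improvements in Spark 1.4 and draw on practical case studies and experience shared by industry experts to improve their big data analysis skills. Data scientists, developers, and system administrators alike can benefit from it and apply Apache Spark to complex data analysis tasks.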

Scalable Collaborative Filtering with Spark MLlib

July 23, 2014 | by Burak Yavuz, Xiangrui Meng and Reynold Xin

Recommendation systems are among the most popular applications of machine learning. The idea is to predict whether a customer would like a certain item: a product, a movie, or a song. Scale is a key concern for recommendation systems, since computational complexity increases with the size of a company’s customer base. In this blog post, we discuss how Spark MLlib enables building recommendation models from billions of records in just a few lines of Python (Scala/Java APIs also available).

What’s Happening under the Hood?

Recommendation algorithms are usually divided into:

(1) Content-based filtering: recommending items similar to what users already like. An example would be to play a Megadeth song after a Metallica song.

(2) Collaborative filtering: recommending items based on what similar users like, e.g., recommending video games after someone purchased a game console because other people who bought game consoles also bought video games.
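
The game console example in (2) can be illustrated with a simple co-purchase count. This is only a minimal sketch of the collaborative idea, not the ALS algorithm that MLlib implements; the `recommend` function, the `purchases` dictionary, and its contents are hypothetical.

```python
from collections import Counter

def recommend(purchases, item, top_n=1):
    """Return the items most often bought by users who also bought `item`."""
    counts = Counter()
    for basket in purchases.values():
        if item in basket:
            # Count every other item in the same user's basket.
            counts.update(basket - {item})
    return [other for other, _ in counts.most_common(top_n)]

# Hypothetical purchase histories, keyed by user.
purchases = {
    "alice": {"console", "video game"},
    "bob": {"console", "video game", "controller"},
    "carol": {"console", "headset"},
}

print(recommend(purchases, "console"))
```

Counting co-purchases needs no knowledge of what the items are, which is what separates it from content-based filtering.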

Spark MLlib implements a collaborative filtering algorithm called Alternating Least Squares (ALS), which has been implemented in many machine learning libraries and is widely studied and used in both academia and industry. ALS models the rating matrix (R) as the product of low-rank user (U) and product (V) factors, and learns these factors by minimizing the reconstruction error of the observed ratings. The unknown ratings can subsequently be computed by multiplying these factors. In this way, companies can recommend products based on the predicted ratings and increase sales and customer satisfaction.

from pyspark.mllib.recommendation import ALS

# load training and test data into (user, product, rating) tuples
# (sc is an existing SparkContext, e.g. the one provided by the pyspark shell)
def parseRating(line):
    fields = line.split()
    return (int(fields[0]), int(fields[1]), float(fields[2]))

training = sc.textFile("...").map(parseRating).cache()
test = sc.textFile("...").map(parseRating)

# train a recommendation model
model = ALS.train(training, rank=10, iterations=5)

# make predictions on (user, product) pairs from the test data
predictions = model.predictAll(test.map(lambda x: (x[0], x[1])))
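
To make the factor model more concrete, here is a minimal sketch in plain Python of how a rating is predicted from the factors and how the reconstruction error on the observed ratings is measured. The factor values, the rank of 2, and the names `user_factors`, `product_factors`, `predict`, and `squared_error` are all made up for illustration; in MLlib the factors are learned from the data.

```python
# Hypothetical rank-2 factors: one vector per user and one per product.
user_factors = {0: [1.0, 0.5], 1: [0.2, 1.5]}
product_factors = {10: [2.0, 1.0], 11: [0.5, 2.0]}

def predict(user, product):
    """Predicted rating = dot product of the user and product factor vectors."""
    u = user_factors[user]
    v = product_factors[product]
    return sum(a * b for a, b in zip(u, v))

def squared_error(observed):
    """Sum of squared reconstruction errors over the observed ratings only."""
    return sum((rating - predict(user, product)) ** 2
               for user, product, rating in observed)

# Observed (user, product, rating) tuples; user 0 never rated product 11.
observed = [(0, 10, 2.5), (1, 11, 3.0)]

print(predict(0, 11))           # predicted rating for an unobserved pair
print(squared_error(observed))  # the quantity that training tries to reduce
```

Only observed ratings enter the error, so the factors can still produce a prediction for any (user, product) pair that was never rated.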

ALS is an iterative algorithm. In each iteration, the algorithm alternately fixes one factor matrix and solves for the other, and this process continues until it converges. MLlib features a blocked implementation of the ALS algorithm that leverages Spark’s efficient support for distributed, iterative computation. It uses native LAPACK to achieve high performance and scales to billions of ratings on commodity clusters.
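
The alternating step can be sketched on a single machine in a few lines of plain Python. This is a simplified illustration and not MLlib's blocked implementation: it assumes rank 1, so each least-squares solve reduces to a scalar division, and it adds a small regularization term `reg` that the text above does not mention. The function names and the toy ratings are hypothetical.

```python
def als_rank1(ratings, num_iterations=10, reg=0.1):
    """Rank-1 ALS on a list of (user, product, rating) tuples."""
    users = sorted({u for u, _, _ in ratings})
    products = sorted({p for _, p, _ in ratings})
    user_f = {u: 1.0 for u in users}
    prod_f = {p: 1.0 for p in products}
    for _ in range(num_iterations):
        # Fix the product factors; solve a 1-D least-squares problem per user.
        for u in users:
            num = sum(r * prod_f[p] for uu, p, r in ratings if uu == u)
            den = sum(prod_f[p] ** 2 for uu, p, _ in ratings if uu == u) + reg
            user_f[u] = num / den
        # Fix the user factors; solve a 1-D least-squares problem per product.
        for p in products:
            num = sum(r * user_f[u] for u, pp, r in ratings if pp == p)
            den = sum(user_f[u] ** 2 for u, pp, _ in ratings if pp == p) + reg
            prod_f[p] = num / den
    return user_f, prod_f

def sse(ratings, user_f, prod_f):
    """Sum of squared errors between observed and reconstructed ratings."""
    return sum((r - user_f[u] * prod_f[p]) ** 2 for u, p, r in ratings)

# Hypothetical toy data: (user, product, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]

ones_u = {u: 1.0 for u in range(3)}
ones_p = {p: 1.0 for p in range(3)}
initial_error = sse(ratings, ones_u, ones_p)

user_f, prod_f = als_rank1(ratings)
final_error = sse(ratings, user_f, prod_f)
print(initial_error, final_error)
```

With a higher rank, each scalar division becomes a small linear system per user or per product, which is the kind of dense linear algebra that a library such as LAPACK handles.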

Scalability, Performance, and Stability

Recently we ran an experiment to benchmark ALS implementations in Spark MLlib at scale. The benchmark was conducted on EC2 using m3.2xlarge instances set up by the Spark EC2 script. We ran Spark using out-of-the-box configurations. To help understand the state of the art, we also built Mahout from GitHub and tested it. This benchmark is reproducible on EC2 using the scripts at https://github.com/databricks/als-benchmark-scripts.

We ran 5 iterations of ALS on scaled copies of the Amazon Reviews dataset, which contains 35 million ratings collected from 6.6 million users on 2.4 million products. For each user, we create pseudo-users that have the same ratings. That is, for every rating (userId, productId, rating), we generate (userId+i, productId, rating) where 0 <= i < s and s is the scaling factor.
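
The scaling rule can be sketched as a small generator. The name `scale_ratings` and the `stride` parameter are hypothetical: `stride=1` follows the formula above literally, while a larger stride is one way (an assumption, not something the post states) to keep pseudo-user IDs from colliding with existing user IDs.

```python
def scale_ratings(ratings, s, stride=1):
    """Yield s copies of every rating, shifting the user ID by i * stride."""
    for user_id, product_id, rating in ratings:
        for i in range(s):  # 0 <= i < s
            yield (user_id + i * stride, product_id, rating)

# Hypothetical sample ratings as (userId, productId, rating) tuples.
ratings = [(1, 100, 5.0), (2, 200, 3.0)]
scaled = list(scale_ratings(ratings, s=3))

for row in scaled:
    print(row)
```

On an RDD of rating tuples, the same expansion could be written with `flatMap`, for example `ratings_rdd.flatMap(lambda r: scale_ratings([r], s))`, where `ratings_rdd` and `s` are placeholders.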

The current version of Mahout runs on Hadoop MapReduce, whose scheduling overhead and lack of support for iterative computation substantially slow down ALS. Mahout recently announced switching to Spark as the execution engine, which will hopefully address the performance concerns.

Spark MLlib demonstrated excellent performance and scalability, as shown in the chart above. MLlib can also scale to much larger datasets and to a larger number of nodes, thanks to its fault-tolerant design. With 50 nodes, we ran 10 iterations of MLlib’s ALS on 100 copies of the Amazon Reviews dataset in only 40 minutes. And with EC2 spot instances the total cost was less than $2. Users can use Spark MLlib to reduce the model training time and the cost of ALS, which has historically been very expensive to run because the algorithm is very communication intensive and computation intensive.
