Mastering Spark SQL for Architects: A Practical Tutorial on Real-Time Analytics and Machine Learning

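Learning Spark SQL: Architect streaming analytics and machine learning solutions, written by Aurobindo Sarkar, is a professional book for practitioners who want a deeper understanding of Apache Spark SQL in real-time data processing and machine learning. It focuses on the advanced features of Spark SQL, particularly its use in streaming analytics and in building efficient, scalable machine learning models.

Spark SQL is part of the Apache Spark ecosystem. It combines the SQL language with Spark's data processing capabilities so that data scientists and developers can conveniently manipulate and analyze large datasets. The book explains in detail how to use Spark SQL to clean, transform, load, and query data, and it discusses DStreams (discretized streams), the core abstraction of Spark Streaming for processing and analyzing data in real time. The author also covers combining Spark SQL with machine learning technology such as MLlib (Spark's machine learning library) for tasks including predictive analytics, classification, and clustering. Readers learn practical techniques for training models in a Spark environment, evaluating them, and deploying them to production.

The book aims to provide a learning path from fundamentals to advanced topics. It suits readers who have a basic understanding of Spark and want to improve their skills in real-time data analysis and machine learning.

The copyright notice states that no part of the book may be reproduced, stored, or transmitted in any form without the written permission of the publisher, Packt Publishing. Although the author and publisher have made every effort to ensure accuracy, the information in the book is sold as is, without warranty, either express or implied. The book was published by Packt Publishing in August 2017. It also includes trademark information about the companies and products it mentions; the publisher has tried to keep this information accurate but cannot guarantee it.

The book is a comprehensive guide to the Spark SQL technology stack and a valuable resource for professionals who want to succeed in this fast-moving field.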

Developing a machine learning application
Summary
10.  Using Spark SQL in Deep Learning Applications
Introducing neural networks
Understanding deep learning
Understanding representation learning
Understanding stochastic gradient descent
Introducing deep learning in Spark
Introducing CaffeOnSpark
Introducing DL4J
Introducing TensorFrames
Working with BigDL
Tuning hyperparameters of deep learning models
Introducing deep learning pipelines
Understanding Supervised learning
Understanding convolutional neural networks
Using neural networks for text classification
Using deep neural networks for language processing
Understanding Recurrent Neural Networks
Introducing autoencoders
Summary
11.  Tuning Spark SQL Components for Performance
Introducing performance tuning in Spark SQL
Understanding DataFrame/Dataset APIs
Optimizing data serialization
Understanding Catalyst optimizations
Understanding the Dataset/DataFrame API
Understanding Catalyst transformations
Visualizing Spark application execution
Exploring Spark application execution metrics
Using external tools for performance tuning
Cost-based optimizer in Apache Spark 2.2
Understanding the CBO statistics collection
Statistics collection functions
Filter operator
Join operator
Build side selection
Understanding multi-way JOIN ordering optimization
Understanding performance improvements using whole-stage code generation
Summary
12.  Spark SQL in Large-Scale Application Architectures
Understanding Spark-based application architectures
Using Apache Spark for batch processing
Using Apache Spark for stream processing
Understanding the Lambda architecture
Understanding the Kappa Architecture
Design considerations for building scalable stream processing applications
Building robust ETL pipelines using Spark SQL
Choosing appropriate data formats
Transforming data in ETL pipelines
Addressing errors in ETL pipelines
Implementing a scalable monitoring solution
Deploying Spark machine learning pipelines
Understanding the challenges in typical ML deployment environments
Understanding types of model scoring architectures
Using cluster managers
Summary

What this book covers

Chapter 1, Getting Started with Spark SQL, gives you an overview of Spark SQL while getting you comfortable with the Spark environment through hands-on sessions.

Chapter 2, Using Spark SQL for Processing Structured and Semistructured Data, will help you use Spark to work with a relational database (MySQL), NoSQL database (MongoDB), semistructured data (JSON), and data storage formats commonly used in the Hadoop ecosystem (Avro and Parquet).

Chapter 3, Using Spark SQL for Data Exploration, demonstrates the use of Spark SQL to explore datasets, perform basic data quality checks, generate samples and pivot tables, and visualize data with Apache Zeppelin.

Chapter 4, Using Spark SQL for Data Munging, uses Spark SQL for performing some basic data munging/wrangling tasks. It also introduces you to a few techniques to handle missing data, bad data, duplicate records, and so on.

Chapter 5, Using Spark SQL in Streaming Applications, provides a few examples of using Spark SQL DataFrame/Dataset APIs to build streaming applications. It also shows how to use Kafka in structured streaming applications.

Chapter 6, Using Spark SQL in Machine Learning Applications, focuses on using Spark SQL in machine learning applications. In this chapter, we will mainly explore the key concepts in feature engineering and implement machine learning pipelines.

Chapter 7, Using Spark SQL in Graph Applications, introduces you to GraphFrame applications. It provides examples of using Spark SQL DataFrame/Dataset APIs to build graph applications and apply various graph algorithms in them.

Chapter 8, Using Spark SQL with SparkR, covers the SparkR architecture and SparkR DataFrames API. It provides code examples for using SparkR for Exploratory Data Analysis (EDA) and data munging tasks, data visualization, and machine learning.

Chapter 9, Developing Applications with Spark SQL, helps you build Spark applications using a mix of Spark modules. It presents examples of applications that combine Spark SQL with Spark Streaming, Spark Machine Learning, and so on.

Chapter 10, Using Spark SQL in Deep Learning Applications, introduces you to deep learning in Spark. It covers the basic concepts of a few popular deep learning models before you delve into working with BigDL and Spark.

Chapter 11, Tuning Spark SQL Components for Performance, presents you with the foundational concepts related to tuning a Spark application, including data serialization using encoders. It also covers the key aspects of the cost-based optimizer introduced in Spark 2.2 to optimize Spark SQL execution automatically.
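
To give a rough feel for what "cost-based" means in the Chapter 11 topics (statistics collection, the Filter operator, build side selection), here is a minimal toy sketch in plain Python. It is not Spark code and not the book's example: `TableStats`, `estimate_equality_filter`, and `choose_build_side` are hypothetical names, and the uniform-distribution selectivity rule is a simplifying assumption. In Spark itself, the statistics are gathered with commands such as `ANALYZE TABLE ... COMPUTE STATISTICS`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableStats:
    """Hypothetical per-table statistics a cost-based optimizer might keep."""
    name: str
    row_count: int
    avg_row_bytes: int

    @property
    def size_bytes(self) -> int:
        # Estimated total size of the table (or intermediate result) in bytes.
        return self.row_count * self.avg_row_bytes


def estimate_equality_filter(stats: TableStats, distinct_values: int) -> TableStats:
    """Estimate the output of `col = literal`, assuming values are uniformly distributed."""
    if distinct_values <= 0:
        raise ValueError("distinct_values must be positive")
    # Under the uniformity assumption, each distinct value matches row_count / ndv rows.
    estimated_rows = max(1, stats.row_count // distinct_values)
    return TableStats(stats.name, estimated_rows, stats.avg_row_bytes)


def choose_build_side(left: TableStats, right: TableStats) -> str:
    """Pick the input with the smaller estimated size as the hash join build side."""
    return left.name if left.size_bytes <= right.size_bytes else right.name


# Toy inputs: a large fact table and a much smaller dimension table (made-up numbers).
orders = TableStats("orders", row_count=10_000_000, avg_row_bytes=100)
customers = TableStats("customers", row_count=50_000, avg_row_bytes=200)

# A selective equality filter shrinks the estimated size of `orders` dramatically.
filtered_orders = estimate_equality_filter(orders, distinct_values=100_000)

print(choose_build_side(orders, customers))           # without the filter
print(choose_build_side(filtered_orders, customers))  # with the filter applied first
```

The point of the sketch is that the decision flips once the filter estimate is taken into account: a rule that only looks at base table sizes would always build on `customers`, while an estimate-driven choice can build on the filtered `orders` instead.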

Chapter 12, Spark SQL in Large-Scale Application Architectures, teaches you to identify the use cases where Spark SQL can be used in large-scale application architectures to implement typical functional and non-functional requirements.
