Preface
However, MapReduce has some important shortcomings, including a high overhead
for launching each job and a reliance on writing intermediate data and results of the
computation to disk, both of which make Hadoop relatively ill-suited for use cases of
an iterative or low-latency nature. Apache Spark is a new framework for distributed
computing that is designed from the ground up to be optimized for low-latency
tasks and to store intermediate data and results in memory, thus addressing some of
the major drawbacks of the Hadoop framework. Spark provides a clean, functional,
and easy-to-understand API to write applications and is fully compatible with the
Hadoop ecosystem.
Furthermore, Spark provides native APIs in Scala, Java, and Python. The Scala and
Python APIs allow all the benefits of the Scala or Python language, respectively,
to be used directly in Spark applications, including using the relevant interpreter
for real-time, interactive exploration. Spark itself now provides a toolkit (called
MLlib) of distributed machine learning and data mining models that is under heavy
development and already contains high-quality, scalable, and efficient algorithms for
many common machine learning tasks, some of which we will delve into in this book.
Applying machine learning techniques to massive datasets is challenging, primarily
because most well-known machine learning algorithms are not designed for parallel
architectures. In many cases, designing such algorithms is not an easy task. The
nature of machine learning models is generally iterative, hence the strong appeal
of Spark for this use case. While there are many competing frameworks for parallel
computing, Spark is one of the few that combines speed, scalability, in-memory
processing, and fault tolerance with ease of programming and a flexible, expressive,
and powerful API design.
Throughout this book, we will focus on real-world applications of machine learning
technology. While we may briefly delve into some theoretical aspects of machine
learning algorithms, the book will generally take a practical, applied approach with
a focus on using examples and code to illustrate how to effectively use the features
of Spark and MLlib, as well as other well-known and freely available packages for
machine learning and data analysis, to create a useful machine learning system.
What this book covers
Chapter 1, Getting Up and Running with Spark, shows how to install and set up a local
development environment for the Spark framework as well as how to create a Spark
cluster in the cloud using Amazon EC2. It also introduces the Spark programming
model and API and walks through a simple Spark application written in each of Scala,
Java, and Python.