Large Scale Machine Learning with Python in Practice

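"Large Scale Machine Learning with Python" looks at how to use Python to process big data and build efficient, scalable machine learning models. Python is popular in data science for its rich libraries and readability, which makes it well suited to large-scale machine learning projects.

First, libraries such as NumPy, Pandas, and SciPy provide efficient numerical computing and data analysis, which makes it possible to process large-scale data. NumPy is the foundational package for scientific computing in Python and provides multidimensional array objects and matrix operations. Pandas provides DataFrame, a flexible data structure for data cleaning and preprocessing. SciPy contains a set of tools for optimization, statistics, and signal processing.

Next, Scikit-Learn is the main machine learning library in Python. It provides algorithms for training models such as linear regression, logistic regression, support vector machines (SVM), decision trees, and random forests, as well as clustering algorithms. For large datasets, Scikit-Learn also supports partial fitting, that is, training in batches, which allows large samples to be processed when memory is limited.

Beyond that, Apache Spark can be used for distributed computing on large-scale data. Spark provides the PySpark interface, which lets Python developers use its in-memory computing framework for machine learning. Spark's MLlib library includes a variety of machine learning algorithms and integrates with the Hadoop Distributed File System (HDFS) and other big data storage systems.

Deep learning is an important branch of machine learning, and the TensorFlow and Keras libraries make it easier to build deep neural networks in Python. TensorFlow is an open source library that supports defining, training, and deploying complex computational models, while Keras is a high-level neural network API that simplifies the use of TensorFlow and enables rapid prototyping and experimentation.

In practice, data preprocessing also has to be considered, including feature selection, normalization, missing-value handling, and outlier detection. Scikit-Learn tools such as FeatureHasher, PCA (principal component analysis), and OneHotEncoder help with these tasks.

Finally, when deploying a large-scale machine learning model, a web framework such as Flask or Django can wrap the model as an API for use in production. Monitoring and evaluating model performance is also essential: Matplotlib and Seaborn can be used to visualize results, and tools such as ModelDB or MLflow help manage model versions and experiments.

In summary, Python provides a complete toolchain, from data processing and model training to model deployment, that supports large-scale machine learning projects. By mastering these tools and techniques, data scientists and machine learning engineers can build powerful predictive applications and meet a wide range of big data challenges.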


Preface

"The nice thing about having a brain is that one can learn, that ignorance can be supplanted by knowledge, and that small bits of knowledge can gradually pile up into substantial heaps."

– Douglas Hofstadter

Machine learning is often referred to as the part of artificial intelligence that actually works. Its aim is to find a function based on an existing set of data (training set) in order to predict outcomes of a previously unseen dataset (test set) with the highest possible correctness. This occurs either in the form of labels and classes (classification problems) or in the form of a continuous value (regression problems). Tangible examples of machine learning in real-life applications range from predicting future stock prices to classifying the gender of an author from a set of documents.
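The idea of finding a function on a training set and checking it on unseen data can be sketched in a few lines. The example below is only an illustration, assuming Python 3 and a one-variable regression problem; the data values and the helper names `fit_line` and `mean_squared_error` are invented for this sketch and do not come from the book.

```python
# Toy data, invented for illustration: y is roughly 2 * x + 1.
train = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8), (4.0, 9.1)]
test = [(5.0, 11.0), (6.0, 13.1)]  # previously unseen examples


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared difference between predictions and true values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Learn the function on the training set only.
slope, intercept = fit_line(train)

# Judge it on the test set, which played no part in the fit.
test_mse = mean_squared_error(test, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MSE={test_mse:.4f}")
```

A classification problem follows the same pattern, with class labels in place of the continuous target and an error rate in place of the squared error.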

Throughout this book, the most important machine learning concepts, together with methods suitable for larger datasets, will be made clear to the reader through practical examples in Python. We will look at supervised learning (classification and regression), as well as unsupervised learning methods (such as Principal Component Analysis (PCA), clustering, and topic modeling) that have been found to be applicable to larger datasets.

Large IT corporations such as Google, Facebook, and Uber have generated a lot of buzz by claiming that they successfully applied such machine learning methods at a large scale. With the onset and availability of big data, the demand for scalable machine learning solutions has grown exponentially, and many other companies and individuals have started aspiring to reap the fruits of hidden correlations in big datasets. Unfortunately, most learning algorithms don't scale well, straining CPUs and memory whether on a desktop computer or on a larger computing cluster. Even now that big data has passed the peak of its hype, scalable machine learning solutions are not plentiful.


Frankly, we still need to work around a lot of bottlenecks even with datasets we would hardly categorize as big data (think of datasets up to 2 GB or even smaller). The mission of this book is to provide methods (and sometimes unconventional ones) to apply the most powerful open source machine learning methods at a larger scale, without the need for expensive enterprise solutions or large computing clusters.

Throughout this book, we will use Python and some other readily available solutions that integrate well in scalable machine learning pipelines. Reading the book is a journey that will redefine what you knew about machine learning, setting you on the starting blocks of real big data analysis.

What this book covers

Chapter 1, First Steps to Scalability, puts the problem of scalable machine learning in the right perspective and familiarizes you with the tools that we will be using in this book.

Chapter 2, Scalable Learning in Scikit-learn, discusses strategies for stochastic gradient descent (SGD) that reduce memory consumption; it is based on the theme of out-of-core learning. We will also cover data preparation techniques, such as the hashing trick, that can handle a variety of data.
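To make the two ideas named for Chapter 2 concrete, here is a minimal sketch that combines them, assuming Python 3 and the standard library only. It is not the book's code and does not use Scikit-learn: the logistic-regression update, the bucket count `N_BUCKETS`, the toy stream, and every function name are assumptions chosen for illustration. Examples are consumed one at a time, so only the weight vector stays in memory, and tokens are mapped to feature indices by a hash function instead of a stored vocabulary.

```python
import math
import zlib

N_BUCKETS = 2 ** 18  # size of the hashed feature space (illustrative choice)


def hash_features(tokens, n_buckets=N_BUCKETS):
    """Hashing trick: map each token to a bucket index without a vocabulary."""
    counts = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


def predict_proba(weights, features):
    """Logistic model: probability that the example belongs to class 1."""
    score = sum(weights[i] * value for i, value in features.items())
    score = max(-30.0, min(30.0, score))  # keep exp() within a safe range
    return 1.0 / (1.0 + math.exp(-score))


def sgd_update(weights, features, label, learning_rate=0.1):
    """One stochastic gradient step on the logistic loss for one example."""
    error = predict_proba(weights, features) - label
    for i, value in features.items():
        weights[i] -= learning_rate * error * value


def stream_examples():
    """Stand-in for reading one record at a time from a file too big for RAM."""
    data = [
        ("cheap pills buy now", 1),
        ("meeting agenda for monday", 0),
        ("buy cheap watches now", 1),
        ("project status and agenda", 0),
    ]
    for _ in range(50):  # several passes over the toy stream
        for text, label in data:
            yield text.split(), label


weights = [0.0] * N_BUCKETS
for tokens, label in stream_examples():
    sgd_update(weights, hash_features(tokens), label)

spam_score = predict_proba(weights, hash_features("buy cheap pills".split()))
ham_score = predict_proba(weights, hash_features("monday meeting agenda".split()))
print(f"spam-like text: {spam_score:.3f}, ordinary text: {ham_score:.3f}")
```

Memory use is fixed by `N_BUCKETS` and does not grow with the number of examples or distinct tokens. The cost is that two tokens can occasionally share a bucket.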

Chapter 3, Fast-Learning SVMs, covers streaming algorithms that are capable of discovering non-linearity in the form of support vector machines. We will present alternatives to Scikit-learn, such as LIBLINEAR and Vowpal Wabbit, which, although operating as external shell commands, can be easily wrapped and directed by Python scripts.
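Wrapping an external command from a Python script can be sketched with the standard `subprocess` module. The example below assumes Python 3.7 or later and is not the book's code. So that it runs without installing anything, the Python interpreter itself stands in for the external learner; the helper name `run_external_tool` and the stand-in command are hypothetical, and a real tool would be called with its own executable and documented flags.

```python
import subprocess
import sys


def run_external_tool(command, input_text):
    """Run a command-line program, send it text on stdin, return its stdout."""
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Stand-in for an external learner: a one-line program that counts input lines.
stand_in_command = [
    sys.executable,
    "-c",
    "import sys; print(sum(1 for _ in sys.stdin))",
]

output = run_external_tool(stand_in_command, "example one\nexample two\n")
print(output.strip())
```

Passing the command as a list avoids shell quoting problems, and `check=True` turns a failed run into a Python exception that the calling script can handle.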

Chapter 4, Neural Networks and Deep Learning, provides useful tactics for applying deep neural networks within the Theano framework together with large-scale applications with H2O. Even though deep learning is a hot topic, it can be quite a challenge to apply it successfully, let alone provide scalable solutions. We will also resort to unsupervised pre-training with autoencoders using the theanets package.

Chapter 5, Deep Learning with TensorFlow, covers interesting deep learning techniques together with an online method for neural networks. Although TensorFlow is only in its infancy, the framework provides elegant machine learning solutions. We will also utilize the convolutional neural network capabilities of Keras within the TensorFlow environment.

Chapter 6, Classication and Regression Trees at Scale, explains scalable solutions for

random forest, gradient boosting, and XGboost. CART, an acronym for classication

and regression trees, is a machine learning method usually applied in the framework

of ensemble methods. We will also provide examples of a large-scale application

using H2O.


Chapter 7, Unsupervised Learning at Scale, dives into unsupervised learning, as we will cover PCA, cluster analysis, and topic modeling using the right approach for scaling them up.

Chapter 8, Distributed Environments – Hadoop and Spark, teaches us how to set up Spark within a virtual machine environment, shifting from a single machine to a computational network paradigm. As Python can easily glue and power up our efforts on a cluster of machines, it becomes a piece of cake to leverage the power of a Hadoop cluster.

Chapter 9, Practical Machine Learning with Spark, gets into action with Spark, teaching all the essentials for starting immediately to manipulate data and build predictive models on large datasets.

Appendix, Introduction to GPUs and Theano, will cover the basics of Theano and GPU computation. It will help you install and prepare your environment for using Theano on the GPU, if your system allows it.

What you need for this book

The execution of the code examples provided in this book requires an installation of Python 2.7 or higher versions on macOS, Linux, or Microsoft Windows.

The examples throughout the book will make frequent use of Python's essential libraries, such as SciPy, NumPy, Scikit-learn, and StatsModels, and to a minor extent, matplotlib and pandas, for scientific and statistical computing. We will also make use of an out-of-core cloud computing application called H2O.

This book is highly dependent on Jupyter and its Notebooks powered by the Python kernel. We will use its most recent version, 4.1, for this book.

The rst chapter will provide you with all the step-by-step instructions and some

useful tips to set up your Python environment, these core libraries, and all the

necessary tools.

Who this book is for

This book is suitable for aspiring and actual data science practitioners, developers, and everyone who intends to work with large and complex datasets. We strive to make this book as accessible as possible to a wider audience. Yet, considering that the topics in this book are quite advanced, it is recommended, but not strictly compulsory, that readers are familiar with basic machine learning concepts such as classification and regression, error minimizing functions, and cross-validation.
