Exploring Ensemble Methods for Machine Learning


"本书《Ensemble Methods for Machine Learning》由Gautam Kunapuli撰写，主要探讨了机器学习中的集成方法。" 集成学习（Ensemble Learning）是一种通过结合多个预测模型来提高整体预测性能的方法。它利用多个学习器的集体智慧，以减少过拟合、增加泛化能力和提高准确性。在本书中，作者详细介绍了两种主要的集成策略：同质化并行（Parallel Homogeneous Ensembles）和异质化并行（Parallel Heterogeneous Ensembles）。 **同质化并行集成**（Parallel Homogeneous Ensembles）：这种策略的核心是使用相同的基础机器学习算法训练多个强学习器，但通过随机数据或特征抽样来创建每个基模型的多样性。例如： 1. **Bagging（Bootstrap Aggregating）**：通过自助采样法训练多个决策树，降低过拟合风险。 2. **Random Forests**：进一步扩展了Bagging，每个树在构建时随机选择特征，增加多样性。 3. **Pasting**：类似于Bagging，但不同的是，它允许部分重叠的子样本。 4. **Random Subspaces**：随机选取特征子集构建决策树，增加多样性。 5. **Random Patches**：在输入空间的特定区域上构建决策树。 6. **Extremely Randomized Trees (ExtraTrees)**：在分裂节点时随机选取最优特征，提高效率。 **异质化并行集成**（Parallel Heterogeneous Ensembles）：这种方法涉及使用不同的基础学习算法训练多个模型，然后通过不同的预测聚合方式将它们结合。比如： 1. **Majority Voting**：简单多数投票，每个学习器独立预测，最后取多数决定。 2. **Entropy-based Prediction Weighting**：基于熵的预测权重，根据模型的预测不确定性分配权重。 3. **Dempster-Shafer Prediction Fusion**：应用Dempster-Shafer理论来融合不确定性的证据。 4. **Meta-learning for Stacking and Blending**：元学习，通过一个学习器来学习其他学习器的预测结果，形成混合预测。书中还提到了如逻辑回归、决策树和多层感知机等基础学习算法在集成学习中的应用。浅层决策树和深度学习模型等也被用作构建多样化集成的组成部分。通过这些方法，集成学习能够克服单个模型可能存在的局限性，实现更强大、更稳健的预测能力。无论是在分类任务还是回归任务中，集成学习都已被证明是一种有效的技术，广泛应用于各种复杂的数据问题中。

ACKNOWLEDGMENTS

and some truly terrific insights and comments. I tried to take in all of your advice (I really did), and much of it has worked its way into the book.

To the readers who read the book during early access and who left many comments, corrections, and words of encouragement (you know who you are): thank you for the support!

To my mentors, Kristin Bennett, Jong-Shi Pang, Jude Shavlik, Sriraam Natarajan, and Maneesh Singh, who have each shaped my thinking profoundly at different stages of my journey as a student, postdoc, professor, and professional: thank you for teaching me how to think in machine learning, how to speak machine learning, and how to build with machine learning. Much of your wisdom and many of your lessons endure in this book. And Kristin, I hope you like the title of the first chapter.

To Jenny and Guilherme de Oliveira, for your friendship over the years, but especially during the great pandemic, when much of this book was written: thank you for keeping me sane. I will always treasure our afternoons and evenings in that summer and fall of 2020, tucked away in your little backyard, our pod and sanctuary.

To my parents, Vijaya and Shivakumar, and my brother, Anupam: thank you for always believing in me, and for always supporting me, even from tens of thousands of miles away. I know you’re proud of me. This book is finally finished, and now we can do all those other things we’re always talking about . . . until I start writing the next one, anyway.

To my wife, best friend, and biggest champion, Kristine: you’ve been an inexhaustible source of comfort and encouragement, especially when things got tough. Thank you for bouncing ideas with me, for proofreading with me, for the tea and snacks, for the Gus, for sacrificing all those weekends (and, sometimes, weeknights) when I was writing. Thank you for hanging in there with me, for always being there for me, and for never once doubting that I could do this. I love you!

ABOUT THIS BOOK

• MLOps and DataOps engineers who are building, evaluating, and deploying ensemble-based, production-ready applications and pipelines
• Students of data science and machine learning who want to use this book as a learning resource or as a practical reference to supplement textbooks
• Kagglers and data science enthusiasts who can use this book as an entry point into learning about the endless modeling possibilities with ensemble methods

This book is not an introduction to machine learning and data science. This book assumes that you have some basic working knowledge of machine learning and that you’ve used or played around with at least one fundamental learning technique (e.g., decision trees).

A basic working knowledge of Python is also assumed. Examples, visualizations, and chapter case studies all use Python and Jupyter Notebooks. Knowledge of other commonly used Python packages such as NumPy (for mathematical computations), pandas (for data manipulation), and Matplotlib (for visualization) is useful, but not necessary. In fact, you can learn how to use these packages through the examples and case studies.

How this book is organized: A road map

This book is organized into nine chapters in three parts. Part 1 is a gentle introduction to ensemble methods, part 2 introduces and explains several essential ensemble methods, and part 3 covers advanced topics.

Part 1, “The basics of ensembles,” introduces ensemble methods and why you should care about them. This part also contains a road map of ensemble methods covered in the rest of the book:

• Chapter 1 discusses ensemble methods and basic ensemble terminology. It also introduces the fit-versus-complexity tradeoff (or the bias-variance tradeoff, as it’s more formally called). You’ll build your very first ensemble in this chapter.

Part 2, “Essential ensemble methods,” covers several important families of ensemble methods, many of which are considered “essential” and are widely used in real-world applications. In each chapter, you’ll learn how to implement different ensemble methods from scratch, how they work, and how to apply them to real-world problems:

• Chapter 2 begins our journey with parallel ensemble methods, specifically, parallel homogeneous ensembles. Ensemble methods covered include bagging, random forests, pasting, random subspaces, random patches, and Extra Trees.
• Chapter 3 continues the journey with more parallel ensembles, but the focus in this chapter is on parallel heterogeneous ensembles. Ensemble methods covered include combining base models by majority voting, combining by weighting, prediction fusion with Dempster-Shafer, and meta-learning by stacking.
• Chapter 4 introduces another family of ensemble methods, sequential adaptive ensembles, and in particular the fundamental concept of boosting many weak models into one powerful model. Ensemble methods covered include AdaBoost and LogitBoost.
• Chapter 5 builds on the foundational concepts of boosting and covers another fundamental sequential ensemble method, gradient boosting, which combines gradient descent with boosting. This chapter discusses how we can train gradient-boosting ensembles with scikit-learn and LightGBM.
• Chapter 6 continues to explore sequential ensemble methods with Newton boosting, an efficient and effective extension of gradient boosting that combines Newton’s descent with boosting. This chapter discusses how we can train Newton boosting ensembles with XGBoost.
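
To make the idea of "boosting many weak models into one powerful model" more concrete, the following is a minimal from-scratch AdaBoost sketch in plain Python. It is not one of the book's listings: the one-dimensional toy data, the threshold "stump" weak learner, and all function names are assumptions chosen for illustration.

```python
import math

def stump_predict(x, threshold, sign):
    """Weak model: predict `sign` if x > threshold, otherwise `-sign`."""
    return sign if x > threshold else -sign

def fit_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the best stump."""
    best = None
    for threshold in xs:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best

def adaboost(xs, ys, rounds):
    """Train `rounds` stumps sequentially, reweighting examples each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # this stump's vote weight
        ensemble.append((alpha, threshold, sign))
        # Misclassified examples gain weight; correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps; labels are +1 / -1."""
    score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate: + + - - + +
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy data a single stump always misclassifies at least two points, while the weighted vote of three stumps reproduces all six labels. Gradient boosting and Newton boosting replace the example reweighting step with fits to loss gradients (and second derivatives), which this sketch does not attempt to show.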

Part 3, “Ensembles in the wild: Adapting ensemble methods to your data,” shows you how to apply ensemble methods to many scenarios, including data sets with continuous and count-valued labels and data sets with categorical features. You’ll also learn how to interpret your ensembles and explain their predictions:

• Chapter 7 shows how we can train ensembles for different types of regression problems and generalized linear models, where training labels are continuous- or count-valued. Parallel and sequential ensembles for linear regression, Poisson regression, gamma regression, and Tweedie regression are covered.
• Chapter 8 identifies challenges in learning with nonnumeric features, specifically, categorical features, and encoding schemes that will help us train effective ensembles for this kind of data. This chapter also discusses two important practical issues: data leakage and prediction shift. Finally, we’ll see how to overcome these issues with ordered boosting and CatBoost.
• Chapter 9 covers the newly emerging and very important topic of explainable AI from the perspective of ensemble methods. This chapter introduces the notion of explainability and why it’s important. Several common black-box explainability methods are also discussed, including permutation feature importance, partial dependence plots, surrogate methods, Local Interpretable Model-Agnostic Explanations (LIME), Shapley values, and SHapley Additive exPlanations (SHAP). The glass-box ensemble method, explainable boosting machines, and the InterpretML package are also introduced.
• The epilogue concludes our journey with additional topics for further exploration and reading.
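
Of the black-box explainability methods listed for chapter 9, permutation feature importance is simple enough to sketch in a few lines. The version below is a from-scratch illustration in plain Python, not the book's code: the stand-in `model` (which deliberately ignores its second feature), the synthetic data, and the helper names are all assumptions.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, column, rng, repeats=5):
    """Average drop in accuracy after shuffling one feature column."""
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted = [r[:column] + (v,) + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats

# Hypothetical black-box model: uses feature 0 only and ignores feature 1.
def model(row):
    return 1 if row[0] > 0.5 else 0

rng = random.Random(42)
rows = [(rng.random(), rng.random()) for _ in range(200)]
labels = [model(r) for r in rows]

imp0 = permutation_importance(model, rows, labels, 0, rng)
imp1 = permutation_importance(model, rows, labels, 1, rng)
print(imp0, imp1)
```

Shuffling the feature the model depends on lowers its accuracy, while shuffling the ignored feature leaves every prediction unchanged, so its importance is zero. The same procedure applies to any trained ensemble, because it only needs the model's predictions.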

While most of the chapters in the book can reasonably be read in a standalone manner, chapters 7, 8, and 9 build on part 2 of the book.

About the code

All the code and examples in this book are written in Python 3. The code is organized into Jupyter Notebooks and is available in an online GitHub repository (https://github.com/gkunapuli/ensemble-methods-notebooks) and for download from the Manning website (www.manning.com/books/ensemble-methods-for-machine-learning). You can get executable snippets of code from the liveBook (online) version of this book at https://livebook.manning.com/book/ensemble-methods-for-machine-learning.

Several Python scientific and visualization libraries are also used, including NumPy (https://numpy.org/), SciPy (https://scipy.org/), pandas (https://pandas.pydata.org/), and Matplotlib (https://matplotlib.org/). The code also uses several Python machine-learning and ensemble-method libraries, including scikit-learn (https://scikit-learn.org/stable/), LightGBM (https://lightgbm.readthedocs.io/), XGBoost (https://xgboost.readthedocs.io/), CatBoost (https://catboost.ai/), and InterpretML (https://interpret.ml/).

This book contains many examples of source code both in numbered listings and in line with normal text. In both cases, source code is formatted in a fixed-width font like this to separate it from ordinary text. In many cases, the original source code has been reformatted; we’ve added line breaks and reworked indentation to accommodate the available page space in the book. Additionally, comments in the source code have often been removed from the listings when the code is described in the text. Code annotations accompany many of the listings, highlighting important concepts.

liveBook discussion forum

Purchase of Ensemble Methods for Machine Learning includes free access to liveBook, Manning’s online reading platform. Using liveBook’s exclusive discussion features, you can attach comments to the book globally or to specific sections or paragraphs. It’s a snap to make notes for yourself, ask and answer technical questions, and receive help from the author and other users. To access the forum, go to https://livebook.manning.com/book/ensemble-methods-for-machine-learning/discussion. You can also learn more about Manning’s forums and the rules of conduct at https://livebook.manning.com/discussion.

Manning’s commitment to our readers is to provide a venue where a meaningful dialogue between individual readers and between readers and the author can take place. It’s not a commitment to any specific amount of participation on the part of the author, whose contribution to the forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions lest his interest stray! The forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.
