Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions

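"Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions" is an in-depth treatment of ensemble methods in data mining. It is written for both newcomers and advanced analytics researchers, particularly practitioners in engineering, statistics, and computer science. The authors, Giovanni Seni and John F. Elder, show how ensemble learning can be used to improve predictive accuracy. Code snippets in the book are given in R so that readers can follow and try out the algorithms.

Ensemble learning is a machine learning strategy that combines the predictions of several models to improve overall predictive performance. The core idea is diversity plus averaging: even when each model is only moderately accurate, the combined prediction tends to be more accurate than a single model, provided the models do not all make the same errors. In data mining, ensemble methods have proven effective for both classification and regression tasks.

The book covers the basic ensemble concepts of bagging (bootstrap aggregating), boosting, and stacking. Bagging trains multiple models on bootstrap samples drawn with replacement from the original data set and combines their predictions, which mainly reduces variance and therefore overfitting; Random Forest is a typical bagging-based method. Boosting builds weak learners sequentially so that each new learner concentrates on the cases the current ensemble handles poorly: AdaBoost (Adaptive Boosting) does this by reweighting the training data, and Gradient Boosting by fitting each new learner to the gradient of the loss. Stacking feeds the predictions of several models into a meta-model that is trained to blend them.

The book may also discuss the strengths and challenges of ensembles, such as how to create diversity among models, how to evaluate and select base learners, and how to combine model predictions effectively. It may further present application cases from fields such as finance, healthcare, or social media analytics. Ensembles are not limited to combinations of decision trees or neural networks; support vector machines (SVM), k-nearest neighbors (k-NN), and other models can be combined as well, yielding more robust predictive systems.

Finally, the book may show how to implement these algorithms in R. R is widely used for data analysis and modeling because of its large collection of statistical and machine learning packages: the `caret` package can be used to build and compare models, `randomForest` implements random forests, and `gbm` implements gradient boosting machines. Overall, "Ensemble Methods in Data Mining" is a practitioner-oriented guide that pairs theory with practical technique, aimed at helping both new and experienced readers apply ensemble learning to improve predictive accuracy in data mining projects.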

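As a rough illustration of the bagging idea summarized above, the following Python sketch trains one-split regression stumps on bootstrap samples and averages their predictions. It is not the book's R code: the toy data, the stump learner, and the names `fit_stump`, `bagged_stumps`, `predict_mean`, and `mse` are all assumptions made for this example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave points on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values identical: fall back to the mean
        mean_y = sum(ys) / len(ys)
        return lambda x: mean_y
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train n_models stumps, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_mean(models, x):
    """Combine the members by simple averaging."""
    return sum(m(x) for m in models) / len(models)


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy problem: noisy samples of y = x^2 on [0, 1].
data_rng = random.Random(42)
train_x = [data_rng.random() for _ in range(60)]
train_y = [x * x + data_rng.gauss(0, 0.1) for x in train_x]
test_x = [i / 50 for i in range(51)]
test_y = [x * x for x in test_x]

models = bagged_stumps(train_x, train_y)
ensemble_mse = mse(lambda x: predict_mean(models, x), test_x, test_y)
member_mses = [mse(m, test_x, test_y) for m in models]
mean_member_mse = sum(member_mses) / len(member_mses)
print(f"ensemble MSE: {ensemble_mse:.4f}  mean member MSE: {mean_member_mse:.4f}")
```

For squared error, the averaged prediction can never do worse than the average of its members, because the squared-error function is convex. That guarantee says nothing about beating the single best member, which depends on how diverse the members' errors are.
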
Foreword by Jaffray Woodriff

Giovanni and John will be at the forefront of developments in this area, and, if I am lucky, I will be involved as well.

Jaffray Woodriff

CEO, Quantitative Investment Management

Charlottesville, Virginia

January 2010

[Editor’s note: Mr. Woodriff’s investment firm has experienced consistently positive results, and has grown to be the largest hedge fund manager in the South-East U.S.]

Foreword by Tin Kam Ho

Fruitful solutions to a challenging task have often been found to come from combining an ensemble of experts. Yet for algorithmic solutions to a complex classification task, the utilities of ensembles were first witnessed only in the late 1980’s, when the computing power began to support the exploration and deployment of a rich set of classification methods simultaneously. The next two decades saw more and more such approaches come into the research arena, and the development of several consistently successful strategies for ensemble generation and combination. Today, while a complete explanation of all the elements remains elusive, the ensemble methodology has become an indispensable tool for statistical learning. Every researcher and practitioner involved in predictive classification problems can benefit from a good understanding of what is available in this methodology.

This book by Seni and Elder provides a timely, concise introduction to this topic. After an intuitive, highly accessible sketch of the key concerns in predictive learning, the book takes the readers through a shortcut into the heart of the popular tree-based ensemble creation strategies, and follows that with a compact yet clear presentation of the developments in the frontiers of statistics, where active attempts are being made to explain and exploit the mysteries of ensembles through conventional statistical theory and methods. Throughout the book, the methodology is illustrated with varied real-life examples, and augmented with implementations in R-code for the readers to obtain first-hand experience. For practitioners, this handy reference opens the door to a good understanding of this rich set of tools that holds high promises for the challenging tasks they face. For researchers and students, it provides a succinct outline of the critically relevant pieces of the vast literature, and serves as an excellent summary for this important topic.

The development of ensemble methods is by no means complete. Among the most interesting open challenges are a more thorough understanding of the mathematical structures, mapping of the detailed conditions of applicability, finding scalable and interpretable implementations, dealing with incomplete or imbalanced training samples, and evolving models to adapt to environmental changes. It will be exciting to see this monograph encourage talented individuals to tackle these problems in the coming decades.

Tin Kam Ho

Bell Labs, Alcatel-Lucent

January 2010

1. ENSEMBLES DISCOVERED

How can we tell, ahead of time, which algorithm will excel for a given problem? Michie et al. (1994) addressed this question by executing a similar but larger study (23 algorithms on 22 data sets) and building a decision tree to predict the best algorithm to use given the properties of a data set. The study was skewed toward trees: they were 9 of the 23 algorithms, and several of the (academic) data sets had unrealistic thresholds amenable to trees. Even so, it did reveal useful lessons for algorithm selection (as highlighted in Elder, J. (1996a)).

Still, there is a way to improve model accuracy that is easier and more powerful than judicious algorithm selection: one can gather models into ensembles. Figure 1.2 reveals the out-of-sample accuracy of the models of Figure 1.1 when they are combined four different ways, including averaging, voting, and “advisor perceptrons” (Elder and Lee, 1997). While the ensemble technique of advisor perceptrons beats simple averaging on every problem, the difference is small compared to the difference between ensembles and the single models. Every ensemble method competes well here against the best of the individual algorithms.
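
One way to see why even simple voting can beat the individual models is an idealized calculation. The Python sketch below computes the accuracy of a majority vote over n classifiers under the strong assumption that each is correct with the same probability p and that their errors are independent. Real models trained on the same data are correlated, so this is an upper-bound style illustration and not a reproduction of the results in Figure 1.2; the function name `majority_vote_accuracy` is made up for this example.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent voters is correct.

    Assumes every voter is correct with probability p, independently of
    the others, and that n is odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Each hypothetical voter is right 60% of the time.
for n in (1, 5, 21):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the vote improves as members are added whenever p is above 0.5. The gain shrinks as the members' errors become more correlated.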

This phenomenon was discovered by a handful of researchers, separately and simultaneously, to improve classification whether using decision trees (Ho, Hull, and Srihari, 1990), neural networks (Hansen and Salamon, 1990), or math theory (Kleinberg, E., 1990). The most influential early developments were by Breiman, L. (1996) with Bagging, and Freund and Schapire (1996) with AdaBoost (both described in Chapter 4).

One of us stumbled across the marvel of ensembling (which we called “model fusion” or “bundling”) while striving to predict the species of bats from features of their echo-location signals (Elder, J., 1996b). We built the best model we could with each of several very different algorithms, such as decision trees, neural networks, polynomial networks, and nearest neighbors (see Nisbet et al. (2009) for algorithm descriptions). These methods employ different basis functions and training procedures, which causes their diverse surface forms, as shown in Figure 1.3, and often leads to surprisingly different prediction vectors, even when the aggregate performance is very similar.
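
The point about prediction vectors can be made concrete with a small constructed example. In the Python sketch below, three hypothetical models have identical accuracy but make their mistakes on different cases, so their pairwise disagreement is high and a majority vote recovers every label. The label vectors are invented for illustration and are a deliberate best case (no two models err on the same case); they are not the bat data.

```python
def accuracy(pred, truth):
    """Fraction of cases where the prediction matches the true label."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def disagreement(pred_a, pred_b):
    """Fraction of cases where two models predict different labels."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def majority_vote(*preds):
    """Per-case majority label over binary (0/1) prediction vectors."""
    n_models = len(preds)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*preds)]


# Invented binary labels and predictions (assumption: errors never overlap).
truth   = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
model_a = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # wrong on cases 7 and 8
model_b = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]  # wrong on cases 0 and 4
model_c = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]  # wrong on cases 1 and 3

vote = majority_vote(model_a, model_b, model_c)

print("individual accuracies:", [accuracy(m, truth) for m in (model_a, model_b, model_c)])
print("disagreement a vs b:", disagreement(model_a, model_b))
print("vote accuracy:", accuracy(vote, truth))
```

If the three models instead erred on the same cases, their disagreement would be zero and the vote would gain nothing, which is why diversity among members matters as much as their individual accuracy.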

The project goal was to classify a bat’s species noninvasively, by using only its “chirps.” University of Illinois Urbana-Champaign biologists captured 19 bats, labeled each as one of 6 species, then recorded 98 signals, from which UIUC engineers calculated 35 time-frequency features. Figure 1.4 illustrates a two-dimensional projection of the data where each class is represented by a different color and symbol. The data displays useful clustering but also much class overlap to contend with.

Each bat contributed 3 to 8 signals, and we realized that the set of signals from a given bat had to be kept together (in either training or evaluation data) to fairly test the model’s ability to predict a species of an unknown bat. That is, any bat with a signal in the evaluation data must have no other

The researchers (Michie et al., 1994, Section 10.6) examined the results of one algorithm at a time and built a C4.5 decision tree (Quinlan, J., 1992) to separate those datasets where the algorithm was “applicable” (where it was within a tolerance of the best algorithm) from those where it was not. They also extracted rules from the tree models and used an expert system to adjudicate between conflicting rules to maximize net “information score.” The book is online at http://www.amsta.leeds.ac.uk/~charles/statlog/whole.pdf

Thanks to collaboration with Doug Jones and his EE students at the University of Illinois, Urbana-Champaign.

Features such as low frequency at the 3-decibel level, time position of the signal peak, and amplitude ratio of 1st and 2nd harmonics.
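
The requirement described above, that all signals from one bat stay on the same side of the training/evaluation split, can be sketched in Python as a split over bat identifiers, not over individual signals. The function name `split_by_group`, the 30% evaluation fraction, and the per-bat signal counts are assumptions for illustration; the counts were chosen only so that 19 bats give 98 signals with 3 to 8 signals each, and they are not the real data.

```python
import random


def split_by_group(group_ids, test_fraction=0.3, seed=0):
    """Split row indices so that no group appears in both parts.

    group_ids[i] is the group (here: the bat) that row i belongs to.
    Returns (train_indices, test_indices).
    """
    groups = sorted(set(group_ids))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train_idx = [i for i, g in enumerate(group_ids) if g not in test_groups]
    test_idx = [i for i, g in enumerate(group_ids) if g in test_groups]
    return train_idx, test_idx


# Made-up counts: 16 bats with 5 signals and 3 bats with 6 signals = 98 signals.
signals_per_bat = [5] * 16 + [6] * 3
bat_ids = []
for bat, count in enumerate(signals_per_bat):
    bat_ids.extend([bat] * count)

train_idx, test_idx = split_by_group(bat_ids)
train_bats = {bat_ids[i] for i in train_idx}
test_bats = {bat_ids[i] for i in test_idx}

print(len(train_idx), "training signals from", len(train_bats), "bats")
print(len(test_idx), "evaluation signals from", len(test_bats), "bats")
```

Splitting at the signal level would let the model see other chirps from the same animal during training, which would overstate how well it generalizes to an unknown bat. This sketch does not balance species across the two parts; with only 19 animals and 6 species that would need extra care.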
