Ensemble Methods in Data Mining: Strategies for Improving Prediction Accuracy


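Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions, by Giovanni Seni and John F. Elder, is published by Morgan & Claypool as Lecture #2 of the Synthesis Lectures on Data Mining and Knowledge Discovery, a series edited by Robert Grossman (University of Illinois, Chicago) that covers key topics in data mining.

Ensemble methods are a machine learning strategy that combines the predictions of several individual models or algorithms, which typically makes the result more accurate and more stable than any single model. The core idea is "collective wisdom": pooling models with different strengths reduces the overfitting or bias of any one of them, which matters most on complex problems such as text mining and social network analysis.

This description lists the following types of ensemble method:

1. **Bagging (Bootstrap Aggregating)**: trains multiple independent models on bootstrap resamples of the training set and combines them, reducing the error caused by random variation in the training sample.
2. **Boosting**: builds models in sequence, adjusting weights step by step so that many weak classifiers combine into one strong learner.
3. **Stacking**: feeds the outputs of several models as new features into another model, the meta-model.
4. **Random Forests**: an ensemble of decision trees in which each tree predicts from a random subset of the features and the trees vote on the final result.
5. **AdaBoost**: a boosting algorithm that adjusts sample weights dynamically so that later models concentrate on the examples that are hardest to classify.

The description also cites a 2009 title on modeling and data mining in the blogosphere as an example of mining social media for user behavior, sentiment trends, and potential commercial value.

All content is protected by copyright; no part may be reproduced, stored, or transmitted in any form without permission from Morgan & Claypool. The book is available in paperback (ISBN: 9781608452842) and as an ebook (ISBN: 9781608452859), with DOI 10.2200/S00240ED1V01Y200912DMK002. It is a practical guide to using ensemble methods to improve data mining accuracy, and a useful reference for data scientists, machine learning engineers, and researchers interested in advanced data analysis.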
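To make the bagging idea concrete, here is a minimal sketch in plain Python. It is not taken from the book (which uses R); the one-feature "decision stump" base model, the toy data, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are all assumptions chosen for illustration.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a threshold rule on one feature to (x, label) pairs with labels 0/1."""
    xs = sorted({x for x, _ in points})
    # Candidate thresholds: midpoints between adjacent distinct x values.
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for t in thresholds:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    _, t, left, right = best
    return lambda x: left if x <= t else right

def bagging_fit(points, n_models=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the models by majority vote (n_models is odd, so no ties)."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: label is 1 when x >= 5.
toy_data = [(float(x), int(x >= 5)) for x in range(10)]
models = bagging_fit(toy_data)
print(bagging_predict(models, 1.0), bagging_predict(models, 8.0))
```

Each stump sees a slightly different resample and so picks a slightly different threshold; the vote smooths out those differences. A random forest applies the same scheme to full decision trees and adds random feature subsets.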


Foreword by Jaffray Woodriff

Giovanni and John will be at the forefront of developments in this area, and, if I am lucky, I will be involved as well.

Jaffray Woodriff

CEO, Quantitative Investment Management

Charlottesville, Virginia

January 2010

[Editor's note: Mr. Woodriff's investment firm has experienced consistently positive results, and has grown to be the largest hedge fund manager in the South-East U.S.]

Foreword by Tin Kam Ho

Fruitful solutions to a challenging task have often been found to come from combining an ensemble of experts. Yet for algorithmic solutions to a complex classification task, the utilities of ensembles were first witnessed only in the late 1980's, when the computing power began to support the exploration and deployment of a rich set of classification methods simultaneously. The next two decades saw more and more such approaches come into the research arena, and the development of several consistently successful strategies for ensemble generation and combination. Today, while a complete explanation of all the elements remains elusive, the ensemble methodology has become an indispensable tool for statistical learning. Every researcher and practitioner involved in predictive classification problems can benefit from a good understanding of what is available in this methodology.

This book by Seni and Elder provides a timely, concise introduction to this topic. After an intuitive, highly accessible sketch of the key concerns in predictive learning, the book takes the readers through a shortcut into the heart of the popular tree-based ensemble creation strategies, and follows that with a compact yet clear presentation of the developments in the frontiers of statistics, where active attempts are being made to explain and exploit the mysteries of ensembles through conventional statistical theory and methods. Throughout the book, the methodology is illustrated with varied real-life examples, and augmented with implementations in R-code for the readers to obtain first-hand experience. For practitioners, this handy reference opens the door to a good understanding of this rich set of tools that holds high promises for the challenging tasks they face. For researchers and students, it provides a succinct outline of the critically relevant pieces of the vast literature, and serves as an excellent summary for this important topic.

The development of ensemble methods is by no means complete. Among the most interesting open challenges are a more thorough understanding of the mathematical structures, mapping of the detailed conditions of applicability, finding scalable and interpretable implementations, dealing with incomplete or imbalanced training samples, and evolving models to adapt to environmental changes. It will be exciting to see this monograph encourage talented individuals to tackle these problems in the coming decades.

Tin Kam Ho

Bell Labs, Alcatel-Lucent

January 2010

1. ENSEMBLES DISCOVERED

How can we tell, ahead of time, which algorithm will excel for a given problem? Michie et al. (1994) addressed this question by executing a similar but larger study (23 algorithms on 22 data sets) and building a decision tree to predict the best algorithm to use given the properties of a data set¹. Though the study was skewed toward trees (they were 9 of the 23 algorithms, and several of the academic data sets had unrealistic thresholds amenable to trees), it did reveal useful lessons for algorithm selection (as highlighted in Elder, J. (1996a)).

Still, there is a way to improve model accuracy that is easier and more powerful than judicious algorithm selection: one can gather models into ensembles. Figure 1.2 reveals the out-of-sample accuracy of the models of Figure 1.1 when they are combined four different ways, including averaging, voting, and "advisor perceptrons" (Elder and Lee, 1997). While the ensemble technique of advisor perceptrons beats simple averaging on every problem, the difference is small compared to the difference between ensembles and the single models. Every ensemble method competes well here against the best of the individual algorithms.
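The two simplest combination rules named above, averaging and voting, can be sketched in a few lines of Python. This is an illustrative sketch only: the model outputs below are made-up numbers, not results from Figure 1.2, and the helper names `average_scores` and `majority_vote` are hypothetical. Advisor perceptrons are not shown.

```python
from collections import Counter
from statistics import mean

def average_scores(score_lists):
    """Average the per-example scores of several models (one list per model)."""
    return [mean(scores) for scores in zip(*score_lists)]

def majority_vote(label_lists):
    """Pick the most common label per example (one list of labels per model)."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*label_lists)]

# Hypothetical class-1 probabilities from three models on three examples.
model_scores = [
    [0.9, 0.4, 0.2],
    [0.8, 0.6, 0.1],
    [0.4, 0.7, 0.3],
]

averaged = average_scores(model_scores)
labels_from_average = [int(score >= 0.5) for score in averaged]

# Voting works on hard labels, so threshold each model's scores first.
model_labels = [[int(s >= 0.5) for s in scores] for scores in model_scores]
labels_from_vote = majority_vote(model_labels)

print(labels_from_average, labels_from_vote)
```

With an odd number of binary voters there are no ties; with an even number, or with more than two classes, a tie-breaking rule would have to be chosen.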

This phenomenon was discovered by a handful of researchers, separately and simultaneously, to improve classification whether using decision trees (Ho, Hull, and Srihari, 1990), neural networks (Hansen and Salamon, 1990), or math theory (Kleinberg, E., 1990). The most influential early developments were by Breiman, L. (1996) with Bagging, and Freund and Schapire (1996) with AdaBoost (both described in Chapter 4).

One of us stumbled across the marvel of ensembling (which we called "model fusion" or "bundling") while striving to predict the species of bats from features of their echo-location signals (Elder, J., 1996b)². We built the best model we could with each of several very different algorithms, such as decision trees, neural networks, polynomial networks, and nearest neighbors (see Nisbet et al. (2009) for algorithm descriptions). These methods employ different basis functions and training procedures, which causes their diverse surface forms – as shown in Figure 1.3 – and often leads to surprisingly different prediction vectors, even when the aggregate performance is very similar.

The project goal was to classify a bat's species noninvasively, by using only its "chirps." University of Illinois Urbana-Champaign biologists captured 19 bats, labeled each as one of 6 species, then recorded 98 signals, from which UIUC engineers calculated 35 time-frequency features³. Figure 1.4 illustrates a two-dimensional projection of the data where each class is represented by a different color and symbol. The data displays useful clustering but also much class overlap to contend with.

Each bat contributed 3 to 8 signals, and we realized that the set of signals from a given bat had to be kept together (in either training or evaluation data) to fairly test the model's ability to predict the species of an unknown bat. That is, any bat with a signal in the evaluation data must have no other

¹ The researchers (Michie et al., 1994, Section 10.6) examined the results of one algorithm at a time and built a C4.5 decision tree (Quinlan, J., 1992) to separate those datasets where the algorithm was "applicable" (where it was within a tolerance of the best algorithm) from those where it was not. They also extracted rules from the tree models and used an expert system to adjudicate between conflicting rules to maximize net "information score." The book is online at http://www.amsta.leeds.ac.uk/~charles/statlog/whole.pdf

² Thanks to collaboration with Doug Jones and his EE students at the University of Illinois, Urbana-Champaign.

³ Features such as low frequency at the 3-decibel level, time position of the signal peak, and amplitude ratio of 1st and 2nd harmonics.
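The constraint described in the text, that all signals from one bat stay on the same side of the training/evaluation split, can be sketched as a group-wise split. This is a minimal sketch under stated assumptions, not the authors' procedure: the record layout `(bat_id, signal_index)`, the function name `split_by_group`, the 25% evaluation fraction, and the synthetic data are all hypothetical.

```python
import random

def split_by_group(records, group_key, eval_fraction=0.25, seed=0):
    """Split records so that every group lands wholly in train or in evaluation."""
    groups = sorted({group_key(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out whole groups, not individual records.
    n_eval = max(1, round(len(groups) * eval_fraction))
    eval_groups = set(groups[:n_eval])
    train = [r for r in records if group_key(r) not in eval_groups]
    evaluation = [r for r in records if group_key(r) in eval_groups]
    return train, evaluation

# Synthetic stand-in data: 19 bats with 3 to 8 signals each (counts are made up).
signals = [
    (f"bat{b:02d}", i)
    for b in range(19)
    for i in range(3 + b % 6)
]

train, evaluation = split_by_group(signals, group_key=lambda r: r[0])
train_bats = {bat for bat, _ in train}
eval_bats = {bat for bat, _ in evaluation}
print(len(train_bats), len(eval_bats), train_bats & eval_bats)
```

Splitting at the signal level instead would let the model see other chirps from an evaluation bat during training, so the measured accuracy would overstate how well it handles a bat it has never encountered.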
