the best attribute to branch on. Observe that this is a greedy choice and does not take
into account the effect of future decisions. As stated earlier, the tree-growing continues
until termination criteria such as purity of subdatasets are met. In the above example,
branching on the value “Overcast” for Outlook results in a pure dataset, that is, all
instances having this value for Outlook have the value “Yes” for the class variable
PlayGolf?; hence, the tree is not grown further in that direction. However, the other two
values for Outlook still induce impure datasets. Therefore the algorithm recurses, but
observe that Outlook cannot be chosen again (why?). For different branches, different
test criteria and splits are chosen, although, in general, duplication of subtrees can
occur for other datasets.
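To make the greedy recursion concrete, the following is a minimal sketch in Python of the tree-growing loop just described; the dict-based dataset layout and the helper names (entropy, gain, grow) are our own illustrative choices, not Quinlan's implementation:

from collections import Counter
import math

def entropy(labels):
    # Entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    # Information gain of splitting on a nominal attribute.
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys)
                                 for ys in by_value.values())

def grow(rows, labels, attrs):
    # Greedy recursive growing: stop on a pure subdataset, or take a
    # majority vote when no attributes remain to branch on.
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {best: {}}
    for v in set(r[best] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        node[best][v] = grow([r for r, _ in sub], [y for _, y in sub], remaining)
    return node

Note how remaining drops the attribute just branched on: every instance in a subdataset shares a single value for that attribute, so testing it again can yield no further gain.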
We mentioned earlier that the default splitting criterion is actually the gain ratio, not
the gain. To understand the difference, assume we treated the Day column in Figure 1.1
as if it were a “real” feature. Furthermore, assume that we treat it as a nominal-valued
attribute. Of course, each day is unique, so Day is really not a useful attribute to
branch on. Nevertheless, because there are 14 distinct values for Day and each of
them induces a “pure” dataset (a trivial dataset involving only one instance), Day
would be unfairly selected as the best attribute to branch on. Because information
gain favors attributes that contain a large number of values, Quinlan proposed the
gain ratio as a correction to account for this effect. The gain ratio for an attribute a is
defined as:
\[
\mathrm{GainRatio}(a) = \frac{\mathrm{Gain}(a)}{\mathrm{Entropy}(a)}
\]
Observe that Entropy(a) does not depend on the class information and simply takes
into account the distribution of possible values for attribute a, whereas Gain(a) does
take the class information into account. (Also, recall that all calculations here depend
on the dataset used, although we haven’t made this explicit in the notation.)
For instance, GainRatio(Outlook) = 0.246/1.577 = 0.156. Similarly, the gain ratio
for the other attributes can be calculated. We leave it as an exercise to the reader to
see if Outlook will again be chosen to form the root decision test.
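As a check on these figures, here is a short sketch, assuming the standard 14-instance PlayGolf data from Figure 1.1 (Outlook: 5 Sunny with 2 Yes/3 No, 4 Overcast with 4 Yes, 5 Rain with 3 Yes/2 No), that reproduces the values 0.246, 1.577, and 0.156:

import math

def entropy(counts):
    # Entropy of a discrete distribution given by counts.
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

class_entropy = entropy([9, 5])                    # 9 Yes / 5 No: ~0.940
splits = [(5, [2, 3]), (4, [4, 0]), (5, [3, 2])]   # Sunny, Overcast, Rain
weighted = sum(size / 14 * entropy(dist) for size, dist in splits)
gain_outlook = class_entropy - weighted            # ~0.246
split_entropy = entropy([5, 4, 5])                 # Entropy(Outlook): ~1.577
print(gain_outlook / split_entropy)                # ~0.156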
At this point in the discussion, it should be mentioned that decision trees cannot
model all decision boundaries between classes in a succinct manner. For instance,
although they can model any Boolean function, the resulting tree might be needlessly
complex. Consider, for instance, modeling an XOR over a large number of Boolean
attributes. In this case every attribute would need to be tested along every path and
the tree would be exponential in size. Another example of a difficult problem for
decision trees is the class of so-called “m-of-n” functions, where the class is positive
whenever at least m of the n attributes are, without specifying which attributes should contribute to
the decision. Solutions such as oblique decision trees, presented later, overcome such
drawbacks. Besides this difficulty, a second problem with decision trees induced by
C4.5 is the duplication of subtrees due to the greedy choice of attribute selection.
Short of an exhaustive search that evaluates each candidate attribute by fully growing
the tree, this problem is not solvable in general.
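The XOR blow-up is easy to observe empirically. The sketch below fits an exact tree to the full truth table of the n-bit XOR (parity) function and prints the node count; we use scikit-learn's CART-style learner as a stand-in for C4.5, an assumption made purely for convenience:

from itertools import product
from sklearn.tree import DecisionTreeClassifier

for n in range(2, 9):
    X = list(product([0, 1], repeat=n))   # full truth table of n bits
    y = [sum(x) % 2 for x in X]           # XOR (parity) of the bits
    clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    # Every attribute is tested along every path, so the node count
    # roughly doubles with each additional attribute.
    print(n, clf.tree_.node_count)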