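The Top Ten Algorithms in Data Mining (2009), Explained - X. Wu & V. Kumar

"The Top Ten Algorithms in Data Mining 2009 - X. Wu & V. Kumar"

In data mining, choosing the right algorithm is critical to solving a problem. In 2009, X. Wu and V. Kumar presented a list of ten leading algorithms in the field. These algorithms are influential in both academia and industry and form a foundation for understanding and applying data mining techniques. Each is described briefly below:

1. **Apriori**: An association rule learning algorithm that finds frequent itemsets in a database. It iteratively generates candidate itemsets and counts their support, pruning any candidate that has an infrequent subset so that fewer candidates must be counted on each pass over the database.
2. **k-means**: An iterative clustering algorithm that partitions the data into k clusters by alternately assigning each point to its nearest centroid and recomputing the centroids.
3. **C4.5**: A decision tree learner for classification and the successor to ID3 (Iterative Dichotomiser 3), which selects split attributes by entropy and information gain. C4.5 addresses several of ID3's limitations: it handles continuous attributes and missing values, and uses the gain ratio as its splitting criterion.
4. **K-Nearest Neighbors (KNN)**: An instance-based learning method for classification and regression. It assigns a new sample to the majority class among its nearest neighbors, with distance usually measured as Euclidean distance.
5. **Naive Bayes**: A probabilistic classifier based on Bayes' theorem that assumes the features are mutually independent. Although this "naive" assumption is often an oversimplification, the method performs well on many practical problems.
6. **SVM (Support Vector Machines)**: A supervised learning model that separates the data with a maximum-margin hyperplane. SVMs perform especially well in high-dimensional spaces and can be applied to nonlinear problems.
7. **CART (Classification and Regression Trees)**: Builds binary decision trees for both classification and regression. It chooses the best split point by minimizing an impurity measure such as the Gini index.
8. **EM (Expectation-Maximization)**: An iterative method for estimating the parameters of mixture models, such as Gaussian mixture models. It alternates an expectation step and a maximization step until convergence.
9. **PageRank**: A core algorithm of the Google search engine for assessing the importance of web pages. It ranks pages by modeling the behavior of a user browsing the web at random.
10. **AdaBoost**: An ensemble method that combines many weak classifiers into a strong one by reweighting the training instances at each round so that later classifiers concentrate on previously misclassified instances.

These algorithms form a basic toolbox for data mining, and each suits different problems and data types. Knowing them well helps data scientists extract valuable information from large volumes of data. New algorithms continue to appear as data science develops, but these classics remain important.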

C4.5

1.2 Algorithm Description

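Figure 1.2 Decision tree induced by C4.5 for the dataset of Figure 1.1. [Tree diagram: the root tests Outlook, with branches Sunny, Overcast, and Rainy; lower nodes test Humidity (branches ≤75 and >75) and Windy (branches True and False); the leaves are labeled Yes or No.]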

Figure 1.1 presents the classical “golf” dataset, which is bundled with the C4.5 installation. As stated earlier, the goal is to predict whether the weather conditions on a particular day are conducive to playing golf. Recall that some of the features are continuous-valued while others are categorical.

Figure 1.2 illustrates the tree induced by C4.5 using Figure 1.1 as training data (and the default options). Let us look at the various choices involved in inducing such trees from the data.

What types of tests are possible? As Figure 1.2 shows, C4.5 is not restricted to considering binary tests, and allows tests with two or more outcomes. If the attribute is Boolean, the test induces two branches. If the attribute is categorical, the test is multivalued, but different values can be grouped into a smaller set of options with one class predicted for each option. If the attribute is numerical, then the tests are again binary-valued, and of the form {≤ θ?, > θ?}, where θ is a suitably determined threshold for that attribute.

How are tests chosen? C4.5 uses information-theoretic criteria such as gain (reduction in entropy of the class distribution due to applying a test) and gain ratio (a way to correct for the tendency of gain to favor tests with many outcomes). The default criterion is gain ratio. At each point in the tree-growing, the test with the best value of the criterion is greedily chosen.

How are test thresholds chosen? As stated earlier, for Boolean and categorical attributes, the test values are simply the different possible instantiations of that attribute. For numerical attributes, the threshold is obtained by sorting on that attribute and choosing the split between successive values that maximizes the criterion above. Fayyad and Irani [10] showed that not all successive values need to be considered. For two successive values $v_i$ and $v_{i+1}$ of a continuous-valued attribute, if all instances involving $v_i$ and all instances involving $v_{i+1}$ belong to the same class, then splitting between them cannot possibly improve information gain (or gain ratio).

How is tree-growing terminated? A branch from a node is declared to lead to a leaf if the set of instances covered by that branch is pure, that is, all of them belong to the same class. Another way in which tree-growing is terminated is if the number of instances falls below a specified threshold.

How are class labels assigned to the leaves? The majority class of the instances assigned to the leaf is taken to be the class prediction of that subbranch of the tree.
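Two of the choices above, threshold selection for numerical attributes and the stopping and labeling rule, can be sketched in a few lines of Python. This is an illustrative sketch, not C4.5's actual code: the function names are hypothetical, the use of midpoints as thresholds is an assumption (the text only says the split lies between successive values), and the default `min_instances=2` is likewise an assumed value.

```python
from collections import Counter

def candidate_thresholds(values, labels):
    """Thresholds between successive distinct values, skipping boundaries
    where all instances on both sides belong to a single class."""
    # Collect the set of class labels observed at each distinct value.
    classes_at = {}
    for v, y in zip(values, labels):
        classes_at.setdefault(v, set()).add(y)
    distinct = sorted(classes_at)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        # Only a boundary with more than one class around it can improve the gain.
        if len(classes_at[lo] | classes_at[hi]) > 1:
            cuts.append((lo + hi) / 2)  # midpoint: an assumption of this sketch
    return cuts

def leaf_label(labels, min_instances=2):
    """Return the majority class if growth should stop at this node, else None.
    Assumes `labels` is non-empty."""
    counts = Counter(labels)
    majority, _ = counts.most_common(1)[0]
    is_pure = len(counts) == 1
    too_small = len(labels) < min_instances
    return majority if (is_pure or too_small) else None

# Toy data (not the golf dataset): only the 75/80 boundary separates classes.
print(candidate_thresholds([65, 70, 70, 75, 80, 85],
                           ["Yes", "Yes", "Yes", "Yes", "No", "No"]))  # [77.5]
print(leaf_label(["Yes", "Yes", "Yes"]))  # Yes
print(leaf_label(["Yes", "No", "No"]))    # None
```

Skipping same-class boundaries only reduces the number of candidate splits to score; it does not change which split is finally chosen.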

The above questions are faced by any classification approach modeled after trees, and similar, or other reasonable, decisions are made by most tree induction algorithms. The practical utility of C4.5, however, comes from the next set of features that build upon the basic tree induction algorithm above. But before we present these features, it is instructive to instantiate Algorithm 1.1 for a simple dataset such as shown in Figure 1.1.

We will work out in some detail how the tree of Figure 1.2 is induced from Figure 1.1. Observe how the first attribute chosen for a decision test is the Outlook attribute. To see why, let us first estimate the entropy of the class random variable (PlayGolf?). This variable takes two values with probability 9/14 (for “Yes”) and 5/14 (for “No”). The entropy of a class random variable that takes on $c$ values with probabilities $p_1, p_2, \ldots, p_c$ is given by:

$$\sum_{i=1}^{c} -p_i \log_2 p_i$$

The entropy of PlayGolf? is thus

$$-(9/14)\log_2(9/14) - (5/14)\log_2(5/14)$$

or 0.940. This means that on average 0.940 bits must be transmitted to communicate information about the PlayGolf? random variable. The goal of C4.5 tree induction is to ask the right questions so that this entropy is reduced. We consider each attribute in turn to assess the improvement in entropy that it affords. For a given random variable, say Outlook, the improvement in entropy, represented as Gain(Outlook), is calculated as:

$$\text{Entropy}(\text{PlayGolf? in } D) - \sum_{v} \frac{|D_v|}{|D|}\,\text{Entropy}(\text{PlayGolf? in } D_v)$$

where $v$ ranges over the set of possible values (in this case, three values for Outlook), $D$ denotes the entire dataset, $D_v$ is the subset of the dataset for which attribute Outlook has that value, and the notation $|\cdot|$ denotes the size of a dataset (in the number of instances).

This calculation will show that Gain(Outlook) is 0.940 − 0.694 = 0.246. Similarly, we can calculate that Gain(Windy) is 0.940 − 0.892 = 0.048. Working out the above calculations for the other attributes systematically will reveal that Outlook is indeed the best attribute to branch on. Observe that this is a greedy choice and does not take into account the effect of future decisions. As stated earlier, the tree-growing continues till termination criteria such as purity of subdatasets are met. In the above example, branching on the value “Overcast” for Outlook results in a pure dataset, that is, all instances having this value for Outlook have the value “Yes” for the class variable PlayGolf?; hence, the tree is not grown further in that direction. However, the other two values for Outlook still induce impure datasets. Therefore the algorithm recurses, but observe that Outlook cannot be chosen again (why?). For different branches, different test criteria and splits are chosen, although, in general, duplication of subtrees can possibly occur for other datasets.
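The entropy and gain calculations above can be checked with a short Python sketch. Figure 1.1 is not reproduced in this excerpt, so the per-value class counts below are an assumption taken from the commonly circulated version of the golf dataset (9 “Yes” and 5 “No” overall); the function names are illustrative.

```python
import math

def entropy(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(branches):
    """Information gain of a test. `branches` maps each outcome of the test
    to a (yes_count, no_count) pair for the instances taking that outcome."""
    class_totals = [sum(col) for col in zip(*branches.values())]
    n = sum(class_totals)
    # Weighted average entropy of the subsets D_v.
    remainder = sum(sum(b) / n * entropy(b) for b in branches.values())
    return entropy(class_totals) - remainder

# Assumed class counts (Yes, No) per attribute value.
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}
windy = {"True": (3, 3), "False": (6, 2)}

print(f"{entropy([9, 5]):.3f}")  # 0.940
print(f"{gain(outlook):.3f}")    # 0.247
print(f"{gain(windy):.3f}")      # 0.048
```

At full precision Gain(Outlook) comes out near 0.247; the 0.246 quoted in the text results from subtracting the already rounded values 0.940 and 0.694.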

We mentioned earlier that the default splitting criterion is actually the gain ratio, not the gain. To understand the difference, assume we treated the Day column in Figure 1.1 as if it were a “real” feature. Furthermore, assume that we treat it as a nominal valued attribute. Of course, each day is unique, so Day is really not a useful attribute to branch on. Nevertheless, because there are 14 distinct values for Day and each of them induces a “pure” dataset (a trivial dataset involving only one instance), Day would be unfairly selected as the best attribute to branch on. Because information gain favors attributes that contain a large number of values, Quinlan proposed the gain ratio as a correction to account for this effect. The gain ratio for an attribute $a$ is defined as:

$$\text{GainRatio}(a) = \frac{\text{Gain}(a)}{\text{Entropy}(a)}$$

Observe that Entropy(a) does not depend on the class information and simply takes into account the distribution of possible values for attribute $a$, whereas Gain(a) does take into account the class information. (Also, recall that all calculations here are dependent on the dataset used, although we haven’t made this explicit in the notation.) For instance, GainRatio(Outlook) = 0.246/1.577 = 0.156. Similarly, the gain ratio for the other attributes can be calculated. We leave it as an exercise to the reader to see if Outlook will again be chosen to form the root decision test.
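As a sketch of the gain ratio defined above, the following Python code recomputes GainRatio(Outlook). The class counts per Outlook value are again an assumption, since Figure 1.1 is not shown in this excerpt, and the function names are illustrative. It also prints the denominator for a 14-valued attribute such as Day, to show how quickly it grows with the number of distinct values.

```python
import math

def entropy(counts):
    """Entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def split_entropy(branches):
    """Entropy(a): entropy of the attribute's own value distribution,
    computed from branch sizes only (class labels are ignored)."""
    return entropy([sum(b) for b in branches.values()])

def gain_ratio(branches):
    """Gain(a) / Entropy(a) for a test whose outcomes map to (yes, no) counts."""
    class_totals = [sum(col) for col in zip(*branches.values())]
    n = sum(class_totals)
    gain = entropy(class_totals) - sum(
        sum(b) / n * entropy(b) for b in branches.values())
    return gain / split_entropy(branches)

# Assumed class counts (Yes, No) per Outlook value.
outlook = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}
# Day has 14 distinct values with one instance each (9 Yes, 5 No overall).
day = {f"D{i}": ((1, 0) if i <= 9 else (0, 1)) for i in range(1, 15)}

print(f"{split_entropy(outlook):.3f}")  # 1.577
print(f"{gain_ratio(outlook):.3f}")     # 0.156
print(f"{split_entropy(day):.3f}")      # 3.807
```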

At this point in the discussion, it should be mentioned that decision trees cannot model all decision boundaries between classes in a succinct manner. For instance, although they can model any Boolean function, the resulting tree might be needlessly complex. Consider, for instance, modeling an XOR over a large number of Boolean attributes. In this case every attribute would need to be tested along every path and the tree would be exponential in size. Another example of a difficult problem for decision trees is the so-called “m-of-n” function, where the class is predicted by any m of n attributes, without being specific about which attributes should contribute to the decision. Solutions such as oblique decision trees, presented later, overcome such drawbacks. Besides this difficulty, a second problem with decision trees induced by C4.5 is the duplication of subtrees due to the greedy choice of attribute selection. Beyond an exhaustive search for the best attribute by fully growing the tree, this problem is not solvable in general.


1.3 C4.5 Features

1.3.1 Tree Pruning

Tree pruning is necessary to avoid overfitting the data. To drive this point home, Quinlan gives a dramatic example in [30] of a dataset with 10 Boolean attributes, each of which assumes values 0 or 1 with equal probability. The class values were also binary: “yes” with probability 0.25 and “no” with probability 0.75. From a starting set of 1,000 instances, 500 were used for training and the remaining 500 were used for testing. Quinlan observes that C4.5 produces a tree involving 119 nodes (!) with an error rate of more than 35% when a simpler tree would have sufficed to achieve a greater accuracy. Tree pruning is hence critical to improve the accuracy of the classifier on unseen instances. It is typically carried out after the tree is fully grown, and in a bottom-up manner.

The 1986 MIT AI lab memo authored by Quinlan [26] outlines the various choices available for tree pruning in the context of past research. The CART algorithm uses what is known as cost-complexity pruning, where a series of trees is grown, each obtained from the previous by replacing one or more subtrees with a leaf. The last tree in the series comprises just a single leaf that predicts a specific class. The cost-complexity is a metric that decides which subtrees should be replaced by a leaf predicting the best class value. Each of the trees is then evaluated on a separate test dataset, and based on reliability measures derived from performance on the test dataset, a “best” tree is selected.

Reduced error pruning is a simplification of this approach. As before, it uses a separate test dataset but it directly uses the fully induced tree to classify instances in the test dataset. For every nonleaf subtree in the induced tree, this strategy evaluates whether it is beneficial to replace the subtree by the best possible leaf. If the pruned tree would indeed give an equal or smaller number of errors than the unpruned tree and the replaced subtree does not itself contain another subtree with the same property, then the subtree is replaced. This process is continued until further replacements actually increase the error over the test dataset.
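The reduced error pruning procedure described above can be sketched as a bottom-up recursion. This is a minimal illustration under stated assumptions: the `Node` class is a hypothetical tree representation, only categorical tests are handled, and the “best possible leaf” is taken to be the majority training class stored at each node.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    label: str                        # majority training class at this node
    attribute: Optional[str] = None   # None marks a leaf
    children: dict = field(default_factory=dict)  # attribute value -> Node

def classify(node, x):
    """Follow the tests down the tree; fall back to the node's label
    if an attribute value has no matching branch."""
    while node.attribute is not None:
        child = node.children.get(x[node.attribute])
        if child is None:
            break
        node = child
    return node.label

def errors(node, data):
    """Number of (instance, class) pairs in `data` that `node` misclassifies."""
    return sum(classify(node, x) != y for x, y in data)

def reduced_error_prune(node, holdout):
    """Prune bottom-up: replace a subtree by a leaf when the leaf makes
    no more errors on the held-out instances that reach the subtree."""
    if node.attribute is None:
        return node
    for value, child in list(node.children.items()):
        reaching = [(x, y) for x, y in holdout if x[node.attribute] == value]
        node.children[value] = reduced_error_prune(child, reaching)
    leaf = Node(label=node.label)
    return leaf if errors(leaf, holdout) <= errors(node, holdout) else node

# Toy tree: the test on "b" fits noise and does not help on held-out data.
tree = Node("Yes", "a", {
    0: Node("Yes"),
    1: Node("No", "b", {0: Node("No"), 1: Node("Yes")}),
})
holdout = [({"a": 1, "b": 1}, "No"), ({"a": 1, "b": 0}, "No"),
           ({"a": 0, "b": 0}, "Yes")]
pruned = reduced_error_prune(tree, holdout)
print(pruned.attribute, pruned.children[1].attribute)  # a None
```

Because children are pruned before their parent is examined, a subtree is only considered for replacement after every subtree inside it has already been dealt with.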

Pessimistic pruning is an innovation in C4.5 that does not require a separate test set. Rather, it estimates the error that might occur based on the number of misclassifications in the training set. This approach recursively estimates the error rate associated with a node based on the estimated error rates of its branches. For a leaf with $N$ instances and $E$ errors (i.e., the number of instances that do not belong to the class predicted by that leaf), pessimistic pruning first determines the empirical error rate at the leaf as the ratio $(E + 0.5)/N$. For a subtree with $L$ leaves and $\sum E$ and $\sum N$ corresponding errors and number of instances over these leaves, the error rate for the entire subtree is estimated to be $(\sum E + 0.5 \cdot L)/\sum N$. Now, assume that the subtree is replaced by its best leaf and that $J$ is the number of cases from the training set that it misclassifies. Pessimistic pruning replaces the subtree with this best leaf if $(J + 0.5)$ is within one standard deviation of $(\sum E + 0.5 \cdot L)$.

This approach can be extended to prune based on desired confidence intervals (CIs). We can model the error rates $e$ at the leaves as Bernoulli random variables and for a given confidence threshold CI, an upper bound $e_{\max}$ can be determined such that $e < e_{\max}$ with probability $1 - \text{CI}$. (C4.5 uses a default CI of 0.25.) We can go even further and approximate $e$ by the normal distribution (for large $N$), in which case C4.5 determines an upper bound on the expected error as:

$$\frac{e + \dfrac{z^2}{2N} + z\sqrt{\dfrac{e}{N} - \dfrac{e^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}} \tag{1.1}$$

where $z$ is chosen based on the desired confidence interval for the estimation, assuming a normal random variable with zero mean and unit variance, that is, $\mathcal{N}(0, 1)$.

Figure 1.3 Different choices in pruning decision trees. The tree on the left can be retained as it is or replaced by just one of its subtrees or by a single leaf. [Diagram; the single-leaf option is labeled “Leaf predicting most likely class”.]
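The upper bound of Equation 1.1 can be evaluated numerically with a short sketch. Obtaining $z$ as the one-sided standard normal quantile at $1 - \text{CI}$ is an assumption of this example rather than a description of C4.5's own code; the function name is illustrative.

```python
import math
from statistics import NormalDist

def upper_error_bound(e, n, ci=0.25):
    """Upper bound on the error rate from Equation 1.1.

    e: observed error rate at the leaf, n: number of instances,
    ci: confidence threshold (0.25 is the default quoted in the text).
    """
    # Assumption: z is the (1 - ci) quantile of the standard normal N(0, 1).
    z = NormalDist().inv_cdf(1 - ci)
    numerator = (e + z * z / (2 * n)
                 + z * math.sqrt(e / n - e * e / n + z * z / (4 * n * n)))
    return numerator / (1 + z * z / n)

# A leaf with 2 errors out of 6 instances: the bound is well above 1/3.
print(f"{upper_error_bound(2 / 6, 6):.2f}")    # 0.47
# With many more instances the bound tightens toward the observed rate.
print(f"{upper_error_bound(2 / 6, 600):.2f}")  # 0.35
```

The bound is pessimistic for small leaves and approaches the observed rate as $N$ grows, which is what makes small, noisy subtrees candidates for pruning.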

What remains to be presented is the exact way in which the pruning is performed. A single bottom-up pass is performed. Consider Figure 1.3, which depicts the pruning process midway so that pruning has already been performed on subtrees $T_1$, $T_2$, and $T_3$. The error rates are estimated for three cases as shown in Figure 1.3 (right). The first case is to keep the tree as it is. The second case is to retain only the subtree corresponding to the most frequent outcome of X (in this case, the middle branch). The third case is to just have a leaf labeled with the most frequent class in the training dataset. These considerations are continued bottom-up till we reach the root of the tree.

1.3.2 Improved Use of Continuous Attributes

More sophisticated capabilities for handling continuous attributes are covered by Quinlan in [31]. These are motivated by the advantage shared by continuous-valued attributes over discrete ones, namely that they can branch on more decision criteria which might give them an unfair advantage over discrete attributes. One approach, of course, is to use the gain ratio in place of the gain as before. However, we run into a conundrum here because the gain ratio will also be influenced by the actual threshold used by the continuous-valued attribute. In particular, if the threshold apportions the
