A comparative study of the decision tree algorithms ID3 and C4.5
Badr HSSINA, Abdelkarim MERBOUHA, Hanane EZZIKOURI, Mohammed ERRITALI
TIAD Laboratory, Computer Sciences Department, Faculty of Sciences and Techniques
Sultan Moulay Slimane University
Beni-Mellal, BP: 523, Morocco
Abstract—Data mining is a useful tool for discovering knowledge in large volumes of data, and many methods and algorithms are available for this purpose. Classification is the most common method for extracting rules from large databases. Decision trees are widely used for classification because their simple hierarchical structure aids user understanding and decision making. Various classification algorithms exist, based on artificial neural networks, the nearest-neighbour rule and Bayesian classifiers, but decision tree mining is among the simplest. The ID3 and C4.5 algorithms, introduced by J. R. Quinlan, produce reasonable decision trees. The objective of this paper is to present these algorithms. We first present the classical ID3 algorithm, then discuss in more detail C4.5, a natural extension of ID3. Finally, we compare these two algorithms with others such as C5.0 and CART.
Keywords—Data mining; classification algorithm; decision tree; ID3 algorithm; C4.5 algorithm
I. INTRODUCTION
The construction of decision trees from data is a longstanding discipline. Statisticians attribute its paternity to Sonquist and Morgan (1963) [4], who used regression trees for prediction and explanation (AID - Automatic Interaction Detection). A whole family of methods followed, extended to problems of discrimination and classification, all based on the same tree-representation paradigm (THAID - Morgan and Messenger, 1973; CHAID - Kass, 1980). It is generally considered that this approach culminated in the CART (Classification and Regression Tree) method of Breiman et al. (1984), described in detail in a monograph that remains a reference today [4].
In machine learning, most studies are based on information theory. It is customary to cite Quinlan's ID3 method (Induction of Decision Tree - Quinlan 1979), which itself relates to the work of Hunt (1962) [4]. Quinlan was a very active contributor in the second half of the 1980s, with a large number of publications in which he proposed heuristics to improve the behavior of the system. His approach took a significant turn in the 1990s when he presented the C4.5 method (1993), the other essential reference on decision trees. There have been many further evolutions of this algorithm, such as C5.0, but it is implemented in commercial software.
Classification methods aim to identify the classes to which objects belong based on descriptive traits. They find utility in a wide range of human activities, and particularly in automated decision making.
Decision trees are a very effective method of supervised learning. Their aim is to partition a dataset into groups that are as homogeneous as possible in terms of the variable to be predicted. A decision tree takes as input a set of classified data and outputs a tree resembling a flowchart, in which each terminal node (leaf) is a decision (a class) and each non-terminal (internal) node represents a test. Each leaf represents the decision of belonging to a class for the data satisfying all the tests on the path from the root to that leaf.
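As an illustration, here is a minimal sketch in Python of such a structure (the attribute and class names are hypothetical, chosen only for the example): internal nodes carry a test on an attribute, leaves carry a class, and classification follows the path of tests from the root to a leaf.

class Node:
    """An internal node tests an attribute; a leaf holds a class label."""
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute = attribute      # attribute tested at this node (None for a leaf)
        self.children = children or {}  # maps each attribute value to a subtree
        self.label = label              # decision (class) stored at a leaf

def classify(node, example):
    """Follow the path of tests from the root down to a leaf."""
    while node.attribute is not None:
        node = node.children[example[node.attribute]]
    return node.label

# A toy tree: test "outlook"; sunny examples are classified "no", others "yes".
tree = Node(attribute="outlook",
            children={"sunny": Node(label="no"),
                      "overcast": Node(label="yes"),
                      "rain": Node(label="yes")})
print(classify(tree, {"outlook": "overcast"}))  # prints: yes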
The simpler the tree, the easier it is to use. In fact, it is more interesting to obtain a tree that is adapted to the probabilities of the variables to be tested; a well-balanced tree usually gives good results. If a sub-tree can only lead to a single conclusion, the whole sub-tree can be reduced to that conclusion; this simplifies the process without changing the final result. Ross Quinlan worked on this kind of decision tree.
II. INFORMATION THEORY
Shannon's information theory is the basis of the ID3 algorithm, and hence of C4.5. Shannon entropy is the best known and most widely applied measure. It first defines the amount of information provided by an event: the lower the probability of an event (the rarer it is), the greater the information it provides [2]. (In the following, all logarithms are base 2.)
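Formally (a standard formulation of this idea, not spelled out in the text above), the self-information of an event $e$ of probability $P(e)$ is $h(e) = -\log_2 P(e)$: an event of probability 1 carries no information, while a rare event carries a great deal.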
A. Shannon Entropy
In general, given a probability distribution $P = (p_1, p_2, \ldots, p_n)$ and a sample $S$, the information carried by this distribution, also called the entropy of $P$, is given by:

$$E(P) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$
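A minimal sketch of this computation in Python (illustrative, not taken from the paper):

import math

def entropy(probabilities):
    """Shannon entropy E(P) = -sum(p_i * log2(p_i)), in bits.
    Terms with p_i = 0 contribute nothing, since p*log(p) -> 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: maximal uncertainty for two classes
print(entropy([1.0]))       # 0.0: a pure sample carries no information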
B. The information gain G(p, T)
We have functions that measure the degree of class mixing for any sample, and therefore for any position of the tree under construction. It remains to define a function that selects the test which must label the current node.
The gain of a test $T$ at a position $p$ is defined as:

$$Gain(p, T) = E(p) - \sum_{j \in values(T)} \frac{|p_j|}{|p|}\, E(p_j)$$

where $values(T)$ is the set of all possible values for attribute $T$ and $p_j$ is the subset of examples at position $p$ for which $T$ takes the value $j$. We can use this measure to rank attributes and build the decision tree, where at each node is located the attribute with the highest information gain among the attributes not yet considered on the path from the root.
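A sketch of this computation in Python, reusing the entropy function above (the example representation, a list of dictionaries with a "class" key, is an assumption made for illustration):

from collections import Counter

def class_entropy(examples, target):
    """Entropy of the class distribution of a set of labelled examples."""
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return entropy([c / total for c in counts.values()])

def information_gain(examples, attribute, target):
    """Gain(p, T): entropy reduction from splitting `examples` on `attribute`."""
    before = class_entropy(examples, target)
    after = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        after += len(subset) / len(examples) * class_entropy(subset, target)
    return before - after

# The attribute chosen for the current node is the one maximizing the gain:
# best = max(candidate_attributes, key=lambda a: information_gain(data, a, "class"))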