RANDOM FORESTS
Leo Breiman
Statistics Department
University of California
Berkeley, CA 94720
September 1999
Abstract
Random forests are a combination of tree predictors
such that each tree depends on the values of a
random vector sampled independently and with the
same distribution for all trees in the forest. The
generalization error for forests converges a.s. to a
limit as the number of trees in the forest becomes
large. The generalization error of a forest of tree
classifiers depends on the strength of the individual
trees in the forest and the correlation between them.
Using a random selection of features to split each
node yields error rates that compare favorably to
Adaboost (Freund and Schapire [1996]), but are more
robust with respect to noise. Internal estimates
monitor error, strength, and correlation and these are
used to show the response to increasing the number
of features used in the splitting. Internal estimates
are also used to measure variable importance. These
ideas are also applicable to regression.

1. Random Forests
1.1 Introduction
Significant improvements in classification accuracy have resulted from
growing an ensemble of trees and letting them vote for the most popular
class. In order to grow these ensembles, often random vectors are generated
that govern the growth of each tree in the ensemble. An early example is
bagging (Breiman [1996]), where to grow each tree a random selection
(with replacement) is made from the examples in the training set.
Another example is random split selection (Dietterich [1998]) where at each
node the split is selected at random from among the K best splits. Breiman
[1999] generates new training sets by randomizing the outputs in the original
training set. Another approach is to select the training set from a random set
of weights on the examples in the training set. Ho [1998] has written a
number of papers on "the random subspace" method which does a random
selection of a subset of features to use to grow each tree.
In an important paper on written character recognition, Amit and Geman
[1997] define a large number of geometric features and search over a random
selection of these for the best split at each node. This latter paper has been
influential in my thinking.
The common element in all of these procedures is that for the kth tree, a random vector Θ_k is generated, independent of the past random vectors Θ_1, ..., Θ_{k−1} but with the same distribution; and a tree is grown using the training set and Θ_k, resulting in a classifier h(x,Θ_k) where x is an input vector. For instance, in bagging the random vector Θ is generated as the counts in N boxes resulting from N darts thrown at random at the boxes, where N is the number of examples in the training set. In random split selection Θ consists of a number of independent random integers between 1 and K. The nature and dimensionality of Θ depends on its use in tree construction.

After a large number of trees is generated, they vote for the most popular class. We call these procedures random forests.
Definition 1.1 A random forest is a classifier consisting of a collection of tree-structured classifiers {h(x,Θ_k), k=1, ...} where the {Θ_k} are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x.
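To make Definition 1.1 and the bagging example concrete, here is a minimal sketch (not Breiman's code) in which Θ_k is the vector of bootstrap counts from the darts-in-boxes construction described above, and the forest classifies by unit votes. The helper names and the use of scikit-learn's DecisionTreeClassifier are assumptions made for the illustration.

```python
# Minimal sketch of Definition 1.1: a forest of trees h(x, Theta_k), where each
# Theta_k is drawn i.i.d. (here: bootstrap counts, as in bagging) and the forest
# predicts by unit votes.  Assumes integer class labels 0, 1, 2, ...
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def grow_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    trees = []
    for _ in range(n_trees):
        # Theta_k: counts in N boxes from N darts thrown at random, i.e. a
        # multinomial draw, equivalent to resampling the training set with replacement.
        theta = rng.multinomial(N, np.full(N, 1.0 / N))
        trees.append(DecisionTreeClassifier().fit(X, y, sample_weight=theta))
    return trees

def forest_predict(trees, X):
    # Each tree casts a unit vote; the forest returns the most popular class.
    votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
```

Weighting each example by its bootstrap count is equivalent to fitting the tree on a resampled training set, so every tree is grown from its own Θ_k, drawn independently with the same distribution.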

1.2 Outline of Paper
Section 2 gives some theoretical background for random forests. Use of the
Strong Law of Large Numbers shows that they always converge so that
overfitting is not a problem. We give a simplified and extended version of
the Amit and Geman [1997] analysis to show that the accuracy of a random
forest depends on the strength of the individual tree classifiers and a measure
of the dependence between them (see Section 2 for definitions).
Section 3 introduces forests using the random selection of features at each
node to determine the split. An important question is how many features
to select at each node. For guidance, internal estimates of the generalization
error, classifier strength and dependence are computed. These are called out-
of-bag estimates and are reviewed in Section 4. Sections 5 and 6 give
empirical results for two different forms of random features. The first uses
random selection from the original inputs; the second uses random linear
combinations of inputs. The results compare favorably to Adaboost.
The results turn out to be insensitive to the number of features selected to
split each node. Usually, selecting one or two features gives near optimum
results. To explore this and relate it to strength and correlation, an empirical
study is carried out in Section 7.
Adaboost has no random elements and grows an ensemble of trees by
successive reweightings of the training set where the current weights depend
on the past history of the ensemble formation. But just as a deterministic
random number generator can give a good imitation of randomness, my
belief is that in its later stages Adaboost is emulating a random forest.
Evidence for this conjecture is given in Section 8.
Important recent problems, e.g., medical diagnosis and document retrieval,
often have the property that there are many input variables, often in the
hundreds or thousands, with each one containing only a small amount of
information. A single tree classifier will then have accuracy only slightly
better than a random choice of class. But combining trees grown using
random features can produce improved accuracy. In Section 9 we experiment
on a simulated data set with 1,000 input variables, 1,000 examples in the
training set and a 4,000 example test set. Accuracy comparable to the Bayes
rate is achieved.
In many applications, understanding of the mechanism of the random forest
"black box" is needed. Section 10 makes a start on this by computing internal
estimates of variable importance and binding these together by reuse runs.
Section 11 looks at random forests for regression. A bound for the mean
squared generalization error is derived that shows that the decrease in error
from the individual trees in the forest depends on the correlation between
residuals and the mean squared error of the individual trees. Empirical
results for regression are in Section 12. Concluding remarks are given in
Section 13.
2 Characterizing the Accuracy of Random Forests
2.1 Random Forests Converge
Given an ensemble of classifiers h_1(x), h_2(x), ..., h_K(x), and with the training set drawn at random from the distribution of the random vector Y,X, define the margin function as

mg(X,Y) = av_k I(h_k(X)=Y) − max_{j≠Y} av_k I(h_k(X)=j)

where I(·) is the indicator function. The margin measures the extent to which the average number of votes at X,Y for the right class exceeds the average vote for any other class. The larger the margin, the more confidence in the classification. The generalization error is given by

PE* = P_{X,Y}(mg(X,Y) < 0)

where the subscripts X,Y indicate that the probability is over the X,Y space.
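As a concrete reading of these definitions, the sketch below computes the empirical margin mg(x,y) and the corresponding plug-in estimate of PE* from the matrix of votes an ensemble casts on a test set. It assumes integer class labels 0, ..., J−1; the function names are illustrative and do not come from the paper.

```python
import numpy as np

def empirical_margin(votes, y):
    # votes: (n_trees, n_samples) integer class predictions; y: (n_samples,) true labels.
    n_trees, n_samples = votes.shape
    n_classes = int(max(votes.max(), y.max())) + 1
    # av_k I(h_k(X) = j): fraction of trees voting for each class at each point.
    vote_frac = np.zeros((n_samples, n_classes))
    for j in range(n_classes):
        vote_frac[:, j] = (votes == j).mean(axis=0)
    right = vote_frac[np.arange(n_samples), y]     # av_k I(h_k(X) = Y)
    vote_frac[np.arange(n_samples), y] = -np.inf   # exclude the true class
    wrong = vote_frac.max(axis=1)                  # max_{j != Y} av_k I(h_k(X) = j)
    return right - wrong                           # mg(X, Y)

def generalization_error_estimate(votes, y):
    # Plug-in estimate of PE* = P_{X,Y}(mg(X,Y) < 0) over the test set.
    return float((empirical_margin(votes, y) < 0).mean())
```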
In random forests, h_k(X) = h(X,Θ_k). For a large number of trees, it follows from the Strong Law of Large Numbers and the tree structure that:

Theorem 1.2 As the number of trees increases, for almost surely all sequences Θ_1, ... PE* converges to

P_{X,Y}( P_Θ(h(X,Θ)=Y) − max_{j≠Y} P_Θ(h(X,Θ)=j) < 0 )     (1)
Proof: see Appendix I.
This result explains why random forests do not overfit as more trees are
added, but produce a limiting value of the generalization error.
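Theorem 1.2 can also be checked empirically: as trees are added, test error should settle toward a limit rather than eventually rising. The sketch below assumes scikit-learn's RandomForestClassifier and a synthetic data set, and simply tracks test error as the forest grows; it illustrates the limiting behaviour and is not part of the proof.

```python
# Empirical check of the limiting behaviour in Theorem 1.2: test error as a
# function of the number of trees should level off, not turn back upward.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

forest = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
for n_trees in (1, 10, 50, 100, 250, 500):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_tr, y_tr)                 # warm_start keeps the earlier trees
    print(n_trees, 1.0 - forest.score(X_te, y_te))
```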
2.2 Strength and Correlation
For random forests, an upper bound can be derived for the generalization
error in terms of two parameters that are measures of how accurate the
individual classifiers are and of the dependence between them. The interplay
between these two gives the foundation for understanding the workings of
random forests. We build on the analysis in Amit and Geman [1997].

Definition 2.1 The margin function for a random forest is

mr(X,Y) = P_Θ(h(X,Θ)=Y) − max_{j≠Y} P_Θ(h(X,Θ)=j)     (2)

and the strength of the set of classifiers {h(x,Θ)} is

s = E_{X,Y} mr(X,Y)     (3)

Assuming s ≥ 0, Chebychev's inequality gives

PE* ≤ var(mr)/s²     (4)
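The step from (3) to (4) is Chebychev's inequality applied to mr; spelled out (a derivation sketch, not quoted from the paper):

```latex
% If mr(X,Y) < 0 and s = E[mr] >= 0, then |mr(X,Y) - s| >= s, hence
\[
  \mathrm{PE}^{*} \;=\; P_{X,Y}\bigl(mr(X,Y) < 0\bigr)
  \;\le\; P_{X,Y}\bigl(\lvert mr(X,Y) - s\rvert \ge s\bigr)
  \;\le\; \operatorname{var}(mr)/s^{2}.
\]
```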
A more revealing expression for the variance of mr is derived in the following: let

ĵ(X,Y) = argmax_{j≠Y} P_Θ(h(X,Θ)=j)

so

mr(X,Y) = P_Θ(h(X,Θ)=Y) − P_Θ(h(X,Θ)=ĵ(X,Y))
        = E_Θ[ I(h(X,Θ)=Y) − I(h(X,Θ)=ĵ(X,Y)) ].
Definition 2.2 The raw margin function is

rmg(Θ,X,Y) = I(h(X,Θ)=Y) − I(h(X,Θ)=ĵ(X,Y)).
Thus, mr(X,Y) is the expectation of rmg(Θ,X,Y) with respect to Θ. For any function f the identity

[E_Θ f(Θ)]² = E_{Θ,Θ'} f(Θ)f(Θ')

holds where Θ, Θ' are independent with the same distribution (the left side factors as E_Θ f(Θ) · E_{Θ'} f(Θ'), which equals the right side by independence), implying that

mr(X,Y)² = E_{Θ,Θ'} rmg(Θ,X,Y) rmg(Θ',X,Y)     (5)
Using (5) gives