training data is used. In addition, only a random subset of predictor variables is considered
for splitpoint selection at each node. The size of the random subset, called mtry, is the
single tuning parameter of the algorithm, even though results are typically nearly optimal
over a wide range of this parameter. The value of mtry can be fine-tuned on the out-of-bag
samples. For regression, the prediction of random forests for a new data point X = x is the
averaged response of all trees. For details see Breiman (2001). The algorithm is somewhat
related to boosting (Schapire et al., 1998), with trees as learners. Yet, with random forests,
each tree is grown using the original observations of the response variable, while boosting
tries to fit the residuals after taking into account the prediction of previously generated
trees (Friedman et al., 2000).
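
As a concrete illustration (a minimal sketch, not part of the original algorithm description), a random forest regressor can be fit and mtry tuned on the out-of-bag samples roughly as follows; the sketch assumes scikit-learn's RandomForestRegressor, in which the max_features argument plays the role of mtry, and uses toy data.

```python
# Sketch: fit a random forest regressor and tune mtry (max_features) on the
# out-of-bag samples; assumes scikit-learn's RandomForestRegressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 10))            # toy predictors, p = 10
y = X[:, 0] + 0.1 * rng.normal(size=500)   # toy response

best_mtry, best_oob, best_rf = None, -np.inf, None
for mtry in range(1, X.shape[1] + 1):
    rf = RandomForestRegressor(
        n_estimators=200,       # number of trees k
        max_features=mtry,      # size of the random predictor subset per split (mtry)
        oob_score=True,         # evaluate on the out-of-bag samples
        bootstrap=True,
        random_state=0,
    ).fit(X, y)
    if rf.oob_score_ > best_oob:
        best_mtry, best_oob, best_rf = mtry, rf.oob_score_, rf

# Prediction for a new data point x: the averaged response of all trees.
x_new = rng.uniform(size=(1, 10))
y_hat = best_rf.predict(x_new)
```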
Some Notation Following the notation of Breiman (2001), call $\theta$ the random parameter vector that determines how a tree is grown (e.g. which variables are considered for splitpoints at each node). The corresponding tree is denoted by $T(\theta)$. Let $B$ be the space in which $X$ lives, that is $X : \Omega \mapsto B \subseteq \mathbb{R}^p$, where $p \in \mathbb{N}^+$ is the dimensionality of the predictor variable. Every leaf $\ell = 1, \ldots, L$ of a tree corresponds to a rectangular subspace of $B$. Denote this rectangular subspace by $R_\ell \subseteq B$ for every leaf $\ell = 1, \ldots, L$. For every $x \in B$, there is one and only one leaf $\ell$ such that $x \in R_\ell$ (corresponding to the leaf that is obtained when dropping $x$ down the tree). Denote this leaf by $\ell(x, \theta)$ for tree $T(\theta)$.
The prediction of a single tree $T(\theta)$ for a new data point $X = x$ is obtained by averaging over the observed values in leaf $\ell(x, \theta)$. Let the weight vector $w_i(x, \theta)$ be given by a positive constant if observation $X_i$ is part of leaf $\ell(x, \theta)$ and 0 if it is not. The weights sum to one, and thus

$$ w_i(x, \theta) \;=\; \frac{\mathbf{1}_{\{X_i \in R_{\ell(x,\theta)}\}}}{\#\{j : X_j \in R_{\ell(x,\theta)}\}} . \qquad (4) $$
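
To make definition (4) concrete, the following sketch (an illustration only, not the paper's implementation) computes the weight vector of a single fitted tree; it assumes a scikit-learn tree whose apply method returns the index of the leaf containing each sample, and the hypothetical names X_train and x for the training predictors and the new point.

```python
def single_tree_weights(tree, X_train, x):
    """Weights w_i(x, theta) of equation (4) for a fitted scikit-learn tree:
    equal positive weight for the training points that fall into the leaf
    l(x, theta) containing x, zero otherwise; the weights sum to one."""
    train_leaves = tree.apply(X_train)            # leaf index of every X_i
    leaf_of_x = tree.apply(x.reshape(1, -1))[0]   # leaf index l(x, theta) of x
    in_leaf = (train_leaves == leaf_of_x)         # indicator 1{X_i in R_{l(x,theta)}}
    return in_leaf / in_leaf.sum()                # divided by #{j : X_j in R_{l(x,theta)}}
```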
The prediction of a single tree, given covariate $X = x$, is then the weighted average of the original observations $Y_i$, $i = 1, \ldots, n$,

$$ \text{single tree:} \qquad \hat{\mu}(x) \;=\; \sum_{i=1}^{n} w_i(x, \theta)\, Y_i . $$
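
Continuing the sketch above (with the hypothetical names X, y, x_new and single_tree_weights from the previous snippets), the single-tree prediction is this weighted sum of the original responses; for a tree fit directly on the original observations it coincides with the mean response in the leaf containing x.

```python
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state=0).fit(X, y)   # one tree T(theta)
w = single_tree_weights(tree, X, x_new[0])
mu_hat_tree = np.dot(w, y)    # sum_i w_i(x, theta) * Y_i
# Equals tree.predict(x_new): the mean response in leaf l(x, theta).
```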
Using random forests, the conditional mean $E(Y \mid X = x)$ is approximated by the averaged prediction of $k$ single trees, each constructed with an i.i.d. vector $\theta_t$, $t = 1, \ldots, k$. Let $w_i(x)$ be the average of $w_i(x, \theta_t)$ over this collection of trees,

$$ w_i(x) \;=\; k^{-1} \sum_{t=1}^{k} w_i(x, \theta_t) . \qquad (5) $$
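
The averaging in (5) can be sketched the same way; the snippet below (again only an illustration, continuing the hypothetical helpers above) assumes a fitted scikit-learn RandomForestRegressor, whose estimators_ attribute holds the individual trees.

```python
def forest_weights(rf, X_train, x):
    """Average the single-tree weights w_i(x, theta_t) over the k trees, eq. (5)."""
    k = len(rf.estimators_)
    w = np.zeros(X_train.shape[0])
    for t in rf.estimators_:                   # one fitted tree per theta_t
        w += single_tree_weights(t, X_train, x)
    return w / k
```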
The prediction of random forests is then

$$ \text{Random Forests:} \qquad \hat{\mu}(x) \;=\; \sum_{i=1}^{n} w_i(x)\, Y_i . $$
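
Putting the pieces together (still only a sketch built on the hypothetical helpers above), the random forest prediction is recovered as a weighted sum of the original observations.

```python
w_forest = forest_weights(best_rf, X, x_new[0])
mu_hat_forest = np.dot(w_forest, y)   # sum_i w_i(x) * Y_i
# Close to best_rf.predict(x_new); small differences arise because each
# scikit-learn tree averages the responses of its bootstrap sample rather
# than the original observations used in the formula above.
```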
The approximation of the conditional mean of $Y$, given $X = x$, is thus given by a weighted sum over all observations. The weights vary with the covariate $X = x$ and tend to be large for those $i \in \{1, \ldots, n\}$ where the conditional distribution of $Y$, given $X = X_i$, is similar to the conditional distribution of $Y$, given $X = x$ (Lin and Jeon, 2002).