the best attribute to branch on. Observe that this is a greedy choice and does not take
into account the effect of future decisions. As stated earlier, the tree-growing continues
until termination criteria such as purity of subdatasets are met. In the above example,
branching on the value “Overcast” for Outlook results in a pure dataset, that is, all
instances having this value for Outlook have the value “Yes” for the class variable
PlayGolf?; hence, the tree is not grown further in that direction. However, the other two
values for Outlook still induce impure datasets. Therefore the algorithm recurses, but
observe that Outlook cannot be chosen again (why?). For different branches, different
test criteria and splits are chosen, although, in general, duplication of subtrees can
occur for other datasets.
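To make the greedy recursion concrete, the following is a minimal sketch in Python of the tree-growing loop just described; the dict-based dataset layout and the helper names (entropy, gain, grow) are our own illustrative choices, not Quinlan's implementation:

from collections import Counter
import math

def entropy(labels):
    # Entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    # Information gain of splitting on a nominal attribute.
    n = len(labels)
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys)
                                 for ys in by_value.values())

def grow(rows, labels, attrs):
    # Greedy recursive growing: stop on a pure subdataset, or take a
    # majority vote when no attributes remain to branch on.
    if len(set(labels)) == 1:
        return labels[0]
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {best: {}}
    for v in set(r[best] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        node[best][v] = grow([r for r, _ in sub], [y for _, y in sub], remaining)
    return node

Note how remaining drops the attribute just branched on: every instance in a subdataset shares a single value for that attribute, so testing it again can yield no further gain.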
We mentioned earlier that the default splitting criterion is actually the gain ratio, not
the gain. To understand the difference, assume we treated the Day column in Figure 1.1
as if it were a “real” feature. Furthermore, assume that we treat it as a nominal-valued
attribute. Of course, each day is unique, so Day is really not a useful attribute to
branch on. Nevertheless, because there are 14 distinct values for Day and each of
them induces a “pure” dataset (a trivial dataset involving only one instance), Day
would be unfairly selected as the best attribute to branch on. Because information
gain favors attributes that contain a large number of values, Quinlan proposed the
gain ratio as a correction to account for this effect. The gain ratio for an attribute a is
defined as:
\[
\mathrm{GainRatio}(a) = \frac{\mathrm{Gain}(a)}{\mathrm{Entropy}(a)}
\]
Observe that Entropy(a) does not depend on the class information and simply takes
into account the distribution of possible values for attribute a, whereas Gain(a) does
take the class information into account. (Also, recall that all calculations here depend
on the dataset used, although we haven’t made this explicit in the notation.)
For instance, GainRatio(Outlook) = 0.246/1.577 = 0.156. Similarly, the gain ratio
for the other attributes can be calculated. We leave it as an exercise to the reader to
see if Outlook will again be chosen to form the root decision test.
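As a check on these figures, here is a short sketch, assuming the standard 14-instance PlayGolf data from Figure 1.1 (Outlook: 5 Sunny with 2 Yes/3 No, 4 Overcast with 4 Yes, 5 Rain with 3 Yes/2 No), that reproduces the values 0.246, 1.577, and 0.156:

import math

def entropy(counts):
    # Entropy of a discrete distribution given by counts.
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

class_entropy = entropy([9, 5])                    # 9 Yes / 5 No: ~0.940
splits = [(5, [2, 3]), (4, [4, 0]), (5, [3, 2])]   # Sunny, Overcast, Rain
weighted = sum(size / 14 * entropy(dist) for size, dist in splits)
gain_outlook = class_entropy - weighted            # ~0.246
split_entropy = entropy([5, 4, 5])                 # Entropy(Outlook): ~1.577
print(gain_outlook / split_entropy)                # ~0.156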
At this point in the discussion, it should be mentioned that decision trees cannot
model all decision boundaries between classes in a succinct manner. For instance,
although they can model any Boolean function, the resulting tree might be needlessly
complex. Consider, for instance, modeling an XOR over a large number of Boolean
attributes. In this case every attribute would need to be tested along every path and
the tree would be exponential in size. Another example of a difficult problem for
decision trees is the class of so-called “m-of-n” functions, where the class is positive
whenever at least m of the n attributes are, without specifying which attributes should contribute to
the decision. Solutions such as oblique decision trees, presented later, overcome such
drawbacks. Besides this difficulty, a second problem with decision trees induced by
C4.5 is the duplication of subtrees due to the greedy choice of attribute selection.
Short of an exhaustive search that evaluates each candidate attribute by fully growing
the tree, this problem is not solvable in general.
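The XOR blow-up is easy to observe empirically. The sketch below fits an exact tree to the full truth table of the n-bit XOR (parity) function and prints the node count; we use scikit-learn's CART-style learner as a stand-in for C4.5, an assumption made purely for convenience:

from itertools import product
from sklearn.tree import DecisionTreeClassifier

for n in range(2, 9):
    X = list(product([0, 1], repeat=n))   # full truth table of n bits
    y = [sum(x) % 2 for x in X]           # XOR (parity) of the bits
    clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    # Every attribute is tested along every path, so the node count
    # roughly doubles with each additional attribute.
    print(n, clf.tree_.node_count)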