ℓ1-SVM [10]. Other examples of the wrapper model could be any combination of a preferred
search strategy and a given classifier. Since the wrapper model depends on a given classifier,
cross-validation is usually required in the evaluation process. Wrapper methods are in general
more computationally expensive and biased toward the chosen classifier. Therefore, in real
applications the filter model is more popular, especially for problems with large datasets.
However, the wrapper model has been empirically shown to be superior to the filter model in
terms of classification accuracy. Due to these shortcomings of each model, the hybrid model
[13, 40] was proposed to bridge the gap between the filter and wrapper models. First, it
incorporates a statistical criterion, as the filter model does, to select several candidate feature
subsets with a given cardinality. Second, it chooses the subset with the highest classification
accuracy [40]. Thus, the hybrid model usually achieves both accuracy comparable to the
wrapper model and efficiency comparable to the filter model. Representative feature selection
algorithms of the hybrid model include BBHFS [13] and HGA [53].
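As a concrete illustration of this two-stage idea, the following minimal Python sketch ranks features with a filter criterion (mutual information) and then, wrapper-style, cross-validates the top-k candidate subsets with a simple classifier. The criterion, classifier, and function names are illustrative assumptions and do not reproduce BBHFS or HGA.

```python
# Hypothetical two-stage hybrid feature selection sketch (not a published algorithm).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def hybrid_select(X, y, max_cardinality=10, cv=5):
    # Stage 1 (filter): rank all features by a statistical criterion.
    scores = mutual_info_classif(X, y)
    ranking = np.argsort(scores)[::-1]          # best feature first

    # Stage 2 (wrapper): for each cardinality k, take the top-k candidate
    # subset and estimate its classification accuracy by cross-validation.
    best_subset, best_acc = None, -np.inf
    for k in range(1, max_cardinality + 1):
        subset = ranking[:k]
        acc = cross_val_score(KNeighborsClassifier(),
                              X[:, subset], y, cv=cv).mean()
        if acc > best_acc:
            best_subset, best_acc = subset, acc
    return best_subset, best_acc
```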
Finally, the embedded model performs feature selection during learning; in other words, it
achieves model fitting and feature selection simultaneously. Examples of the embedded model
include C4.5 [54], BlogReg [21], and SBMLR [21]. Based on the type of output, most feature
selection algorithms fall into one of three categories: subset selection [75], which returns a
subset of selected features identified by the indices of the features; feature weighting [59],
which returns a weight for each feature; and the hybrid of subset selection and feature
weighting, which returns a ranked subset of features.
Feature weighting, on the other hand, can be thought of as a generalization of feature selection
[70]. In feature selection, a feature is assigned a binary weight, where 1 means the feature is
selected and 0 otherwise. Feature weighting instead assigns each feature a value, usually in
the interval [0,1] or [-1,1]; the greater this value, the more salient the feature. Feature
weighting has been found to outperform feature selection in tasks where features vary in their
relevance [70], which is the case in most real-world problems. Feature weighting can also be
reduced to feature selection by setting a threshold and selecting features based on their
weights. Therefore, most of the feature selection algorithms mentioned in this chapter can be
viewed as using a feature weighting scheme.
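The reduction from feature weighting to feature selection mentioned above amounts to a simple thresholding step. The sketch below assumes the weights are already available (here a made-up example vector) and that the threshold value is an arbitrary choice.

```python
import numpy as np

# Example (made-up) feature weights in [0, 1]; in practice these would
# come from a feature weighting algorithm such as ReliefF.
weights = np.array([0.92, 0.05, 0.61, 0.13, 0.78])

# Feature selection as a special case of weighting: keep every feature
# whose weight exceeds a chosen threshold (0.5 here is arbitrary).
threshold = 0.5
selected = np.where(weights > threshold)[0]
print(selected)   # indices of the selected features -> [0 2 4]
```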
Typically, a feature selection method consists of four basic steps [40], namely, subset
generation, subset evaluation, stopping criterion, and result validation. In the first step, a
candidate feature subset is chosen based on a given search strategy and is then, in the second
step, evaluated according to a certain evaluation criterion. Once the stopping criterion is met,
the subset that best fits the evaluation criterion is chosen from all the candidates that have
been evaluated. In the final step, the chosen subset is validated using domain knowledge or a
validation set.
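The four steps can be arranged into a generic search loop. The skeleton below is an assumed illustration in which the search strategy, evaluation criterion, stopping rule, and validation step are passed in as callables; it is not any particular published algorithm.

```python
def select_features(generate, evaluate, should_stop, validate):
    """Generic four-step feature selection skeleton (illustrative only).

    generate(state)    -> next candidate subset   (subset generation)
    evaluate(subset)   -> score of a subset        (subset evaluation)
    should_stop(state) -> True when the search ends (stopping criterion)
    validate(subset)   -> check on held-out data    (result validation)
    """
    best_subset, best_score = None, float("-inf")
    state = {"iteration": 0}
    while not should_stop(state):                # step 3: stopping criterion
        candidate = generate(state)              # step 1: subset generation
        score = evaluate(candidate)              # step 2: subset evaluation
        if score > best_score:
            best_subset, best_score = candidate, score
        state["iteration"] += 1
    validate(best_subset)                        # step 4: result validation
    return best_subset
```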
0.1.3 Feature Selection for Clustering
The existence of irrelevant features in the data set may degrade learning quality and consume
memory and computational time that could be saved if these features were removed. From the
clustering point of view, removing irrelevant features does not negatively affect clustering
accuracy while it reduces the required storage and computational time. Figure 2 illustrates
this notion: (a) shows the relevant feature f1, which can discriminate the clusters, while
Figure 2(b) and (c) show that f2 and f3 cannot distinguish the clusters; hence, they do not
add any significant information to the clustering.
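To make this distinction concrete, the following small experiment (an assumed illustration, not the chapter's Figure 2) builds a data set in which only the first feature separates two clusters, and scores each feature by the silhouette of a univariate k-means clustering; the relevant feature obtains a clearly higher score than the noise features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
n = 200
# f1 separates two clusters; f2 and f3 are pure noise (irrelevant).
f1 = np.concatenate([rng.normal(-3, 1, n), rng.normal(3, 1, n)])
f2 = rng.normal(0, 1, 2 * n)
f3 = rng.uniform(-5, 5, 2 * n)
X = np.column_stack([f1, f2, f3])

# Score each feature by how well a 2-cluster k-means separates it alone.
for j in range(X.shape[1]):
    col = X[:, [j]]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(col)
    print(f"f{j + 1}: silhouette = {silhouette_score(col, labels):.2f}")
# f1 (the relevant feature) obtains the highest silhouette score.
```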
In addition, different relevant features may produce different clusterings. Figure 3(a) shows
four clusters obtained by utilizing knowledge from f1 and f2, while Figure 3(b) shows two
clusters if we use f1 only. Similarly, Figure 3(c) shows two different clusters if we use f2
instead. Therefore, different subsets of relevant features may result in different clusterings,
which greatly helps in discovering different hidden patterns in the data.
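The sketch below (an assumed illustration in the spirit of Figure 3, not a reproduction of it) clusters the same data three times, once on both features and once on each feature alone, to show that the choice of feature subset changes the clustering that is found.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Four groups laid out on a 2-by-2 grid in the (f1, f2) plane.
centers = [(-3, -3), (-3, 3), (3, -3), (3, 3)]
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])

# Clustering on {f1, f2} can recover four clusters ...
four = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
# ... while clustering on f1 alone or on f2 alone yields two clusters,
# and the two resulting partitions group the points differently.
two_f1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [0]])
two_f2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[:, [1]])
print(np.unique(four).size, np.unique(two_f1).size, np.unique(two_f2).size)
```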