Table 1
Main notation.
Symbol    Explanation
f         Random variable of a feature
D         A set of ordered pairs of features
G         A grid constructed from all the feature pairs
w(f_i)    The hidden variable in a factor graph
h         The serial number of the selected feature
I         A function that computes mutual information
MIC       Maximal information coefficient [30]
S         A function that computes similarity
M         The number of features
E         The energy function
be a set of ordered pairs of features. Furthermore, let the $i$- and $j$-values of $F_P$ be partitioned into $i$ and $j$ bins, respectively, and let a pair of partitions define an $i$-by-$j$ grid $G$. Given such a grid $G$, let $F_P|_G$ be the distribution induced by the points in $F_P$ on the cells of $G$; that is, the distribution on the cells of $G$ obtained by letting the probability mass in each cell be the fraction of points in $D$ falling in that cell. For a fixed $F_P$, different grids $G$ result in different distributions $F_P|_G$.
For a finite set $F_P \subset \mathbb{R}^2$ and positive integers $i, j$,

$$I^*(F_P, i, j) = \max I(F_P|_G),$$

where the maximum is taken over all grids $G$ with $i$ columns and $j$ rows, and $I(F_P|_G)$ denotes the mutual information of $F_P|_G$.
Then, the characteristic matrix and MIC of $F_P$ can be defined in terms of $I^*$. The characteristic matrix $M(F_P)$ of a set $F_P$ of two-variable data is an infinite matrix with entries

$$M(F_P)_{(i,j)} = \frac{I^*(F_P, i, j)}{\log \min\{i, j\}}.$$
The MIC of a set $F_P$ of two-variable data with sample size $M$ and a grid size less than $B(M)$ is given by

$$\mathrm{MIC}(F_P) = \max_{ij < B(M)} M(F_P)_{(i,j)},$$

where $\omega(1) < B(M) \leq O(M^{1-\varepsilon})$ for some $0 < \varepsilon < 1$. The MIC falls between 0 and 1 and is symmetric, and higher values imply greater relevance between features.
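To make the definition concrete, the following minimal sketch computes an approximate MIC from scratch: it searches equal-width $i$-by-$j$ grids (rather than the optimized partitions of [30], so it only lower-bounds $I^*$), normalizes each grid's mutual information by $\log \min\{i, j\}$, and takes the maximum subject to $ij < B(M)$. The choice $B(M) = M^{0.6}$ and the helper names are illustrative assumptions, not the authors' implementation.

```python
# Minimal, approximate MIC sketch (equal-width grids only; the exact
# MIC of [30] maximizes over all partitions, so this is a lower bound).
import numpy as np


def grid_mutual_information(x, y, i, j):
    """Mutual information I(F_P|_G) of the distribution induced on an
    i-by-j equal-width grid; each cell's mass is the fraction of points
    falling in that cell."""
    counts, _, _ = np.histogram2d(x, y, bins=(i, j))
    pxy = counts / counts.sum()             # joint distribution on cells
    px = pxy.sum(axis=1, keepdims=True)     # marginal of x, shape (i, 1)
    py = pxy.sum(axis=0, keepdims=True)     # marginal of y, shape (1, j)
    nz = pxy > 0                            # skip empty cells (log 0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())


def mic(x, y, B=None):
    """Max over grids with i*j < B(M) of I*(F_P, i, j) / log(min(i, j))."""
    m = len(x)
    B = B if B is not None else int(m ** 0.6)   # assumed B(M) = M^0.6
    best = 0.0
    for i in range(2, B + 1):
        for j in range(2, B + 1):
            if i * j >= B:                       # enforce ij < B(M)
                continue
            mi = grid_mutual_information(x, y, i, j)
            best = max(best, mi / np.log(min(i, j)))
    return best


rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(mic(x, x ** 2))                 # deterministic relation: score is high
print(mic(x, rng.normal(size=500)))   # independent noise: score near 0
```

Because only one equal-width grid is tried per $(i, j)$ pair, the scores are rougher than the reference algorithm's, but the normalization and the $[0, 1]$ range behave as in the definition above.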
Brown et al. [2] presented a unifying framework for feature selection based on mutual information, which formulates the feature selection task as a conditional likelihood problem. However, all the methods discussed in [2] use mutual information to measure the relevance between features. In the proposed algorithm, the MIC is used instead to measure the similarity between features: the MIC finds the $f_i$-by-$f_j$ grid with the highest induced mutual information, the mutual information scores are normalized, the normalized scores form a matrix, and the MIC is the highest score in that matrix. Although the MIC is computed from mutual information, it is the highest normalized mutual information and therefore strengthens the relationship between the two features. It better reflects the dependence between two attributes and, when used to evaluate the similarity of features, can better reduce the redundancy among features. However, owing to its higher complexity, the MIC requires more time than mutual information to evaluate the similarities among features.
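In practice, MIC need not be implemented by hand: the open-source minepy package provides the approximation algorithm of [30], and scikit-learn provides conventional mutual-information estimators for comparison. A brief sketch, assuming both packages are installed and using made-up data:

```python
import numpy as np
from minepy import MINE                              # MIC implementation
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
f_i = rng.normal(size=1000)
f_j = np.cos(3 * f_i) + 0.1 * rng.normal(size=1000)  # nonlinear relation

mine = MINE(alpha=0.6, c=15)          # default parameters from [30]
mine.compute_score(f_i, f_j)
print("MIC:", mine.mic())             # normalized to [0, 1], symmetric

mi = mutual_info_regression(f_i.reshape(-1, 1), f_j)[0]
print("MI :", mi)                     # unnormalized, in nats
```

As noted above, the extra grid search makes a MIC evaluation noticeably slower than a single mutual-information estimate, which matters when many feature pairs must be scored.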
For unsupervised feature selection, the best selected subset should contain the lowest number of features that retain as much of the original information as possible. Given a high-dimensional dataset $X = (x_1, \ldots, x_n)$ in the instance space and $F = (f_1, \ldots, f_M)$ in the feature space, where $x_i \in \mathbb{R}^M$ and $f_j \in \mathbb{R}^n$, let $K$ be the number of selected features, and let $\bar{F} = \{\bar{f}_1, \ldots, \bar{f}_K\}$ denote the selected feature subset.
In terms of maintaining the original information (e.g., maximum mutual information), the feature selection objective
function can be expressed as follows:
$$\bar{F} = \arg\max_{\bar{F}} \left( \sum_{j=1}^{M} \sum_{i=1}^{K} \mathrm{MIC}(\bar{f}_i, f_j) \right), \quad \text{s.t. } \bar{f}_i \neq f_j, \qquad (1)$$
where $K$ and $M$ are the numbers of selected and total features, respectively, and $\bar{F}$ is the subset of selected features. Eq. (1) maximizes the MIC between the selected feature subset and the whole feature set, which means that the selected subset preserves as much of the information in the full feature set as possible. In other words, maximizing the MIC can, to a certain extent, remove redundant features.
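As a concrete reading of Eq. (1), the sketch below scores a candidate subset by summing the MIC between every selected feature and every other feature in the full set, skipping the self-pairs excluded by the constraint. The column-per-feature matrix layout and the pairwise `mic` callable (e.g., either sketch above) are assumptions for illustration.

```python
def objective(F, selected_idx, mic):
    """Objective of Eq. (1): sum of MIC(f_bar_i, f_j) over the K selected
    features f_bar_i and all M features f_j, subject to f_bar_i != f_j.
    F is an (n, M) array whose columns are features; mic(a, b) is any
    pairwise MIC estimator."""
    total = 0.0
    for i in selected_idx:              # the K selected features
        for j in range(F.shape[1]):     # all M features
            if i == j:                  # constraint: f_bar_i != f_j
                continue
            total += mic(F[:, i], F[:, j])
    return total
```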
If an exhaustive search over the objective function in Eq. (1) is employed for $M$ features with $N$ dimensions and $K$ features are selected, then the computational complexity is $O(N \cdot M!/(K!(M-K)!))$. Because such a search is computationally prohibitive, an effective selection approach based on the factor graph model is presented in the next section.
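To see why exhaustive search is ruled out, note that it must score every one of the $M!/(K!(M-K)!)$ possible subsets. A hypothetical sketch (reusing the `objective` function above) and a quick count:

```python
import math
from itertools import combinations


def exhaustive_search(F, K, mic):
    """Brute-force optimizer for Eq. (1): evaluates all C(M, K) subsets,
    hence the O(N * M! / (K! (M-K)!)) complexity quoted above."""
    M = F.shape[1]
    best_subset, best_score = None, -math.inf
    for subset in combinations(range(M), K):
        score = objective(F, subset, mic)   # sketch for Eq. (1) above
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset


# The subset count alone is prohibitive for high-dimensional data:
print(math.comb(100, 10))   # 17,310,309,456,440 candidate subsets
```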