Computing Mutual Information of Big Categorical
Data and Its Application to Feature Grouping
Junli Li
School of Computer Science and Technology
Taiyuan University of Science and Technology
Taiyuan, China
1468375302@qq.com
Chaowei Zhang
Department of Computer Science and Software Engineering
Auburn University
Auburn, USA
czz0032@tigermail.auburn.edu
Jifu Zhang*
School of Computer Science and Technology
Taiyuan University of Science and Technology
Taiyuan, China
jifuzh@sina.com
Xiao Qin
Department of Computer Science and Software Engineering
Auburn University
Auburn, USA
xqin@auburn.edu
Abstract—This paper develops a parallel computing system - MiCS - for mutual information of big categorical data on the Spark computing platform. The MiCS algorithm handles the large volume and high repetitiveness of mutual-information calculations among feature pairs by applying a column-wise transformation scheme. To improve the efficiency of MiCS and the utilization of Spark cluster resources, we adopt a virtual partitioning scheme that achieves balanced load while mitigating the data-skewness problem in the Spark Shuffle process.
Index Terms—Parallel Mutual-information Computation, Fea-
ture Grouping, Data Skewness, Big Categorical Data, Spark.
I. INTRODUCTION
Mutual-information computation requires expensive calcu-
lations, which become a serious performance bottleneck in
processing big data. The overall objective of MiCS is to
alleviate the challenging performance problem in mutual-
information computation of big categorical data.
A. Observations
The inception of MiCS and its application is motivated by the following three observations.
• Parallel mutual-information computation of categorical
data is critical in optimizing the performance of big-data
applications.
• The Spark platform is an ideal model for coping with the rapid development in the big-data arena.
• Uneven data distribution in mutual-information computation can cause data skew, which leads to imbalanced workloads in Spark clusters.
Our proposed MiCS algorithm applies mutual information as a metric to quantify the similarities among categorical features in feature grouping [1]. To improve the performance of mutual-information computation, a growing number of parallel-computing strategies have been proposed [2] [3] [4]. Importantly, these parallel algorithms fall into the camps of GPU computing and multi-core computing. Our MiCS fully utilizes the Spark computing platform to compute mutual information, and a column-wise transformation method is used to reduce the large volume and high repetitiveness of mutual-information calculations between feature pairs. To tackle data skewness in Spark, novel algorithms and models [5] [6] [7] have been proposed in recent years. Our MiCS algorithm explores virtual partitioning, which ensures that excessively large partitions are eliminated.
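As one concrete illustration of how a skewed key can be spread over several smaller virtual partitions during a Spark shuffle, the following PySpark snippet sketches a key-salting approach; the DataFrame contents, column names, and salt factor are illustrative assumptions rather than the exact MiCS design.

```python
# Minimal PySpark sketch of key salting, one common way to realize
# virtual partitioning: a heavily loaded key is split into several
# virtual sub-keys so its records spread over multiple reduce tasks.
# Column names and SALT_FACTOR are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

SALT_FACTOR = 8  # number of virtual partitions per skewed key (assumed)

df = spark.createDataFrame(
    [("f1", 1)] * 1000 + [("f2", 1)] * 10,  # key "f1" is heavily skewed
    ["feature_pair", "count"],
)

# Append a random salt so records of the same hot key land in
# different shuffle partitions.
salted = df.withColumn(
    "salted_key",
    F.concat_ws("#", F.col("feature_pair"),
                (F.rand() * SALT_FACTOR).cast("int").cast("string")),
)

# First aggregate on the salted key (balanced shuffle), then strip the
# salt and aggregate again to obtain the final per-key result.
partial = salted.groupBy("salted_key").agg(F.sum("count").alias("partial_sum"))
final = (partial
         .withColumn("feature_pair", F.split("salted_key", "#").getItem(0))
         .groupBy("feature_pair")
         .agg(F.sum("partial_sum").alias("total")))

final.show()
```

The two-stage aggregation trades one extra (small) shuffle for a balanced first stage, which is the usual motivation for salting-style virtual partitioning.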
II. PARALLEL MUTUAL-INFORMATION COMPUTING
A. Mutual Information
Let DS be a big categorical dataset containing n data objects and m features. The mutual information $MI(y_i; y_j)$ of two features $y_i$ and $y_j$ is defined as

$$
\begin{aligned}
MI(y_i; y_j) &= H(y_i) + H(y_j) - H(y_i, y_j) \\
&= \sum_{k=1}^{d_i} \sum_{l=1}^{d_j} P_{ij}(y_i = v_{ik} \wedge y_j = v_{jl}) \times \log \frac{P_{ij}(y_i = v_{ik} \wedge y_j = v_{jl})}{P_i(y_i = v_{ik})\, P_j(y_j = v_{jl})},
\end{aligned}
\tag{1}
$$

where $P_{ij}(y_i = v_{ik} \wedge y_j = v_{jl})$ denotes the probability that features $y_i$ and $y_j$ take the values $v_{ik}$ and $v_{jl}$ (i.e., $y_i = v_{ik}$ and $y_j = v_{jl}$), respectively. $d_i$ and $d_j$ are the numbers of values that features $y_i$ and $y_j$ contain, respectively. $D(y_i)$ and $D(y_j)$ are the value ranges of features $y_i$ and $y_j$, with $D(y_i) = \{v_{i1}, \cdots, v_{id_i}\}$ and $D(y_j) = \{v_{j1}, \cdots, v_{jd_j}\}$. Computing the mutual information $MI(y_i; y_j)$ requires calculating the entropies and the joint entropy of features $y_i$ and $y_j$ (see the first line of (1)).
B. Column-wise Transformation
MiCS first splits the original dataset DS into several feature subsets to compute mutual information in parallel. Next, to address the problem of repeatedly computing mutual information across iterations, we use two variable-length arrays to