Computing Mutual Information of Big Categorical
Data and Its Application to Feature Grouping
Junli Li
School of Computer Science and Technology
Taiyuan University of Science and Technology
Taiyuan, China
1468375302@qq.com
Chaowei Zhang
Department of Computer Science and Software Engineering
Auburn University
Auburn, USA
czz0032@tigermail.auburn.edu
Jifu Zhang*
School of Computer Science and Technology
Taiyuan University of Science and Technology
Taiyuan, China
jifuzh@sina.com
Xiao Qin
Department of Computer Science and Software Engineering
Auburn University
Auburn, USA
xqin@auburn.edu
Abstract—This paper develops a parallel computing system - MiCS - for mutual information of big categorical data on the Spark computing platform. The MiCS algorithm handles the large volume and high repetitiveness of mutual-information calculations among feature pairs by applying a column-wise transformation scheme. To improve the efficiency of MiCS and the utilization of Spark cluster resources, we adopt a virtual partitioning scheme that achieves balanced load while mitigating the data-skewness problem in the Spark Shuffle process.
Index Terms—Parallel Mutual-information Computation, Fea-
ture Grouping, Data Skewness, Big Categorical Data, Spark.
I. INTRODUCTION
Mutual-information computation requires expensive calcu-
lations, which become a serious performance bottleneck in
processing big data. The overall objective of MiCS is to
alleviate the challenging performance problem in mutual-
information computation of big categorical data.
A. Observations
The inception of MiCS and its application is motivated by the following three observations.
• Parallel mutual-information computation of categorical
data is critical in optimizing the performance of big-data
applications.
• The Spark platform is an ideal model for coping with the rapid development in the big-data arena.
• Uneven data distribution in mutual-information computation can cause data skew, which leads to imbalanced workloads in Spark clusters.
Our proposed MiCS algorithm applies mutual information as a metric to quantify the similarities among categorical features in feature grouping [1]. To improve the performance of mutual-information computation, a growing number of parallel-computing strategies have been proposed [2] [3] [4]. Importantly, these parallel algorithms fall into the camps of GPU computing and multi-core computing. Our MiCS fully utilizes the Spark computing platform to compute mutual information, and a column-wise transformation method is used to reduce the large volume and high repetitiveness of mutual-information calculations between feature pairs. To tackle data skewness in Spark, novel algorithms and models [5] [6] [7] have been proposed in recent years. Our MiCS algorithm explores virtual partitioning, which ensures that excessively large partitions are eliminated.
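As one concrete illustration of how a skewed key can be spread over several smaller virtual partitions during a Spark shuffle, the following PySpark snippet sketches a key-salting approach; the DataFrame contents, column names, and salt factor are illustrative assumptions rather than the exact MiCS design.

```python
# Minimal PySpark sketch of key salting, one common way to realize
# virtual partitioning: a heavily loaded key is split into several
# virtual sub-keys so its records spread over multiple reduce tasks.
# Column names and SALT_FACTOR are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-sketch").getOrCreate()

SALT_FACTOR = 8  # number of virtual partitions per skewed key (assumed)

df = spark.createDataFrame(
    [("f1", 1)] * 1000 + [("f2", 1)] * 10,  # key "f1" is heavily skewed
    ["feature_pair", "count"],
)

# Append a random salt so records of the same hot key land in
# different shuffle partitions.
salted = df.withColumn(
    "salted_key",
    F.concat_ws("#", F.col("feature_pair"),
                (F.rand() * SALT_FACTOR).cast("int").cast("string")),
)

# First aggregate on the salted key (balanced shuffle), then strip the
# salt and aggregate again to obtain the final per-key result.
partial = salted.groupBy("salted_key").agg(F.sum("count").alias("partial_sum"))
final = (partial
         .withColumn("feature_pair", F.split("salted_key", "#").getItem(0))
         .groupBy("feature_pair")
         .agg(F.sum("partial_sum").alias("total")))

final.show()
```

The two-stage aggregation trades one extra (small) shuffle for a balanced first stage, which is the usual motivation for salting-style virtual partitioning.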
II. PARALLEL MUTUAL-INFORMATION COMPUTING
A. Mutual Information
Let DS be a big categorical dataset containing n data objects and m features. The mutual information $MI(y_i; y_j)$ of two features $y_i$ and $y_j$ is defined as

$$
\begin{aligned}
MI(y_i; y_j) &= H(y_i) + H(y_j) - H(y_i, y_j) \\
&= \sum_{k=1}^{d_i} \sum_{l=1}^{d_j} P_{ij}(y_i = v_{ik} \wedge y_j = v_{jl}) \times \log \frac{P_{ij}(y_i = v_{ik} \wedge y_j = v_{jl})}{P_i(y_i = v_{ik})\, P_j(y_j = v_{jl})},
\end{aligned}
\tag{1}
$$

where $P_{ij}(y_i = v_{ik} \wedge y_j = v_{jl})$ denotes the probability that features $y_i$ and $y_j$ take the values $v_{ik}$ and $v_{jl}$ (i.e., $y_i = v_{ik}$ and $y_j = v_{jl}$), respectively. $d_i$ and $d_j$ are the numbers of values that features $y_i$ and $y_j$ contain, respectively. $D(y_i)$ and $D(y_j)$ are the value ranges of features $y_i$ and $y_j$, with $D(y_i) = \{v_{i1}, \cdots, v_{id_i}\}$ and $D(y_j) = \{v_{j1}, \cdots, v_{jd_j}\}$. Computing the mutual information $MI(y_i; y_j)$ requires calculating the entropies and the joint entropy of features $y_i$ and $y_j$ (see the first line of (1)).
B. Column-wise Transformation
MiCS first splits the original dataset DS into several feature subsets to compute mutual information in parallel. Next, to address the problem of repeatedly computing mutual information across iterations, we use two variable-length arrays to