MICHAC：基于最大信息系数和层次聚类的特征选择在缺陷预测中的应用

需积分: 40 85 浏览量更新于2024-08-26 收藏 271KB PDF 举报

"MICHAC是一种基于最大信息系数和分层聚类的特征选择方法，用于软件缺陷预测。该方法旨在通过筛选和去除不相关及冗余特征，提高缺陷预测的准确性。MICHAC由两阶段组成：首先使用最大信息系数评估特征的相关性，然后通过分层聚类对特征进行聚类并选择代表性特征。通过在11个NASA项目和4个开源AEEEM项目上的实验，MICHAC展示出有效的特征选择能力，并在精度、召回率、F度量和AUC等指标上进行了评估。" MICHAC（Maximal Information Coefficient with Hierarchical Agglomerative Clustering for Defect Prediction）是针对软件缺陷预测问题提出的一种新方法。缺陷预测是软件工程中的关键任务，它依赖于从历史缺陷数据中提取的多种特征，如代码复杂性、变更频率等。然而，这些特征可能存在不相关或冗余的问题，影响预测模型的性能。 MICHAC的运作机制分为两个主要步骤。第一步，利用最大信息系数（Maximal Information Coefficient, MIC），这是一种衡量两个变量间关系强度的统计量，对所有候选特征进行排序。MIC能够检测非线性和复杂的关联，因此能有效地识别出与缺陷相关性强的特征，而将不相关的特征排除。第二步，MICHAC采用分层聚类（Hierarchical Agglomerative Clustering, HAC）对特征进行聚类。这种方法自底向上地组合相似的特征，形成一个层次结构。通过这种方式，可以识别出特征之间的冗余，每个聚类中选取一个代表性的特征，从而去除冗余，保留最能反映软件缺陷模式的特征集合。为了验证MICHAC的有效性，研究人员在11个NASA项目和4个开源AEEEM项目的数据集上进行了实验。他们使用了四种不同的分类器（未在摘要中具体指明），并评估了四种性能指标：精度、召回率、F1量度和曲线下面积（AUC）。通过对这些指标的分析，MICHAC与其他五种现有的特征选择方法进行了对比，结果表明MICHAC在特征选择方面表现出色，有助于提升软件缺陷预测的准确性和效率。总结来说，MICHAC是一个创新的特征选择框架，它结合了最大信息系数的关联分析能力和分层聚类的特征聚合能力，为软件缺陷预测提供了更高效、更准确的特征预处理方法。通过减少不相关和冗余特征，MICHAC可以优化模型性能，从而提高软件的可靠性评估。这对于软件开发团队而言，具有重要的实践意义，能够帮助他们在早期发现并修复潜在的软件缺陷，降低维护成本。

noiseless functions get perfect mutual information scores

[21]. Furthermore, the MIC value can be defined as

൫ܦ

௫ǡ௬

൯ൌ 

௫௬ழ஻ሺ௡ሻ

ሼܯሺܦ

௫ǡ௬

ሻሽ

(7)

where ܤ

ሺ

ሻ

is the upper bound of the grid size. In this paper,

we follow [21] to set ܤ

ሺ

ሻ

ൌ݊

଴Ǥ଺

as the default value. In

the context of defect prediction, given the class label ܻ, we

calculate the MIC value for each feature ܺ. We rank all orig-

inal features according to their MIC values and select a sub-

set of these features. This will be further explained in Section

III-B.

C. Hierarchical Agglomerative Clustering

We employ a Hierarchical Agglomerative Clustering

(HAC) algorithm to divide features into groups and thus to

reduce redundant features. In Section III-C, we will later

show how to group features via HAC based on the feature

values across software modules. Note that our goal is to

group features rather than modules and features are charac-

terized via their numeric values in software modules.

The clustering process of HAC is described below. First,

the algorithm treats each feature as a cluster and initializes

the distance of every two clusters. Then HAC merges the

nearest two clusters into a new cluster and calculates the

distance between the new cluster and other clusters. The

merging process repeats until a pre-defined criteria reaches

or all features belong to one group [33], [34]. HAC can form

a feature dendrogram of the resulting cluster hierarchy,

which serves as a valuable tool in visualization [35].

The distance between two features can be defined by the

similarity of these features, such as the cosine similarity and

the Pearson correlation coefficient. According to different

distance definitions, there are several kinds of commonly

used linkage methods for calculating the distance between

clusters, such as the single linkage method, the complete

linkage method, and the average linkage method. More de-

tailed description can be found in [36].

In this paper, we determine the final number of clusters

according to a statistic, called inconsistency coefficient dur-

ing clustering (in Section III-D). Based on this statistic, our

method avoids pointing out a specific number of clusters,

which was manually decided in existing work [29], [30].

III. O

ROPOSED

PPROACH

MICHAC

In this section, we first introduce the framework of our

proposed method; then we present the detailed steps in the

stages of feature ranking and feature clustering; finally, we

illustrate how to determine the number of clusters in the

stage of feature clustering.

A. Overview

We propose MICHAC, short for defect prediction via

Maximal Information Coefficient (MIC) with Hierarchical

Agglomerative Clustering (HAC). In MICHAC, we provide

a novel feature selection framework, which combines feature

ranking with feature clustering for defect prediction.

MICHAC selects an optimized subset of module features.

With the support of MICHAC, existing defect prediction

methods, such as Naïve Bayes, can benefit from a high-

quality training dataset that replaces the original one.

Fig1 illustrates the overall structure of MICHAC. This

structure consists of two major stages: feature ranking and

feature clustering. In the stage of feature ranking, we meas-

ure the relevance of features with respect to the class label

via a new feature ranking technique based on MIC (in Sec-

tion II-B); this stage filters out module features that have a

low correlation with the class label. In the stage of feature

clustering, we cluster features into groups based on HAC (in

Section II-C); this stage eliminates redundant features via

selecting one feature per cluster. As a result, we construct an

optimized subset of module features, which is used to replace

the original feature set in defect prediction.

B. Feature Ranking Stage Based on MIC

In the stage of feature ranking, we mainly conduct the

relevance analysis between each feature with respect to the

class label. That is, features that can distinguish whether a

module is defect-prone or not are selected for the next stage

(in Section III-C). We rank features independently, without

considering any learning algorithm.

The input of feature ranking is a set of defect data,

which can be used to build a predictive model in defect pre-

diction. As shown in Fig. 1, feature ranking consists of three

major steps. In Step n, we preprocess defect data, such as

removing features with only one value and non-numeric

features. In addition, we convert the class label of modules

into binary label. Specifically, we label modules with one or

more defects as 1, otherwise as 0. In Step o, we calculate

the MIC values between each feature ݔ and the class label ݕ

based on Equation 7 in Section II-B. In Step p, we sort all

features based on their MIC values in descending order and

Figure 1. Overview of our proposed approach, MICHAC.

oCompute MIC value for

each feature

pRank features and select

the top p features

qCluster features

rCalculate the increment of

inconsistency coefficient

and determine the cluster

number based on the

maximum increment

sSelect one feature from

each cluster

Feature Ranking

Feature Clustering

Initial feature subset

Raw data

Based on MIC

nPreprocess

Final feature subset

372

剩余11页未读，继续阅读

weixin_38659374

粉丝: 0
资源: 966

MICHAC：基于最大信息系数和层次聚类的特征选择在缺陷预测中的应用

galaxy:增量分层凝聚聚类（IHAC）的python实现

无监督学习：基于质心的聚类算法，即K-Means聚类，聚集聚类和基于密度的空间聚类

软件缺陷预测中基于聚类分析的特征选择方法_刘望舒1

复杂网络各节点聚类系数和网络平均聚类系数

聚类分析在软件缺陷预测特征选择中的应用

提升聚类性能：基于重建系数的子空间聚类融合算法

FECS：面向噪声软件故障预测的聚类特征选择方法

PCT算法：基于图节点属性的时间建模聚类预测

谱聚类详解：基于拉普拉斯矩阵的无向图聚类

地铁运营时段划分：基于客流特征的K-means聚类分析

最新资源