FECS: a Cluster based Feature Selection Method
for Software Fault Prediction with Noises
Wangshu Liu†, Shulong Liu†, Qing Gu†∗, Xiang Chen‡, Daoxu Chen†
†State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Email: liuws0707@gmail.com
‡School of Computer Science and Technology, Nantong University, Nantong, China
Email: xchencs@ntu.edu.cn
∗Corresponding author. Email: guq@nju.edu.cn
Abstract—Noise is inevitable when mining software archives for software fault prediction. Although some researchers have investigated the noise tolerance of existing feature selection methods, few studies focus on proposing new feature selection methods with a certain degree of noise tolerance. To address this issue, we propose a novel method FECS (FEature Clustering with Selection strategies). The method consists of two phases: a feature clustering phase and a feature selection phase with three different heuristic search strategies. In our empirical studies, we choose real-world software projects, such as Eclipse and NASA, and inject class-level and feature-level noise simultaneously to simulate noisy datasets. Using classical feature selection methods as baselines, we confirm the effectiveness of FECS, and we provide a guideline for using FECS by analyzing the effects of varying either the percentage of selected features or the noise rate.
Keywords—Software Quality Assurance, Software Fault Prediction, Feature Selection, Classification Model, Noise Tolerance
I. INTRODUCTION
Constructing an effective software fault prediction (SFP) model depends on high-quality datasets mined from software archives, such as software configuration management systems and bug tracking systems. After extracting software modules, researchers have designed different code or process metrics (i.e., features) to measure these modules [1]. However, irrelevant or redundant features can reduce the accuracy of the fault prediction model. Previous studies have shown that feature selection can improve the performance of models in SFP [2]–[5]. In previous work [5], we proposed a novel feature selection method FECAR, which can effectively eliminate both redundant and irrelevant features. However, noise is inevitable when mining software archives [6], [7]. Although some researchers have investigated the noise tolerance of existing feature selection methods [8], to the best of our knowledge, few researchers have proposed robust feature selection methods with a certain degree of noise tolerance.
Based on our previous work [5], we propose a robust method, FECS (FEature Clustering with Selection strategies), to resist the inevitable noise in software datasets. FECS consists of two phases: a feature clustering phase that groups strongly correlated features, and a feature selection phase
that selects beneficial features. The main extension over our previous work [5] lies in the feature selection phase. In particular, we design three different heuristic search strategies to select the most appropriate feature from each cluster. To investigate the noise tolerance of FECS, we choose real-world software projects, including Eclipse and NASA, as our experimental subjects. We perform a set of data preprocessing steps to ensure the datasets are noise free. Then we inject class-level and feature-level noise simultaneously to simulate noisy datasets. By comparing FECS with classical methods, such as IG, CFS, and Consist, on both noise-free and noisy datasets, we show the competitiveness of our approach.
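To make the two-phase structure concrete, the following is a minimal sketch of a cluster-then-select scheme: correlation-based hierarchical clustering followed by picking, from each cluster, the feature most relevant to the class label (measured here by mutual information). Both the clustering criterion and the single relevance heuristic are illustrative assumptions; they do not reproduce the exact measures or the three search strategies used by FECS.

# Illustrative sketch of a two-phase cluster-then-select scheme in the
# spirit of FECS; the clustering criterion and the selection heuristic
# below are assumptions, not the exact strategies defined by FECS.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.feature_selection import mutual_info_classif

def cluster_then_select(X, y, n_clusters):
    # Phase 1: group strongly correlated features via hierarchical
    # clustering on a correlation-based distance (1 - |Pearson r|).
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dist = 1.0 - corr
    condensed = dist[np.triu_indices_from(dist, k=1)]
    labels = fcluster(linkage(condensed, method="average"),
                      t=n_clusters, criterion="maxclust")

    # Phase 2: from each cluster keep the single feature most relevant
    # to the class label (mutual information is one possible heuristic).
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        selected.append(members[np.argmax(relevance[members])])
    return sorted(selected)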
The main contributions of this paper can be summarized as follows:
• We propose a novel feature selection method FECS with
a certain noise tolerance for SFP.
• We perform thorough empirical studies based on real
software projects to verify the robustness of the method
FECS on both noise free and noisy datasets and provide
a guideline of using our method.
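The noisy datasets used in our studies are obtained by injecting class-level and feature-level noise into cleaned data. The sketch below shows one straightforward injection scheme; the noise rates, the label-flipping mechanism, and the uniform perturbation of feature values are assumptions for illustration, not necessarily the exact procedure used in our experiments.

# Minimal sketch of injecting class-level and feature-level noise into a
# clean dataset (assumed scheme: random label flips and random feature
# perturbations at a given noise rate).
import numpy as np

def inject_noise(X, y, class_rate=0.1, feature_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X_noisy, y_noisy = X.copy(), y.copy()

    # Class-level noise: flip the labels of a random subset of modules
    # (assumes binary 0/1 labels).
    n_flip = int(class_rate * len(y_noisy))
    flip_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]

    # Feature-level noise: replace a random subset of feature values with
    # values drawn uniformly from that feature's observed range.
    n_rows, n_cols = X_noisy.shape
    n_cells = int(feature_rate * n_rows * n_cols)
    rows = rng.integers(0, n_rows, size=n_cells)
    cols = rng.integers(0, n_cols, size=n_cells)
    lo, hi = X_noisy.min(axis=0), X_noisy.max(axis=0)
    X_noisy[rows, cols] = rng.uniform(lo[cols], hi[cols])
    return X_noisy, y_noisy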
II. RELATED WORK
Software fault prediction is currently an active research topic [9] in software engineering data mining. By mining software archives, researchers can extract modules and assign each of them a class label (faulty or non-faulty). They then use different code metrics or process metrics [1] to measure these modules. Finally, they use the constructed datasets to build a fault prediction model. Based on this model, new modules can be categorized into two classes: fault-prone (FP) or non-fault-prone (NFP).
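As a purely illustrative example of this workflow, the sketch below trains a classifier on module metrics and fault labels and then categorizes a new module as FP or NFP; the metric values, labels, and the choice of logistic regression are assumptions, not data or settings from this paper.

# Illustrative sketch: build a fault prediction model from module metrics
# and fault labels, then classify a new module as fault-prone (FP) or
# non-fault-prone (NFP). Logistic regression is an arbitrary choice; any
# standard classifier could be substituted.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: rows are modules, columns are code/process
# metrics; labels are 1 for faulty and 0 for non-faulty.
X_train = np.array([[120, 4, 0.3], [300, 9, 0.7], [80, 2, 0.1], [450, 12, 0.9]])
y_train = np.array([0, 1, 0, 1])

model = LogisticRegression().fit(X_train, y_train)

# Classify a new module: 1 -> fault-prone (FP), 0 -> non-fault-prone (NFP).
X_new = np.array([[200, 6, 0.5]])
print("FP" if model.predict(X_new)[0] == 1 else "NFP")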
Feature selection is used to identify and remove irrelevant and redundant features, thereby mitigating the curse of dimensionality in such datasets. Previous research shows the usefulness of feature selection in SFP [2]–[5], [10]. Meanwhile, noise is inevitable when mining software archives. For example, the process of linking issue reports with code changes may generate false negative noise [6], and mislabeled issue reports can generate false positive noise [7]. Kim et al. investigated the noise tolerance of existing fault prediction methods by manually injecting noise [11]. Wald et al. made a comparison between