Dictionary Learning Based Software Defect Prediction
Xiao-Yuan Jing 1,2*, Shi Ying 1, Zhi-Wu Zhang 1,2, Shan-Shan Wu 1,2, Jin Liu 1
1 State Key Laboratory of Software Engineering, School of Computer, Wuhan University, Wuhan, China
2 College of Automation, Nanjing University of Posts and Telecommunications, Nanjing, China
* Corresponding author: jingxy_2000@126.com
ABSTRACT
Software defect prediction aims to improve the quality of a
software system by automatically identifying defective software
modules so that testing effort can be spent efficiently.
Classification methods based on static code attributes have
attracted a great deal of attention for this task, and in recent
years machine learning techniques have been applied to defect
prediction. Because different software modules are similar to one
another, a software module can be approximately represented by a
small proportion of other modules, and its representation
coefficients over a pre-defined dictionary consisting of
historical software module data are generally sparse.
In this paper, we propose to use the dictionary learning technique
to predict software defects. Using the characteristics of the
metrics mined from open source software, we learn multiple
dictionaries (the defective module and defective-free module
sub-dictionaries and the total dictionary) together with the
sparse representation coefficients. Moreover, we take the
misclassification cost issue into account, because misclassifying
a defective module generally incurs a much higher risk cost than
misclassifying a defective-free one. We thus propose
a cost-sensitive discriminative dictionary learning (CDDL)
approach for software defect classification and prediction. The
widely used datasets from NASA projects are employed as test
data to evaluate the performance of all compared methods.
Experimental results show that CDDL outperforms several
representative state-of-the-art defect prediction methods.
Categories and Subject Descriptors
D.2.9 [Management]: Software quality assurance (SQA), G.1.3
[Numerical Linear Algebra]: Sparse, structured, and very large
systems (direct and iterative methods), I.5.2 [Design
Methodology]: Classifier design and evaluation.
General Terms
Algorithms
Keywords
Software defect prediction, dictionary learning, sparse
representation, cost-sensitive discriminative dictionary learning
(CDDL).
1. INTRODUCTION
Software defect prediction is one of the most important
research topics in software engineering [1-2,57,59] and is an
efficient means of relieving the burden of software code inspection
and testing. By helping to detect and correct the greatest number
of defects in software, it enables an organization’s limited
resources to be allocated reasonably. Defect prediction techniques
can be generally categorized into two types: static and dynamic.
Static defect prediction mainly refers to predicting the number or
the distribution of defects based on defect-related metrics.
Dynamic defect prediction predicts the distribution of system
defects over time by using the times at which defects were
generated. Static prediction has been widely used because it can
predict the defect proneness of new software modules from
historical defect data and so improve the quality of software
[3-4]. The key to static defect prediction is how to fully
analyze and utilize the existing historical data, and then build
more precise and effective binary classifiers of software modules.
In recent years, many popular classification methods, such as
support vector machine (SVM) [5-7], decision tree [8-11], neural
networks [12-13], Naïve Bayes [14-17], and cost-sensitive
learning methods [18-22], have been employed to achieve this
goal. However, in the field of software defect prediction, these
classification methods often encounter difficulties, for
example the class-imbalance problem [23-25] and the
misclassification cost issue [18]. The class-imbalance problem
means that a software system contains far fewer defective
modules than defective-free modules, which negatively influences
the decisions of classifiers [26-29]. The misclassification cost
issue arises because the two kinds of error are not equally
expensive. Classifying a software module as defect-prone implies
that more testers should be invested in the verification
activities, thus adding to the development cost. Misclassifying a
defective module as defective-free carries the risk of system
failure, which is also associated with cost implications [58].
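The interaction between class imbalance and unequal misclassification costs can be illustrated with a minimal sketch. The code below is not the CDDL method; it only compares two hypothetical predictors on toy labels. The function names and the cost values (1 for a false alarm, 5 for a missed defect) are assumptions chosen for illustration and do not come from this paper.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), where 1 = defective and 0 = defective-free."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1  # defective-free module flagged as defective
        elif t == 1 and p == 0:
            fn += 1  # defective module missed
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def total_cost(y_true, y_pred, cost_fp=1.0, cost_fn=5.0):
    """Total misclassification cost; the default costs are hypothetical."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return cost_fp * fp + cost_fn * fn


# Toy imbalanced data: 2 defective modules out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_clean = [0] * 10                       # predicts every module defective-free
cautious = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]     # finds both defects, raises 2 false alarms

for name, y_pred in [("always_clean", always_clean), ("cautious", cautious)]:
    print(name, accuracy(y_true, y_pred), total_cost(y_true, y_pred))
```

Under these assumed costs, both predictors reach the same accuracy on the toy data, yet the one that misses the defective modules incurs a much higher total cost. This is why accuracy alone can be misleading on imbalanced defect data.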
Sparse representation, a recently developed technique, has aroused
much interest among researchers due to its effectiveness and
robustness. The idea of sparse representation is that information
of a signal can be efficiently represented or coded by a linear