A Novel Approach for Predicting the Probability of
Inconsistent Changes to Code Clones Based LDA
Lili Yin, Liping Zhang, Min Hou, Dongsheng Liu
Computer and Information Engineering College, Inner Mongolia Normal University, Hohhot, China
yinliligood@126.com
Abstract - Inconsistent changes to code clones can create faults
and, hence, lead to incorrect program behavior. Consequently, these
clones increase the change effort when software is maintained. To
improve software quality and to help programmers notice the hidden
risk of inconsistent clone changes in advance, this paper, unlike
previous research, predicts the probability of inconsistent changes
to code clones based on LDA (Latent Dirichlet Allocation). This
extends the application fields of the LDA model. An experiment on a
large open source software system is presented. Experimental
results show the feasibility of this technique.
Index Terms - Predicting Probability, Inconsistent Changes,
LDA
1. Introduction
Research in software maintenance has shown that many
programs contain a significant amount of duplicated (cloned)
code. Such cloned code is considered harmful for two
reasons: (1) multiple, possibly unnecessary, duplicates of code
increase maintenance costs and, (2) inconsistent changes to
cloned code can create faults and, hence, lead to incorrect
program behavior [1-2].
To shed light on the situation, we investigated the effects
of code cloning on program correctness. It is important to
understand that clones do not directly cause faults but
inconsistent changes to clones can lead to unexpected
program behavior. A particularly dangerous type of change to
cloned code is the inconsistent bug fix. If a fault was found in
cloned code but not fixed in all clone instances, the system is
likely to still exhibit the incorrect behavior. To illustrate this,
Fig. 1 shows an example, where a missing null-check was
retrofitted in only one clone instance [3].
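As a hypothetical illustration of this kind of inconsistent bug fix
(not the code from Fig. 1), the following Python sketch shows two
clone instances of the same logic. The function names and the data
are assumptions made for this example only. The check for a missing
value was added to the first instance but never propagated to the
second.

```python
# Hypothetical clone pair; names and logic are illustrative only
# and are not taken from the paper or from Fig. 1.

def total_length_a(items):
    # Clone instance A: the None check was added by the bug fix.
    if items is None:
        return 0
    return sum(len(item) for item in items)


def total_length_b(items):
    # Clone instance B: identical logic, but the fix was not applied
    # here, so a None argument still raises TypeError.
    return sum(len(item) for item in items)
```

Both functions agree on valid input, so the difference only shows up
when `None` is passed: instance A returns 0, while instance B still
fails.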
Previous studies focused on detecting inconsistent changes to
code clones. Researchers counted how many code clones were
changed inconsistently across multiple versions of software.
They also showed that inconsistent changes to clones are
harmful to developers and maintainers. Therefore, it is
important to predict the probability of inconsistent changes
to code clones, in order to improve software quality and to
help programmers notice the hidden risk of inconsistent clone
changes in advance.
Natural Science Foundation of Inner Mongolia (No.2011MS0906)
The National Natural Science Foundation of China (No.61363017)
In this paper, we map clone groups across multiple versions
to obtain the evolution of each clone group. Source code files
that contain code clones are then extracted according to the
evolution history of each clone group. Next, the LDA model is
used to identify the topics of these source code files and to
judge the stability of file functions. Finally, we predict the
probability of inconsistent changes to code clones.
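The text at this point does not specify how the stability of file
functions is measured. The Python sketch below shows one possible
reading, under stated assumptions: each version of a file already has
a topic distribution produced by some LDA implementation, and
stability is taken as the mean cosine similarity between the
distributions of consecutive versions. The function names and the
choice of cosine similarity are assumptions of this sketch, not the
authors' method.

```python
import math


def cosine_similarity(p, q):
    # Cosine similarity between two topic distributions of equal length.
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def topic_stability(distributions):
    # Assumed stability measure: mean cosine similarity between the
    # topic distributions of consecutive versions of one file.
    # A file with fewer than two versions is treated as fully stable.
    if len(distributions) < 2:
        return 1.0
    sims = [cosine_similarity(a, b)
            for a, b in zip(distributions, distributions[1:])]
    return sum(sims) / len(sims)
```

Under this assumed measure, a file whose dominant topic stays the same
across versions scores close to 1, and a file whose topics shift
between versions scores lower.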
Contributions of this paper:
(1) Existing studies consider only a single feature of code
clones. In this paper, code clones are placed in their
context, which enriches their feature information.
(2) The LDA model is applied to the field of code clones for
the first time. In addition, we quantify the evolution
information of code clones, predict the probability of
inconsistent changes to code clones, and provide data for
our follow-up work on predicting the harmfulness of code clones.
2. Related Work
Hindle et al. apply the Link model to commit log
messages in order to see what topics are being worked on by
developers at any given time [4]. The authors apply the Link
model (based on LDA) to a collection of commit logs over a
period of 30 days, then link topics from successive periods
using an 8-out-of-10 top-term similarity measure (i.e., if at
least 8 of the 10 top words for a topic at period i are shared by
a topic at period i+1, then the topics are considered the
same). The authors find LDA to be useful in identifying
activity trends and present several visualization techniques to
understand the results.
Linstead et al. use the Hall model (based on LDA) to
analyze source code evolution, claiming that LDA provides
better results than LSI [5]. The authors present line plots of
topic assignment percentages over time for two systems,
Eclipse and ArgoUML. These plots reveal integration points
and other changes that shape a project’s lifetime. We build on
this work by formalizing the approach, considering additional
topic metrics to better understand topic change events, and
providing a detailed, manual analysis of the topic change
events to validate and characterize the results of the approach.
As the above shows, current research pays little attention to
predicting the probability of inconsistent changes to code
clones.
International Conference on Computer, Communications and Information Technology (CCIT 2014)
© 2014. The authors - Published by Atlantis Press