A New Method for Predicting the Probability of Inconsistent Changes to Code Clones Using the LDA Model


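This paper presents a novel LDA-based approach for predicting the likelihood of inconsistent changes to code clones. The authors, Lili Yin, Liping Zhang, Min Hou, and Dongsheng Liu of the College of Computer and Information Engineering at Inner Mongolia Normal University, observe that during software maintenance, inconsistent changes to code clones can cause incorrect program behavior, make maintenance harder, and degrade software quality. Rather than following the path of earlier work, they apply the topic model LDA (Latent Dirichlet Allocation) to predict the probability of inconsistent changes to code clones.

LDA is a widely used topic modeling technique that discovers latent topics in documents and quantifies how each topic is distributed within a document. Here the researchers extend LDA to the software development setting, analyzing the textual features of code clones to identify patterns that may lead to inconsistent changes. Their experiment uses a large open source software system, and the results indicate that the approach is effective and feasible.

The introduction notes that prior software maintenance research has found large amounts of duplicated code, and that code clones are a common phenomenon in software development. If cloned code is poorly managed, however, inconsistent changes can introduce hidden problems and raise maintenance costs. Predicting that likelihood therefore matters for improving software quality and for finding and fixing potential problems early.

To this end, the authors design a pipeline that first builds a textual representation of the code clones and then uses the LDA model to mine topic patterns from it. By analyzing the distribution of those topics and how they change, they build a prediction model that estimates the probability that newly submitted code fragments will undergo inconsistent changes. The approach helps developers watch for such problems and can also serve as part of a software maintenance strategy, helping teams allocate resources and reduce errors in maintenance work.

The experimental results indicate that the LDA-based prediction method performs well in practice and can identify code regions likely to undergo inconsistent changes, supporting more effective maintenance decisions. The study addresses a gap in software maintenance research and offers other researchers new ideas and tools for work on software quality assurance.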

A Novel Approach for Predicting the Probability of Inconsistent Changes to Code Clones Based on LDA

Lili Yin, Liping Zhang, Min Hou, Dongsheng Liu

College of Computer and Information Engineering, Inner Mongolia Normal University, Hohhot, China

yinliligood@126.com

Abstract - Inconsistent changes to code clones can create faults and, hence, lead to incorrect program behavior. Consequently, these clones increase the change effort when software is maintained. To improve software quality and to alert programmers in advance to the hidden risk of inconsistent clone changes, this paper, unlike previous research, predicts the probability of inconsistent changes to clones based on LDA. This extends the application of the LDA (Latent Dirichlet Allocation) model to a new field. An experiment on a large open source software system is presented, and the results show the feasibility of the technique.

Index Terms - Predicting Probability, Inconsistent Changes, LDA

1. Introduction



Research in software maintenance has shown that many programs contain a significant amount of duplicated (cloned) code. Such cloned code is considered harmful for two reasons: (1) multiple, possibly unnecessary, duplicates of code increase maintenance costs, and (2) inconsistent changes to cloned code can create faults and, hence, lead to incorrect program behavior [1-2].

To shed light on the situation, we investigated the effects of code cloning on program correctness. It is important to understand that clones do not directly cause faults, but inconsistent changes to clones can lead to unexpected program behavior. A particularly dangerous type of change to cloned code is the inconsistent bug fix. If a fault is found in cloned code but not fixed in all clone instances, the system is likely to still exhibit the incorrect behavior. To illustrate this, Fig. 1 shows an example where a missing null-check was retrofitted in only one clone instance [3].
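A hypothetical illustration of such an inconsistent bug fix, written in Python with invented function and field names (it is not the paper's Fig. 1), might look like this:

```python
def format_owner(record):
    # Clone instance A: the missing None check was retrofitted here.
    if record is None:
        return "<unknown>"
    return record["owner"].strip().title()


def format_reviewer(record):
    # Clone instance B: copied from the same logic, but the fix was
    # never propagated, so a None record still fails at runtime.
    return record["reviewer"].strip().title()


if __name__ == "__main__":
    print(format_owner(None))        # handled by the retrofitted check
    try:
        format_reviewer(None)        # the unfixed clone still faults
    except TypeError as exc:
        print("unfixed clone failed:", exc)
```

Both functions behave the same on valid input, so the fault in the unfixed clone stays hidden until a missing record reaches it.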

Previous studies focused on detecting inconsistent changes to code clones. Researchers counted how many code clones changed inconsistently across multiple versions of a software system, and showed that inconsistent changes to clones are harmful to developers and maintainers. It is therefore important to predict the probability of inconsistent changes to code clones, in order to improve software quality and to alert programmers in advance to the hidden risk of inconsistent clone changes.

In this paper, we map clone groups across multiple versions to obtain the evolution of each clone group. The source code files that contain code clones are then extracted according to the evolution history of each clone group. Next, the LDA model is used to identify the topics of these source code files and to judge the stability of the files' functions. Finally, we predict the probability of inconsistent changes to code clones.

Natural Science Foundation of Inner Mongolia (No.2011MS0906)

The National Natural Science Foundation of China (No.61363017)

Contributions of this paper:

• Unlike existing studies, which consider only a single feature of code clones, this paper places code clones in their context and enriches their feature information.

• The LDA model is applied to the field of code clones for the first time. We also quantify the evolution information of code clones, predict the probability of inconsistent changes to code clones, and provide data for our follow-up work on predicting the harmfulness of code clones.
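The topic-identification step outlined in this introduction can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the excerpt does not specify the tokenization, the LDA inference method, the number of topics, or how file stability is measured. Here LDA is fitted with a small collapsed Gibbs sampler, and `tokenize`, `lda_gibbs`, `topic_shift`, and the toy file versions are all hypothetical names and data introduced for illustration.

```python
import random
import re


def tokenize(source):
    """Split source text into lowercase word tokens (a simplifying assumption)."""
    return [t.lower() for t in re.findall(r"[A-Za-z_]+", source)]


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]               # counts n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts n(k, w)
    topic_total = [0] * num_topics                              # counts n(k)
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    # Resample each token's topic from its conditional distribution.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    # Smoothed estimate of each document's topic distribution.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta


def topic_shift(p, q):
    """Total variation distance between two topic distributions.

    Hypothetical stand-in for a stability measure; the excerpt gives no formula.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Toy data: three successive versions of one file that contains a clone.
versions = [
    "def parse(token): check(token); return parse_node(token)",
    "def parse(token): check(token); log(token); return parse_node(token)",
    "def render(widget): paint(widget); layout(widget); draw(widget)",
]
docs = [tokenize(v) for v in versions]
theta = lda_gibbs(docs, num_topics=2)
shifts = [topic_shift(theta[i], theta[i + 1]) for i in range(len(theta) - 1)]

if __name__ == "__main__":
    for version, dist in enumerate(theta):
        print(version, [round(x, 3) for x in dist])
    print("shift between successive versions:", [round(s, 3) for s in shifts])
```

Fitting all versions in one model keeps the topics comparable across versions, so a small shift between successive versions could be read as a stable file function. How such a score would translate into a probability of inconsistent change is not described in this excerpt.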

2. Related Work

Hindle et al. apply the Link model to commit log messages in order to see what topics developers are working on at any given time [4]. The authors apply the Link model (based on LDA) to a collection of commit logs over a period of 30 days, then link topics from successive periods using an 8-out-of-10 top-term similarity measure (i.e., if at least 8 of the 10 top words for a topic at period i are shared by a topic at period i + 1, then the topics are considered the same). The authors find LDA to be useful in identifying activity trends and present several visualization techniques to understand the results.
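The 8-out-of-10 linking rule described above can be sketched in a few lines. The function names and the sample top-term lists are hypothetical and chosen only for illustration; they are not taken from Hindle et al.

```python
def same_topic(top_terms_a, top_terms_b, min_shared=8):
    """Return True if two topics share at least min_shared of their top terms."""
    return len(set(top_terms_a) & set(top_terms_b)) >= min_shared


def link_topics(period_i, period_next, min_shared=8):
    """Pair each topic index in period i with matching topic indices in period i + 1."""
    return [
        (a, b)
        for a, terms_a in enumerate(period_i)
        for b, terms_b in enumerate(period_next)
        if same_topic(terms_a, terms_b, min_shared)
    ]


# Invented top-10 term lists for two successive periods.
period_i = [
    ["bug", "fix", "null", "check", "parser", "token", "error", "test", "crash", "patch"],
    ["gui", "button", "layout", "menu", "icon", "window", "dialog", "theme", "font", "color"],
]
period_next = [
    ["gui", "button", "layout", "menu", "icon", "window", "dialog", "theme", "render", "paint"],
    ["bug", "fix", "null", "check", "parser", "token", "error", "test", "merge", "branch"],
    ["build", "make", "config", "script", "release", "version", "package", "deploy", "docs", "readme"],
]

links = link_topics(period_i, period_next)

if __name__ == "__main__":
    print(links)  # [(0, 1), (1, 0)]
```

Each linked pair shares exactly eight top terms, so it just meets the threshold; the third topic in the later period has no match and would be treated as a new topic.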

Linstead et al. use the Hall model (based on LDA) to analyze source code evolution, claiming that LDA provides better results than LSI [5]. The authors present line plots of topic assignment percentages over time for two systems, Eclipse and ArgoUML. These plots reveal integration points and other changes that shape a project’s lifetime. We build on this work by formalizing the approach, considering additional topic metrics to better understand topic change events, and providing a detailed, manual analysis of the topic change events to validate and characterize the results of the approach.

As the work above shows, current research pays little attention to predicting the probability of inconsistent changes to code clones.

International Conference on Computer, Communications and Information Technology (CCIT 2014)
