IEICE TRANS. INF. & SYST., VOL.E102–D, NO.9 SEPTEMBER 2019
LETTER
TFIDF-FL: Localizing Faults Using Term Frequency-Inverse
Document Frequency and Deep Learning
Zhuo ZHANG†, Nonmember, Yan LEI††a), Member, Jianjun XU†b), Xiaoguang MAO†, and Xi CHANG†, Nonmembers
SUMMARY Existing fault localization techniques based on neural networks utilize the information of whether a statement is executed or not to identify suspicious statements potentially responsible for a failure. However, this information shows only the binary execution state of a statement and cannot show how important a statement is in an execution; consequently, it may degrade fault localization effectiveness. To address this issue, this paper proposes TFIDF-FL, which uses term frequency-inverse document frequency to identify a high or low degree of influence of a statement in an execution. Our empirical results on 8 real-world programs show that TFIDF-FL significantly improves fault localization effectiveness.
key words: debugging, fault localization, term frequency, inverse docu-
ment frequency, deep learning
1. Introduction
In the process of software development, debugging usually
requires much manual involvement of debugging engineers.
Researchers have developed many fault localization tech-
niques to reduce the cost of debugging [1]. In recent years,
deep learning has witnessed rapid development and shown its promising ability to provide tremendous improvements in robustness and accuracy [2].
Thus, some researchers have preliminarily used deep
neural networks with multiple hidden layers to discuss and
evaluate the potential of deep learning in fault localiza-
tion [3], [4]. They found that with the capability of esti-
mating complicated functions by learning a deep nonlinear
network’s structure and attaining distributed representation
of input data, deep neural networks exhibit strong learning
ability from sample data sets. However, the existing analysis is still preliminary and needs much further study. For example, it utilizes a matrix as the training samples, in which the value of each element is either 1, meaning a statement is executed, or 0, denoting a statement is not executed. We can observe that this binary information shows only whether a statement is executed or not; it cannot show the degree of influence of a statement in an execution. The existing analysis also uses small-sized programs (i.e., hundreds of lines of code) with all seeded faults. The
Manuscript received November 14, 2018.
Manuscript revised March 19, 2019.
Manuscript publicized May 27, 2019.
†The authors are with College of Computer, National University of Defense Technology, Changsha 410073, China.
††The author is with School of Big Data & Software Engineering, Chongqing University, Chongqing 400044, China.
a) E-mail: yanlei@cqu.edu.cn (Corresponding author)
b) E-mail: jianjun.xu@yeah.net (Corresponding author)
DOI: 10.1587/transinf.2018EDL8237
recent research [5] has revealed that small-sized programs with artificial faults are not useful for predicting which fault localization techniques perform best on real faults. Furthermore, previous research [6] has shown that there are unique features in test cases related to faults, e.g., the execution frequency of each statement. However, current approaches use this feature of each statement in just one test case and do not consider these features from the view of all test cases. Consequently, this may cause bias, posing a negative effect on fault localization effectiveness [7].
Therefore, this paper explores deep learning further for improving fault localization; i.e., we aim at obtaining more insights by proposing an approach that identifies the impact of each statement using features drawn from all test cases, rather than a binary status, and by evaluating our results on large-scale programs. Specifically, we propose TFIDF-FL: an effective fault localization approach using term frequency-inverse document frequency (TF-IDF) [8] to reflect how important a statement is in the executions of a test suite. TFIDF-FL abstracts a statement as a word and uses TF-IDF to construct a matrix as the training samples, which reflects how important a word (i.e., a statement) is in the executions of a test suite. Then, it uses the architecture of Multi-Layer Perceptrons (MLPs) to learn a model from the training samples. Finally, TFIDF-FL evaluates the suspiciousness of each statement of being faulty by testing the trained model on a virtual test set.
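As a minimal illustrative sketch (not the authors' implementation; the variable names and toy coverage counts below are invented for illustration), the matrix construction can be viewed as treating each test execution as a document and each statement as a word, weighting per-statement execution counts by TF-IDF instead of the binary 0/1 states used in prior work:

```python
import math

# Hypothetical coverage data: coverage[i][j] = number of times statement j
# is executed by test case i (execution counts, not just the binary 0/1
# states used by existing neural-network-based fault localization).
coverage = [
    [2, 0, 1, 3],   # test case 1
    [0, 1, 1, 0],   # test case 2
    [4, 0, 2, 1],   # test case 3
]

num_tests = len(coverage)
num_stmts = len(coverage[0])

# Document frequency: in how many test executions each statement appears.
df = [sum(1 for row in coverage if row[j] > 0) for j in range(num_stmts)]

# TF-IDF matrix used as training samples: TF is the statement's share of
# executions within one test run; IDF down-weights statements executed by
# nearly every test (idf = log(N / df)).
tfidf = []
for row in coverage:
    total = sum(row) or 1
    tfidf.append([
        (cnt / total) * math.log(num_tests / df[j]) if df[j] else 0.0
        for j, cnt in enumerate(row)
    ])
```

Under this weighting, a statement executed by every test (such as statement 2 above) receives weight 0 in every row, while statements concentrated in a few executions receive higher weights, capturing the degree of influence that a binary matrix cannot.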
We designed and performed an empirical study on 8 large real-world programs. The results show that TFIDF-FL significantly improves fault localization effectiveness.
2. Approach
2.1 Overview
In information retrieval, TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection. It is one of the most popular term-weighting schemes and is often used in information retrieval searches, text mining, and user modeling [8]. TF-IDF is the product of two statistics: TF, the term frequency, and IDF, the inverse document frequency. The term frequency is the number of times a word occurs in a document, while the inverse document frequency measures whether a word is common or rare across all documents. The term frequency of a word is low if it occurs few times in a document,
Copyright © 2019 The Institute of Electronics, Information and Communication Engineers