An Empirical Study on TensorFlow Program Bugs ISSTA’18, July 16–21, 2018, Amsterdam, Netherlands
(Lines 2-14). Second, a session object is created to launch the con-
structed computation graph and build a neural network. The execu-
tion phase can be further divided into two sub-phases: training and
testing. In the training phase (Lines 16-21), a set of labeled samples
is used to train the neural network by minimizing the cross-entropy
loss of the model. A gradient descent algorithm is often deployed
to carry out the minimization, and the network is trained for
numerous iterations. After the model is trained, in the testing phase,
it can be applied to classify samples in a dataset (Line 22).
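The two-phase structure described above can be sketched as follows. This is a minimal illustrative example, not the paper's actual listing: it uses the TF 1.x-style API via tf.compat.v1, and the dataset sizes, learning rate, and iteration count are made up.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # TF 1.x-style graph/session API
tf.disable_eager_execution()

# Construction phase: define the computation graph.
x  = tf.placeholder(tf.float32, [None, 4])   # toy features (sizes illustrative)
y_ = tf.placeholder(tf.float32, [None, 3])   # one-hot labels
W  = tf.Variable(tf.zeros([4, 3]))
b  = tf.Variable(tf.zeros([3]))
logits = tf.matmul(x, W) + b
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

# Synthetic labeled samples (stand-in for a real dataset).
rng = np.random.RandomState(0)
xs = rng.rand(30, 4).astype(np.float32)
ys = np.eye(3, dtype=np.float32)[rng.randint(0, 3, 30)]

# Execution phase: launch the graph in a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):                      # training sub-phase
        sess.run(train_step, feed_dict={x: xs, y_: ys})
    # Testing sub-phase: apply the trained model to classify samples.
    preds = sess.run(tf.argmax(logits, 1), feed_dict={x: xs})
```

Note that the graph definitions above do not compute anything by themselves; computation happens only when `sess.run` is invoked, which is exactly the separation of the construction and execution phases.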
3 RESEARCH QUESTIONS
Our study aims to answer the following three research questions.
• RQ1: What are the symptoms and root causes of the bugs?
• RQ2: What new challenges exist in detecting the bugs, and how do TF users handle them?
• RQ3: What new challenges exist in localizing the bugs, and how do TF users handle them?
The first research question concerns the characteristics of the bugs.
The symptoms help us understand the consequences of the bugs
and are useful in designing detection methods. The root causes help
us understand the nature of the bugs, and the connections between
root causes and symptoms are useful in designing fault localization
methods. The second and third research questions concern the
new challenges imposed by the paradigm shift from traditional
programs to TF programs, with an emphasis on fault detection and
localization. When answering these questions about challenges, we
are also concerned with the solutions currently used by TF users.
Understanding these solutions helps the development of new fault
detection and localization techniques.
4 DATA COLLECTION
We collected TensorFlow bugs from two sources: StackOverflow
pages and GitHub commits. StackOverflow pages contain bugs that
might be difficult to debug: at least the TF user could not resolve
the bug quickly and had to ask a question for assistance. On the
other hand, GitHub commits contain bugs that might be difficult to
detect: at least the TF user did not discover the bug in the first place
and committed it into the project. Putting the two sources together,
we have a dataset of interest: bugs that cause problems to TF
users and are worth studying.
To collect bugs from StackOverflow pages, we used the search term
“tensorflow answers:1 -how -install -build” in StackOverflow’s search
engine. The parameter “answers:1” ensures that only questions with
at least one answer were considered, and the other parameters “-how
-install -build” filter out discussions about installing and building
TensorFlow, which are not our concern. We then manually reviewed
the top 500 questions returned by StackOverflow
and found 87 questions related to TensorFlow application bugs.
Please note that StackOverflow may contain both novices’ and
experts’ posts, and we believe both are important and should be
included in the study. The statistics of the QA pages can be found
in Table 1.
To collect bugs from GitHub commits, we searched for projects
with the keyword “tensorflow” in GitHub’s search engine. Among the
search results, we selected for further examination 11 well-maintained
target projects with the highest numbers of commits and stars.
The statistics of these projects are shown in Table 2. For each
project, we considered the commits between its start date and end
date when collecting bugs. We then searched the commit messages
in each project with the keywords “bug, fix, wrong, error, nan, inf,
issue, fault, fail, crash”. In addition, we filtered out “typo” commits
and merged pull requests to eliminate irrelevant and duplicate
commits. We manually inspected the source code, commit messages,
pull request messages, and issue messages to identify coding bugs.
As a result, we found 82 commits on GitHub that contain 88
TensorFlow application bugs. For each commit, we read the commit
and pull request messages to see if there were any associated issues,
and took the discussion threads of those issues into consideration.
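The keyword search and typo filtering described above can be sketched as a simple heuristic over commit messages. This is an illustrative reconstruction, not the authors' actual tooling, and the sample messages are made up:

```python
# Keywords from the study's commit-message search.
BUG_KEYWORDS = ("bug", "fix", "wrong", "error", "nan", "inf",
                "issue", "fault", "fail", "crash")

def is_candidate(commit_message: str) -> bool:
    """Flag a commit whose message suggests a bug fix, excluding typo fixes.

    A plain substring match is deliberately crude (e.g. "inf" also
    matches "info"), which is why the study follows it with manual
    inspection of the flagged commits.
    """
    msg = commit_message.lower()
    if "typo" in msg:          # filtered out as irrelevant
        return False
    return any(kw in msg for kw in BUG_KEYWORDS)

print(is_candidate("Fix NaN loss in training loop"))  # True
print(is_candidate("Fix typo in README"))             # False
print(is_candidate("Refactor data pipeline"))         # False
```

Commits passing this filter would then still be manually inspected, as described above, to confirm they contain actual coding bugs.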
The subjects were collected between July 2017 and May 2018.
We calculated the time from when an issue was posted until it
was resolved, for both GitHub issues and StackOverflow QA pages.
For GitHub issues, the mean is 27,845 minutes and the median is
5,122 minutes; for StackOverflow QA pages, the mean is 33,312
minutes and the median is 177 minutes. Whenever manual inspection
was involved, two authors performed the inspection separately and
discussed inconsistent issues until agreement. During this process,
one StackOverflow bug and eight GitHub bugs identified by one
author were removed after the discussion.
Putting everything together, we obtained a dataset³ of 175 bugs, including 87 col-
lected from StackOverflow and 88 collected from GitHub. The scale
of our dataset is similar to that of other existing studies that require
manual inspection; e.g., Jin et al. conducted a study of performance
bugs and inspected 109 performance bugs [19], and Nasehi et al. con-
ducted a study on what makes a good code example and analyzed
163 StackOverflow QA pages [26].
5 RQ1: SYMPTOMS AND ROOT CAUSES
5.1 Information Sources for Analysis
To answer the first research question, we analyzed each bug in
our dataset to identify its root causes and symptoms. For GitHub
bugs, the root causes can be identified from the changes made in
the commits. We identified the symptoms of the bugs by reading the
commit messages, pull request messages, and the associated issues.
For StackOverflow bugs, we learned the root causes of the bugs by
reading the answers that provide a solution, and identified their
symptoms from the question descriptions. In addition, we also tried
to reproduce the bugs to further understand their symptoms. We
were able to reproduce 75 out of 88 GitHub bugs and 76 out of 87
StackOverflow bugs. The rest of the bugs were not reproducible be-
cause of dead links, missing datasets, or the requirement of specific
hardware. We summarized the common root causes and symptoms
of the collected bugs into major categories and classified each bug
accordingly. Two authors performed the classification separately;
no disagreement was found on the StackOverflow bugs, and five
GitHub bugs were classified differently.
5.2 Results
The statistics of the symptoms (rows) and root causes (columns)
that we found from our analysis are given in Table 3. We identified

³ Our dataset is available at https://github.com/ForeverZyh/TensorFlow-Program-Bugs.