Building a Large-scale Software Programming
Taxonomy from Stackoverflow
Jiangang Zhu
School of Software
Shanghai Jiao Tong University
jszjgtws@sjtu.edu.cn
Beijun Shen*
School of Software
Shanghai Jiao Tong University
bjshen@sjtu.edu.cn
Xuyang Cai
School of Software
Shanghai Jiao Tong University
bakercxy@hotmail.com
Haofen Wang*
East China University of
Science & Technology
whfcarter@ecust.edu.cn
Abstract—Taxonomy is becoming indispensable to a growing
number of applications in software engineering such as software
repository mining and defect prediction. However, the existing
related taxonomies are always manually constructed. The sizes of
these taxonomies are small and their depths are limited. In order
to show the full potential of taxonomies in software engineering
applications, in this paper, we present the first large-scale software
programming taxonomy, which is more comprehensive than any
existing one. It contains 38,205 concepts and 68,098 subsumption
relations. Instead of learning from an open domain, we focus on
taxonomy construction from Stackoverflow which is one of the
largest QA websites about software programming. We propose a
machine learning based method with novel features to create a
taxonomy that captures the hierarchical semantic structure of tags
in Stackoverflow. This method executes iteratively to find as many
relations as possible. Experimental results show that our approach
achieves much better accuracy than the baselines. Compared with
the software programming taxonomies extracted from
general-purpose taxonomies such as WikiTaxonomy,
Yago Taxonomy, and Schema.org, our taxonomy has the widest
coverage of concepts, contains the largest number of subsumption
relations, and has the deepest semantic hierarchy.
Keywords—Taxonomy Construction, Stackoverflow, Software
Engineering
I. INTRODUCTION
Taxonomy plays an important role in software engineer-
ing. For example, in software maintenance such as measuring
quality and predicting defects, taxonomies are used to measure
the relatedness between documents and create links between
bugs and committed changes [1]. In program comprehension,
taxonomies provide an effective way to compute the semantic
similarities between words from the comments and identifiers
in software [2].
However, most existing taxonomies used in these applications
are manually created according to application-specific
requirements, and they are not large enough. A recent
study [3] argued that improving the quality and the scale of a
taxonomy would significantly benefit performance when it is
applied in software engineering. On the other hand, there has been
a considerable amount of research on taxonomy construction
[4], [5], [6], [7], [8]. The value of automatic taxonomy
construction is twofold. Automatic taxonomy construction
can produce large-scale taxonomies, while manual construction
is a laborious process. Moreover, compared with automatic
approaches that are data-driven, taxonomies built manually are
*Corresponding author
Fig. 1: An example from Stackoverflow
often highly subjective. Unfortunately, the taxonomies produced
by these existing automatic approaches would probably lead to
poor results when applied in software engineering, for several
reasons. First is timeliness. Techniques in software
engineering change quickly, while general web pages and
encyclopedic sites are insensitive to this change and often
fail to update in time. It is therefore unsuitable to use text
corpora such as general web pages or Wikipedia as input.
Second is granularity. Since the input of traditional taxonomy
construction approaches comes from an open domain, some
fine-grained terms about software programming cannot be found in
these taxonomies. For example, “hashmap”, a well-known data
structure, is not included in either Yago Taxonomy [9] or
WikiTaxonomy [4], which are the largest existing publicly
available taxonomies.
Recently, Stackoverflow has become one of the largest
QA websites about software programming. Specifically, ques-
tions are the key elements in Stackoverflow. Besides the ques-
tion description, as shown in Fig. 1, a question is also associated
with tags and authors. Formally, a question $q$ is a triple of
the form $(t_q, b_q, TS_q)$, where $t_q$ is the title of the question,
$b_q$ is the body, and $TS_q$ is the tag set that annotates the
question. A tag $a$ in Stackoverflow can be represented as a
DOI reference number: 10.18293/SEKE2015-135