Stackoverflow上的大规模软件编程分类构建

40 浏览量更新于2024-08-29 收藏 637KB PDF 举报

"这篇研究论文探讨了如何利用Stackoverflow数据构建大规模的软件编程分类法，旨在为软件工程领域的各种应用提供更全面、更深入的知识框架。现有的编程分类法大多依赖手动构建，规模有限且深度不足。作者提出了一种新的方法，通过分析Stackoverflow这个大型编程问答网站的数据，构建了一个包含38,205个概念和68,098条子类关系的大型分类系统，展示了在软件工程领域中，自动化构建分类法的潜力和价值。" 正文：在软件工程中，分类法（Taxonomy）起着至关重要的作用，它能够帮助我们更好地理解和组织大量的编程知识，从而提升软件开发的效率和质量。例如，在软件仓库挖掘和缺陷预测等应用中，一个完善的分类体系可以作为有效的工具，协助研究人员发现模式、趋势和潜在问题。然而，目前的编程分类体系通常由专家手动创建，这导致了它们的规模相对较小，覆盖的概念有限，无法充分反映编程知识的复杂性和多样性。本文提出的创新之处在于，研究者选择Stackoverflow作为数据来源，构建了一个大规模的软件编程分类法。Stackoverflow是全球最大的编程问答社区之一，拥有海量的编程问题、答案和标签数据，这些数据反映了程序员在实际工作中遇到的问题和解决方法，具有很高的实用性和实时性。通过分析和学习这些数据，研究者可以自动地提取出编程概念和它们之间的层级关系，形成一个既广泛又深入的分类系统。构建这个分类法的过程可能包括以下几个步骤：首先，对Stackoverflow的标签数据进行预处理，去除噪声和不相关的标签；其次，使用自然语言处理技术（如词性标注、命名实体识别等）识别出编程概念；接着，通过统计分析和机器学习方法，识别出概念间的上下位关系（即子类关系）；最后，通过验证和调整，确保生成的分类结构准确、合理。这个包含38,205个概念和68,098条子类关系的分类法，不仅提供了丰富的编程知识结构，还可以作为一个基础框架，用于支持软件工程领域的其他研究，如代码推荐、代码搜索、开发者技能评估等。同时，由于它是基于真实的编程问答数据，因此更能反映实际编程实践中的需求和挑战。该研究论文揭示了利用大数据和自动化方法构建编程分类法的可能性，为软件工程的研究和实践提供了新的思路。通过这种方法，未来可以进一步扩展和细化分类体系，以适应不断发展的编程技术和社区知识。这种大规模的分类法不仅有助于提升软件开发的效率，还能够促进编程知识的传承和学习，对于整个软件工程领域的发展具有积极的推动作用。

Building a Large-scale Software Programming

Taxonomy from Stackoverﬂow

Jiangang Zhu

School of Software

Shanghai Jiao Tong University

jszjgtws@sjtu.edu.cn

Beijun Shen*

School of Software

Shanghai Jiao Tong University

bjshen@sjtu.edu.cn

Xuyang Cai

School of Software

Shanghai Jiao Tong University

bakercxy@hotmail.com

Haofen Wang*

East China University of

Science & Technology

whfcarter@ecust.edu.cn

Abstract—Taxonomy is becoming indispensable to a growing

number of applications in software engineering such as software

repository mining and defect prediction. However, the existing

related taxonomies are always manually constructed. The sizes of

these taxonomies are small and their depths are limited. In order

to show the full potential of taxonomies in software engineering

applications, in this paper, we present the ﬁrst large-scale software

programming taxonomy which is more comprehensive than any

existing ones. It contains 38,205 concepts and 68,098 subsumption

relations. Instead of learning from a open domain, we focus on

taxonomy construction from Stackoverﬂow which is one of the

largest QA websites about software programming. We propose a

machine learning based method with novel features to create a

taxonomy that captures the hierarchical semantic structure of tags

in Stackoverﬂow. This method executes iteratively to ﬁnd as many

relations as possible. Experimental results show that our approach

achieves much better accuracy than baselines. Compared with

taxonomies related to software programming which are extracted

from the general-purpose taxonomies such as WikiTaxonomy,

Yago Taxonomy and Schema.org, our taxonomy has the widest

coverage of concepts, contains the largest number of subsumption

relations, and runs up to the deepest semantic hierarchy.

Keywords—Taxonomy Construction, Stackoverﬂow, Software

Engineering

I. INTRODUCTION

Taxonomy plays an important role in software engineer-

ing. For example, in software maintenance such as measuring

quality and predicting defects, taxonomies are used to measure

the relatedness between documents and create links between

bugs and committed changes [1]. In program comprehension,

taxonomies provide an effective way to compute the semantic

similarities between words from the comments and identiﬁers

in software [2].

However, most existing taxonomies used in these applica-

tions are often manually created according to application spe-

ciﬁc requirements and their sizes are not large enough. A recent

literature [3] argued that the quality and the scale of taxonomy

would signiﬁcantly beneﬁt the performance when applied in

software engineering. On the other hand, there have been a

considerable amount of research works on taxonomy construc-

tion [4], [5], [6], [7], [8]. The value of automatic taxonomy

construction is two folds. Automatic taxonomy construction

can achieve large scale taxonomies while manual construction

is a laborious process. Moreover, compared with automatic

approaches that are data-driven, taxonomies built manually are

*Corresponding author

Fig. 1: An example from Stackoverﬂow

often highly subjective. Unfortunately, the resulting taxonomies

of these existing automatic approaches would probably lead to

poor results when applied in software engineering for several

reasons. First is timeliness. The techniques in software engi-

neering are fast changing, while the general web pages and

encyclopedic sites are insensitive to this change and always

fail to update in time. So it is not suitable to select text

corpora such as general web pages or Wikipedia as its input.

Second is granularity. Since the input of traditional taxonomy

construction approaches is from a open domain, some ﬁne-

grained terms about software programming cannot be found in

these taxonomies. For example, “hashmap” about a well-known

data structure is not included in either Yago Taxonomy [9]

or WikiTaxonomy [4] which are the largest existing public

available taxonomies.

Recently, Stackoverﬂow has becoming one of the largest

QA websites about software programming. Speciﬁcally, ques-

tions are the key elements in Stackoverﬂow. Besides the ques-

tion description, as shown in Fig. 1, a question is also associated

with tags and authors. Formally, a question q is a triple in

form of (t

, b

, T S

), where t

is the title of the question,

is the body while T S

is the tag set which annotate the

question. A tag a in Stackoverﬂow can be represented as a

DOI reference number: 10.18293/SEKE2015-135

下载后可阅读完整内容，剩余5页未读，立即下载

weixin_38621082

粉丝: 9
资源: 948

Stackoverflow上的大规模软件编程分类构建

Software.zhishi.schema：源自Stackoverflow的软件编程分类法

StackOverflow 创始人推荐图书

如何解决java.lang.StackOverflowError

StackOverflow

stackoverflowerror原因

StackOverflowError

stackoverflow访问慢

java.lang.stackoverflowerror_java.lang.StackOverflowError——如何解决StackOverflowError错误

mapstruct stackoverflowerror

java stackoverflow

最新资源