OSSEAN: Mining Crowd Wisdom in Open Source Communities
Gang Yin, Tao Wang, Huaimin Wang, Qiang Fan, Yang Zhang, Yue Yu, Cheng Yang
National Laboratory of Parallel and Distributed Computing
School of Computer, National University of Defense Technology
ChangSha, Hunan, China
yingang@nudt.edu.cn, taowang.2005@outlook.com, whm_w@163.com
Abstract—Nowadays open source software represents a
successful crowd-based software production model and is
becoming an ecosystem combining huge amounts of software
producers (such as software developers) and consumers (such as
software users and customers). Much research has been
conducted on analyzing software artifacts created by producers,
but little of it reveals the power of feedback from consumers,
which we believe is very important for the evaluation and
evolution of open source software. This paper introduces
OSSEAN, a platform for Open Source Software Evaluating,
Analyzing and Networking. OSSEAN divides the open source
communities into two groups: software production communities
and software consumption communities. The former contain
structured software artifacts such as projects, source code and
issues, while the latter are full of textual documents with rich
semantics of user feedback. We show the power of OSSEAN with
some interesting demos by analyzing more than 200 thousand
open source projects and 10 million documents.
Keywords—Open Source; Crowd Wisdom; Software
Production Communities; Software Consumption Communities;
OSSEAN
I. INTRODUCTION
Open source communities successfully leverage the power
of the crowd in software production, and have great impacts on
many stages of software development and applications in a
global open source ecosystem. With the development and
division of open source communities, software programmers
and testers are attracted by software production communities
such as SourceForge and GitHub, while newcomers, users
and customers are more likely to go to software consumption
communities such as StackOverflow, Slashdot and CSDN
(http://www.csdn.net, the biggest IT community site for
Chinese-speaking users in the world). These two kinds of
communities complement each other and greatly expand the
scope of traditional software development activities to global
software evaluation and evolution.
The production communities mainly help software
developers manage their development processes and artifacts.
For example, GitHub provides development tools such as
version control and issue tracking, and social communication
tools such as @mention [1]. SourceForge provides more
complete toolkits for distributed collaboration and management,
including mailing lists, feature requests, etc. These tools store
huge amounts of software engineering data for structured
software artifacts. On the other hand, the consumption
communities are usually crowd-oriented knowledge sharing
platforms attracting tens of millions of users. The data
generated in these communities are usually textual posts,
which receive responses much more quickly than the development
requests submitted in production communities. For example,
StackOverflow has an answer rate above 90% and a median
answer time of only 11 minutes [2], while the average
response time for an issue in the Android issue tracking system
(a typical production community) is about 31 days. The
consumption communities are becoming the source of the
crowd wisdom for the evaluation and evolution of open source
software in production communities.
We propose a new approach, OSSEAN (Open Source
Software Evaluating, Analyzing and Networking), to leverage
the crowd wisdom in consumption communities to support the
software development in production communities. OSSEAN
works in two steps: first, it collects the set of documents in
consumption communities for each software project in production
communities; then, it uses the documents to evaluate,
compare and rank the software. In our
experiments, OSSEAN successfully discovered more than 8
million documents for more than 238 thousand projects. In
this paper, we use three promising demos to show the potential
applications of OSSEAN.
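The two steps above can be illustrated with a minimal sketch. It is not the OSSEAN implementation: the matching rule (whole-word, case-insensitive mention of the project name), the ranking signal (number of linked documents), and all names and sample data below are assumptions chosen only to make the idea concrete.

```python
import re
from collections import defaultdict


def link_documents(projects, documents):
    """Step 1 (assumed rule): link a document to a project if it mentions its name."""
    linked = defaultdict(list)
    for name in projects:
        # Whole-word, case-insensitive match on the project name.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE
        )
        for doc_id, text in documents.items():
            if pattern.search(text):
                linked[name].append(doc_id)
    return linked


def rank_projects(projects, linked):
    """Step 2 (assumed signal): rank by linked-document count, ties by name."""
    return sorted(projects, key=lambda name: (-len(linked.get(name, [])), name))


# Hypothetical sample data standing in for production and consumption communities.
projects = ["Hadoop", "MySQL", "Redis"]
documents = {
    "d1": "How do I tune MySQL for heavy writes?",
    "d2": "Comparing Redis and MySQL as a session store",
    "d3": "Hadoop job fails with out-of-memory error",
    "d4": "mysql replication lag keeps growing",
}

linked = link_documents(projects, documents)
ranking = rank_projects(projects, linked)
print(ranking)
```

A plain name match like this would mislabel documents for projects with common-word names, so a real system would presumably need a stronger association step and richer ranking signals than a raw document count.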
The structure of this paper is as follows. In Section II some
closely related systems are introduced and discussed. In
Section III key mechanisms of our approach are described. In
Section IV, we show some preliminary but promising demos of
OSSEAN. We conclude the paper in Section V.
II. RELATED WORK
A lot of research work has been conducted on collecting
and mining data in open source communities.
The large amounts of high-quality source code publicly
available over the internet have attracted great attention from
researchers. Sushil et al. constructed an Internet-scale software
repository, Sourcerer [7]. It employs structural information
such as references in source code and dependencies between
libraries to achieve large-scale source code indexing
and searching. OCEAN [8] presents a federated search engine
that simultaneously retrieves source code from existing open
source code search engine sites including Koders, Krugle,
Merobase and Google Code.
In industry, many analysis tools and services have been
provided by different companies. Coverity Scan [9] mainly
focuses on analyzing the quality and security of open source
software by providing scan and test services. Coverity Scan
helps developers identify critical quality and security
defects that are hard to find by other methods, and can provide
valuable information for users to locate and fix the identified
defects. As of March 2014, it had scanned 300 million lines
of source code. Many famous open source projects
2015 IEEE Symposium on Service-Oriented System Engineering
978-1-4799-8356-8/15 $31.00 © 2015 IEEE
DOI 10.1109/SOSE.2015.51