Mining Requirements Knowledge from
Collections of Domain Documents
Xiaoli Lian∗, Mona Rahimi+, Jane Cleland-Huang+, Li Zhang∗
+University of Notre Dame, South Bend IN, USA.
∗Beihang University, Beijing, China.
Email: {lianxiaoli,lily}@buaa.edu.cn,
m.rahimi@acm.org, JaneClelandHuang@nd.edu
Remo Ferrari and Michael Smith
Siemens Industry
Rail Automation, New York, USA
remo.ferrari@siemens.com,
michael-smith@siemens.com
Abstract—When organizations enter domains that are entirely
new to them, they need to invest significant time and effort
to acquire domain knowledge. This typically involves searching
through a broad set of domain documents, retrieving relevant
ones, and analyzing the textual content in order to discover
and specify pertinent requirements. Depending on the nature
of the domain and the availability of documentation, this task
can be extremely time-consuming and may require non-trivial
human effort. Furthermore, the task must often be performed
repeatedly throughout early phases of the project. In this paper
we first explore the effort needed to manually build a high-
level domain model capturing the functional components. We
then present MaRK (Mining Requirements Knowledge), which
identifies and retrieves the documents containing descriptions of
functional components in the domain model. Domain analysts can
use this information to specify requirements. We introduce
and evaluate an algorithm which ranks domain documents
according to their relevance to a component and then highlights
sections of text which are likely to contain requirements-related
information. We describe our process within the context of the
Positive Train Control (PTC) domain with a repository of 523
documents, representing 852MB of data. We empirically evaluate
the MaRK relevance algorithm and its ability to retrieve relevant
requirements knowledge for requirements related to PTC’s On-
Board Unit.
I. INTRODUCTION
When entering an entirely new domain, software and sys-
tems engineers typically engage in a process of knowledge
discovery through an activity referred to as Domain Analysis.
Defined by James Neighbors in the 1980s as the process
of analyzing related software systems in order to identify
their commonalities and variabilities [26], domain analysis
can enable significant reuse at requirements, design, and
implementation levels [9]. Sources of domain knowledge
usually include technical literature, existing implementations,
customer surveys, expert advice, requirements specifications
[3] and online product descriptions [15]. Common techniques
for domain analysis include in-depth reviews of the require-
ments, design, code, and other product artifacts for a relatively
small number of existing systems [32], analysis of large
numbers of rather shallow online product descriptions [10],
or searching the web to retrieve and analyze a broad set
of publicly available documents describing products in the
domain [28], [6], [25].
The continually expanding availability of accessible documents
for a broad range of domains makes web-mining particularly
appealing. However, there are challenges in mining
such documents, which tend to be textually rich and generally
unstructured, and which contain highly redundant and sometimes
incomplete descriptions of various system components and
features. Our goal is to leverage such domain documents to
extract a functional domain model describing components,
communication mechanisms, and associated processes. Our in-
dustrial collaborators have stated that in their current practice,
performing this task takes them “enormous amounts of engi-
neering time”. They articulated several goals aimed at reducing
the excessive effort needed to acquire requirements knowledge
from repositories of domain documents. These goals included
generating an overview of the domain, quickly identifying
documents that were relevant to specific components, and
providing affordances to visually explore relevant parts of the
documents.
In this paper we present our approach, which we refer
to as ‘Mining Requirements Knowledge’ (MaRK). MaRK is
designed to reduce human effort by providing semi-automated
support for engineers tasked with discovering requirements
knowledge. We adopt the definition of a requirement as “a
statement of what the system must do, how it must behave,
the properties it must exhibit, the qualities it must possess,
and the constraints that the system and its development must
satisfy” [27]. Furthermore, we define requirements knowledge
as information that is “helpful for answering requirements-
related questions in any phase of a software project” [24].
Requirements knowledge is therefore diverse in nature, and
can be retrieved from various sources. In this work, the
repository of domain documents from which requirements
knowledge is retrieved is particularly diverse and includes
architectural documents, functional descriptions, regulations,
and so on. Our approach is designed for use in any domain
for which domain documents describing the major components
of the domain are available.
A. Domain: Positive Train Control
We apply MaRK to a project in the transportation domain,
focusing on Positive Train Control (PTC) [23], [17]. PTC
2016 IEEE 24th International Requirements Engineering Conference
2332-6441/16 $31.00 © 2016 IEEE
DOI 10.1109/RE.2016.50
RE 2016, Beijing, China
Research Paper