Fairness is often domain specific. Regulated domains include credit, education, employment, housing, and public accommodation, where regulations prohibit discrimination ‘in a place of public accommodation on the basis of sexual orientation, gender identity, or gender expression’ [63].
Formulating fairness is the first step towards solving fairness problems and building fair machine learning models. The literature has proposed many definitions of fairness, but no firm consensus has been reached so far. Since the definitions themselves are a research focus of fairness in machine learning, we discuss how the literature formulates and measures different types of fairness in Section 6.5.
3.4.8 Interpretability
Machine learning models are often applied to assist or make decisions in medical treatment, income prediction, or personal credit assessment. It may be important for humans to understand the ‘logic’ behind the final decisions, so that they can build trust in the decisions made by ML [64], [65], [66].
The motives and definitions of interpretability are di-
verse and still somewhat discordant [64]. Nevertheless, un-
like fairness, a mathematical definition of ML interpretabil-
ity remains elusive [65]. Referring to the work of Biran and
Cotton [67] as well as the work of Miller [68], we describe
the interpretability of ML as the degree to which an observer
can understand the cause of a decision made by an ML
system.
Interpretability contains two aspects: transparency (how the model works) and post hoc explanations (other information that could be derived from the model) [64]. Interpretability is also regarded as a requirement by regulations such as the GDPR [69], under which users have a legal ‘right to explanation’, allowing them to ask for an explanation of an algorithmic decision made about them. A thorough introduction to ML interpretability can be found in the book by Christoph Molnar [70].
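To make the post hoc aspect concrete, the following is a minimal sketch of one widely used post hoc explanation technique, permutation feature importance. It is an illustration rather than a method prescribed by this survey; it assumes a fitted scikit-learn-style classifier, and the names model, X and y are hypothetical.

import numpy as np

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    # Post hoc explanation: the drop in accuracy when a feature is
    # shuffled indicates how much the model relies on that feature.
    # Assumes `model` exposes a scikit-learn-style score(X, y) method.
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Destroy the signal carried by feature j only.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - model.score(X_perm, y))
        importances[j] = float(np.mean(drops))
    return importances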
3.5 Software Testing vs. ML Testing
Traditional software testing and ML testing are different
in many aspects. To understand the unique features of
ML testing, we summarise the primary differences between
traditional software testing and ML testing in Table 1.
1) Component to test (where the bug may exist): traditional software testing detects bugs in the code, while ML testing detects bugs in the data, the learning program, and the framework, each of which plays an essential role in building an ML model.
2) Behaviours under test: the behaviours of traditional
software code are usually fixed once the requirement is
fixed, while the behaviours of an ML model may frequently
change as the training data is updated.
3) Test input: the test inputs in traditional software testing are usually the input data when testing code; in ML testing, however, the test inputs may take more diverse forms. Note that we separate the definition of ‘test input’ from that of ‘test data’. In particular, we use ‘test input’ to refer to inputs in any form that can be adopted to conduct machine learning testing, while ‘test data’ refers specifically to the data used to validate ML model behaviour (see more in
Section 2). Thus, test inputs in ML testing could be, but are
not limited to, test data. When testing the learning program,
a test case may be a single test instance from the test data or
a toy training set; when testing the data, the test input could
be a learning program.
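As a concrete illustration of a toy training set serving as a test input for the learning program, the following sketch trains on trivially separable data and checks that the learner fits it perfectly. The use of scikit-learn and of logistic regression here is an illustrative assumption, not something prescribed by the works surveyed.

import numpy as np
from sklearn.linear_model import LogisticRegression

def test_learning_program_fits_separable_data():
    # A toy, trivially separable training set acts as the test input
    # for the learning program itself (not for a deployed model).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-5.0, 0.5, size=(50, 2)),   # class 0 cluster
                   rng.normal(+5.0, 0.5, size=(50, 2))])  # class 1 cluster
    y = np.array([0] * 50 + [1] * 50)
    model = LogisticRegression().fit(X, y)
    # A correctly implemented learning program should separate the
    # two well-separated clusters perfectly.
    assert model.score(X, y) == 1.0

test_learning_program_fits_separable_data()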
4) Test oracle: traditional software testing usually assumes
the presence of a test oracle. The output can be verified
against the expected values by the developer, and thus the
oracle is usually determined beforehand. Machine learning,
however, is used to generate answers based on a set of input
values after being deployed online. The correctness of the
large number of generated answers is typically manually
confirmed. Currently, the identification of test oracles re-
mains challenging, because many desired properties are dif-
ficult to formally specify. Even for a concrete domain-specific problem, oracle identification is still time-consuming and labour-intensive, because domain-specific knowledge is often required. In current practice, companies usually rely on third-party data labelling companies to obtain manual labels, which can be expensive. Metamorphic relations [71] are a type of pseudo oracle adopted to automatically mitigate the oracle problem in machine learning testing, as sketched below.
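For instance, one simple metamorphic relation states that, for many learners, permuting the order of the training instances should not change the predictions on a fixed set of test inputs; the relation itself acts as the oracle, so no ground-truth labels for those inputs are needed. A minimal sketch follows; the k-NN classifier and synthetic data are assumptions made for the example, not part of any surveyed approach.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(20, 4))   # no labels needed for these inputs

# Source execution: train on the original ordering of the training data.
source = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Follow-up execution: train on a permuted ordering of the same data.
perm = rng.permutation(len(X_train))
followup = KNeighborsClassifier(n_neighbors=3).fit(X_train[perm], y_train[perm])

# Metamorphic relation as pseudo oracle: both models must agree on every input.
assert np.array_equal(source.predict(X_test), followup.predict(X_test))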
5) Test adequacy criteria: test adequacy criteria are used to provide a quantitative measure of the degree to which the target software has been tested. To date, many adequacy criteria have been proposed and are widely adopted in industry, e.g., line coverage, branch coverage, and dataflow coverage. However, due to fundamental differences in programming paradigm and logic representation format between machine learning software and traditional software, new test adequacy criteria are required to take the characteristics of machine learning software into consideration (one such criterion is sketched below).
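One example of such a criterion proposed in the literature is neuron coverage, which plays a role analogous to line coverage for neural networks. The following is a minimal sketch, assuming the per-neuron activations for each test input have already been extracted from the network; the threshold and the activation values here are illustrative.

import numpy as np

def neuron_coverage(activations, threshold=0.5):
    # activations: shape (n_test_inputs, n_neurons); a neuron counts as
    # covered if at least one test input activates it above the threshold.
    covered = (activations > threshold).any(axis=0)
    return covered.mean()

# Example with 3 test inputs and 5 neurons: neurons 0, 2 and 4 are covered.
acts = np.array([[0.9, 0.1, 0.0, 0.2, 0.7],
                 [0.3, 0.4, 0.8, 0.1, 0.0],
                 [0.2, 0.0, 0.6, 0.3, 0.1]])
print(neuron_coverage(acts))  # 0.6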
6) False positives in detected bugs: due to the difficulty in
obtaining reliable oracles, ML testing tends to yield more
false positives in the reported bugs.
7) Roles of testers: the bugs in ML testing may exist not
only in the learning program, but also in the data or the
algorithm, and thus data scientists or algorithm designers
could also play the role of testers.
4 PAPER COLLECTION AND REVIEW SCHEMA
This section introduces the scope, the paper collection ap-
proach, an initial analysis of the collected papers, and the
organisation of our survey.
4.1 Survey Scope
An ML system may include both hardware and software.
The scope of our paper is software testing (as defined in the
introduction) applied to machine learning.
We apply the following inclusion criteria when collecting
papers. If a paper satisfies any one or more of the following
criteria, we will include it. When speaking of related ‘aspects
of ML testing’, we refer to the ML properties, ML compon-
ents, and ML testing procedure introduced in Section 2.
1) The paper introduces/discusses the general idea of ML
testing or one of the related aspects of ML testing.
2) The paper proposes an approach, study, or tool/frame-
work that targets testing one of the ML properties or com-
ponents.
3) The paper presents a dataset or benchmark especially
designed for the purpose of ML testing.