Fairness is often domain specific. Regulated domains include credit, education, employment, housing, and public accommodation, where regulations prohibit discrimination ‘in a place of public accommodation on the basis of sexual orientation, gender identity, or gender expression’ [63].
Formulating fairness is the first step towards solving fairness problems and building fair machine learning models. The literature has proposed many definitions of fairness, but no firm consensus has been reached so far. Since the definitions themselves are a research focus of fairness in machine learning, we discuss how the literature formulates and measures different types of fairness in Section 6.5.
3.4.8 Interpretability
Machine learning models are often applied to assist or make decisions in medical treatment, income prediction, or personal credit assessment. It may be important for humans to understand the ‘logic’ behind the final decisions, so that they can build trust in the decisions made by ML [64], [65], [66].
The motives and definitions of interpretability are di-
verse and still somewhat discordant [64]. Nevertheless, un-
like fairness, a mathematical definition of ML interpretabil-
ity remains elusive [65]. Referring to the work of Biran and
Cotton [67] as well as the work of Miller [68], we describe
the interpretability of ML as the degree to which an observer
can understand the cause of a decision made by an ML
system.
Interpretability contains two aspects: transparency (how the model works) and post hoc explanations (other information that could be derived from the model) [64]. Interpretability is also regarded as a requirement by regulations such as the GDPR [69], under which users have a legal ‘right to explanation’, allowing them to ask for an explanation of an algorithmic decision made about them. A thorough introduction to ML interpretability can be found in the book by Christoph Molnar [70].
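To make the post hoc aspect concrete, the following is a minimal sketch of one widely used post hoc explanation technique, permutation feature importance. It is an illustration rather than a method prescribed by this survey; it assumes a fitted scikit-learn-style classifier, and the names model, X and y are hypothetical.

import numpy as np

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    # Post hoc explanation: the drop in accuracy when a feature is
    # shuffled indicates how much the model relies on that feature.
    # Assumes `model` exposes a scikit-learn-style score(X, y) method.
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Destroy the signal carried by feature j only.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - model.score(X_perm, y))
        importances[j] = float(np.mean(drops))
    return importances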
3.5 Software Testing vs. ML Testing
Traditional software testing and ML testing are different
in many aspects. To understand the unique features of
ML testing, we summarise the primary differences between
traditional software testing and ML testing in Table 1.
1) Component to test (where the bug may exist): traditional software testing detects bugs in the code, while ML testing detects bugs in the data, the learning program, and the framework, each of which plays an essential role in building an ML model.
2) Behaviours under test: the behaviours of traditional
software code are usually fixed once the requirement is
fixed, while the behaviours of an ML model may frequently
change as the training data is updated.
3) Test input: the test inputs in traditional software testing are usually the input data when testing code; in ML testing, however, the test inputs may take more diverse forms. Note that we separate the definition of ‘test input’ from that of ‘test data’. In particular, we use ‘test input’ to refer to inputs in any form that can be adopted to conduct machine learning testing, while ‘test data’ refers specifically to the data used to validate ML model behaviour (see more in
Section 2). Thus, test inputs in ML testing could be, but are
not limited to, test data. When testing the learning program,
a test case may be a single test instance from the test data or
a toy training set; when testing the data, the test input could
be a learning program.
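As a concrete illustration of a toy training set serving as a test input for the learning program, the following sketch trains on trivially separable data and checks that the learner fits it perfectly. The use of scikit-learn and of logistic regression here is an illustrative assumption, not something prescribed by the works surveyed.

import numpy as np
from sklearn.linear_model import LogisticRegression

def test_learning_program_fits_separable_data():
    # A toy, trivially separable training set acts as the test input
    # for the learning program itself (not for a deployed model).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-5.0, 0.5, size=(50, 2)),   # class 0 cluster
                   rng.normal(+5.0, 0.5, size=(50, 2))])  # class 1 cluster
    y = np.array([0] * 50 + [1] * 50)
    model = LogisticRegression().fit(X, y)
    # A correctly implemented learning program should separate the
    # two well-separated clusters perfectly.
    assert model.score(X, y) == 1.0

test_learning_program_fits_separable_data()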
4) Test oracle: traditional software testing usually assumes
the presence of a test oracle. The output can be verified
against the expected values by the developer, and thus the
oracle is usually determined beforehand. Machine learning,
however, is used to generate answers based on a set of input
values after being deployed online. The correctness of the
large number of generated answers is typically manually
confirmed. Currently, the identification of test oracles re-
mains challenging, because many desired properties are dif-
ficult to formally specify. Even for a concrete domain-specific problem, oracle identification is still time-consuming and labour-intensive, because domain-specific knowledge is often required. In current practice, companies usually rely on third-party data labelling companies to obtain manual labels, which can be expensive. Metamorphic relations [71] are a type of pseudo oracle adopted to automatically mitigate the oracle problem in machine learning testing, as sketched below.
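For instance, one simple metamorphic relation states that, for many learners, permuting the order of the training instances should not change the predictions on a fixed set of test inputs; the relation itself acts as the oracle, so no ground-truth labels for those inputs are needed. A minimal sketch follows; the k-NN classifier and synthetic data are assumptions made for the example, not part of any surveyed approach.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(20, 4))   # no labels needed for these inputs

# Source execution: train on the original ordering of the training data.
source = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Follow-up execution: train on a permuted ordering of the same data.
perm = rng.permutation(len(X_train))
followup = KNeighborsClassifier(n_neighbors=3).fit(X_train[perm], y_train[perm])

# Metamorphic relation as pseudo oracle: both models must agree on every input.
assert np.array_equal(source.predict(X_test), followup.predict(X_test))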
5) Test adequacy criteria: test adequacy criteria are used to provide a quantitative measure of the degree to which the target software has been tested. To date, many adequacy criteria have been proposed and are widely adopted in industry, e.g., line coverage, branch coverage, and dataflow coverage. However, due to fundamental differences in programming paradigm and logic representation format between machine learning software and traditional software, new test adequacy criteria are required to take the characteristics of machine learning software into consideration (one such criterion is sketched below).
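One example of such a criterion proposed in the literature is neuron coverage, which plays a role analogous to line coverage for neural networks. The following is a minimal sketch, assuming the per-neuron activations for each test input have already been extracted from the network; the threshold and the activation values here are illustrative.

import numpy as np

def neuron_coverage(activations, threshold=0.5):
    # activations: shape (n_test_inputs, n_neurons); a neuron counts as
    # covered if at least one test input activates it above the threshold.
    covered = (activations > threshold).any(axis=0)
    return covered.mean()

# Example with 3 test inputs and 5 neurons: neurons 0, 2 and 4 are covered.
acts = np.array([[0.9, 0.1, 0.0, 0.2, 0.7],
                 [0.3, 0.4, 0.8, 0.1, 0.0],
                 [0.2, 0.0, 0.6, 0.3, 0.1]])
print(neuron_coverage(acts))  # 0.6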
6) False positives in detected bugs: due to the difficulty in
obtaining reliable oracles, ML testing tends to yield more
false positives in the reported bugs.
7) Roles of testers: the bugs in ML testing may exist not
only in the learning program, but also in the data or the
algorithm, and thus data scientists or algorithm designers
could also play the role of testers.
4 PAPER COLLECTION AND REVIEW SCHEMA
This section introduces the scope, the paper collection ap-
proach, an initial analysis of the collected papers, and the
organisation of our survey.
4.1 Survey Scope
An ML system may include both hardware and software.
The scope of our paper is software testing (as defined in the
introduction) applied to machine learning.
We apply the following inclusion criteria when collecting
papers. If a paper satisfies any one or more of the following
criteria, we will include it. When speaking of related ‘aspects
of ML testing’, we refer to the ML properties, ML compon-
ents, and ML testing procedure introduced in Section 2.
1) The paper introduces/discusses the general idea of ML
testing or one of the related aspects of ML testing.
2) The paper proposes an approach, study, or tool/frame-
work that targets testing one of the ML properties or com-
ponents.
3) The paper presents a dataset or benchmark especially
designed for the purpose of ML testing.