Theory of Active Learning: Exploration and Optimization

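"Theory of Active Learning - Steve Hanneke"

Active learning is a protocol for supervised machine learning in which the learning algorithm sequentially requests the labels of selected data points from a large pool of unlabeled data. This contrasts with passive learning, where the labeled data are taken at random. The objective in active learning is to produce a highly accurate classifier using as few labels as possible, ideally fewer than the number of labels passive learning needs to reach the same accuracy. The article explores the theoretical advantages of active learning and their implications for the design of effective active learning algorithms. It pays particular attention to one technique, disagreement-based active learning, which by now has a mature and coherent literature, and it briefly surveys several alternative approaches from the literature. The emphasis is on theorems about the performance of a few general algorithms, including rigorous proofs where appropriate. The presentation is nonetheless intended to be pedagogical, focusing on results that illustrate the fundamental ideas rather than on the strongest or most general known theorems. The intended audience is researchers and advanced graduate students in machine learning and statistics who want a deeper understanding of recent advances in active learning. Because the field continues to develop, the article will be updated periodically, and the latest version can be obtained from the author's website. The author, Steve Hanneke, notes that an abbreviated version of this article was published in 2014 in the Foundations and Trends in Machine Learning series, and that the copyright belongs to S. Hanneke.

The core idea of active learning is to improve learning efficiency by choosing intelligently which data points to label. Disagreement-based strategies typically request labels for the data points on which candidate models disagree, because those points may carry information that is critical for improving the model. This reduces the dependence on large amounts of labeled data and improves learning under a limited labeling budget. The article may also discuss other active learning strategies, such as query-based strategies, for example uncertainty sampling and density estimation. Uncertainty sampling usually selects the points whose predicted probabilities are most uncertain, whereas density estimation favors samples from dense regions of the data distribution, since those regions may contain more class information.

"Theory of Active Learning" offers deep insight into the theoretical foundations of active learning and is a valuable resource for scholars and practitioners who want to study the field in depth. It presents the mathematical principles behind the theory and offers insights that guide algorithm design, which helps advance the field of machine learning.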

Basic Definitions and Notation

We begin by formalizing the active learning setting, defining the quantities that will be the focus of our discussion, and providing a few basic examples.

2.1 The Setting

We consider the following formal setting throughout this article. There is a set $\mathcal{X}$ called the instance space, equipped with a σ-algebra $\mathcal{B}_{\mathcal{X}}$; for convenience, let us suppose $(\mathcal{X}, \mathcal{B}_{\mathcal{X}})$ is a standard Borel space (e.g., $\mathbb{R}^m$ under the usual Borel σ-algebra). Also let $\mathcal{Y} = \{-1, +1\}$, called the label space, and suppose $\mathcal{X} \times \mathcal{Y}$ is equipped with its product σ-algebra $\mathcal{B} = \mathcal{B}_{\mathcal{X}} \otimes 2^{\mathcal{Y}}$. Fix a probability measure $P_{XY}$ on $\mathcal{X} \times \mathcal{Y}$, called the target distribution, denote by $P$ the marginal distribution of $P_{XY}$ over $\mathcal{X}$, and $\forall x \in \mathcal{X}$, denote $\eta(x) = \mathbb{P}(Y = +1 \mid X = x)$, where $(X, Y) \sim P_{XY}$. We refer to any measurable $h : \mathcal{X} \to \mathcal{Y}$ as a classifier. For any classifier $h$, define $\mathrm{er}(h) = P_{XY}((x, y) : h(x) \neq y)$, called the error rate; in words, this is the probability that $h$ makes a mistake in predicting the label $Y$ by $h(X)$, for a random point $(X, Y) \sim P_{XY}$. Throughout, let us make the usual simplifying assumption that all sets we evaluate the probabilities of, or functions we take expectations of, are indeed measurable; when this is not the case, one may typically turn to outer probabilities to maintain validity of the results, but we will not discuss these technical issues further below.
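To make the error rate concrete, the following minimal Python sketch estimates er(h) by Monte Carlo sampling. Everything specific here is an assumption made for illustration and does not come from the text: the instance space is [0, 1] with a uniform marginal, η(x) is 0.9 for x ≥ 0.5 and 0.1 otherwise, and the names `sample_point`, `threshold_classifier`, and `estimate_error_rate` are hypothetical.

```python
import random

def sample_point(rng):
    """Draw (x, y) from a hypothetical target distribution:
    x ~ Uniform[0, 1], eta(x) = P(Y = +1 | X = x) is 0.9 if x >= 0.5 else 0.1."""
    x = rng.random()
    eta = 0.9 if x >= 0.5 else 0.1
    y = +1 if rng.random() < eta else -1
    return x, y

def threshold_classifier(t):
    """Classifier h_t(x) = +1 if x >= t, else -1."""
    return lambda x: +1 if x >= t else -1

def estimate_error_rate(h, num_samples, rng):
    """Monte Carlo estimate of er(h): the fraction of sampled points with h(x) != y."""
    mistakes = 0
    for _ in range(num_samples):
        x, y = sample_point(rng)
        if h(x) != y:
            mistakes += 1
    return mistakes / num_samples

rng = random.Random(0)
h = threshold_classifier(0.5)
est = estimate_error_rate(h, 100_000, rng)
print(est)  # an estimate of er(h); the exact value is 0.1 for this toy distribution
```

For this toy distribution the classifier with threshold 0.5 errs exactly when the label is the less likely one, so its true error rate is 0.1, and the estimate should land near that value.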

In this context, we are interested in learning from data: that is, producing a classifier $h$ with small $\mathrm{er}(h)$, based on samples from $P_{XY}$. Specifically, let $\mathcal{Z} = \{(X_i, Y_i)\}_{i=1}^{\infty}$ be a sequence of independent $P_{XY}$-distributed random variables, called the labeled data sequence. For $m \in \mathbb{N}$, denote by $\mathcal{Z}_m = \{(X_i, Y_i)\}_{i=1}^{m}$ the first $m$ data points. Also denote by $\mathcal{Z}_X = \{X_i\}_{i=1}^{\infty}$ the unlabeled data sequence. Though in practice, the actual sequence of unlabeled data available would typically be large but finite, to focus our analysis on the number of label requests sufficient for learning, let us suppose we have access to the entire $\mathcal{Z}_X$ sequence, representing an inexhaustible source of unlabeled data; the actual number of unlabeled data points needed by the algorithms below for their respective guarantees to hold can be extracted from their respective analyses.

In the active learning protocol, the learning algorithm is given a budget $n$, and provided direct access to $\mathcal{Z}_X$. It may then select any index $i_1 \in \mathbb{N}$ and request to observe the label $Y_{i_1}$. Upon receiving the value of $Y_{i_1}$, it may then select another index $i_2$, request the label $Y_{i_2}$, and so on. After a number of these label requests not exceeding the budget $n$, the algorithm halts and returns a classifier $\hat{h}$. More formally, this protocol specifies a family of estimators that map $\mathcal{Z}$ to a classifier $\hat{h}$, such that for every $P_{XY}$, $\hat{h}$ is conditionally independent of $\mathcal{Z}$ given $\mathcal{Z}_X$ and $(i_1, Y_{i_1}), \ldots, (i_n, Y_{i_n})$, where each $i_k$ is conditionally independent of $\mathcal{Z}$ given $\mathcal{Z}_X$ and $(i_1, Y_{i_1}), \ldots, (i_{k-1}, Y_{i_{k-1}})$. In contrast, a passive learning algorithm is any (possibly randomized, independent from $\mathcal{Z}$) function $\mathcal{A}$ mapping a sequence $\mathcal{L} \in \bigcup_{n \in \mathbb{N}} (\mathcal{X} \times \mathcal{Y})^n$ of labeled data points to a classifier $\hat{h}$. We are then particularly interested in the behavior of $\mathcal{A}(\mathcal{Z}_n)$ as a function of $n$.
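The label-request protocol can be sketched in Python as an oracle that hides the labels and enforces the budget, together with a learner whose next index depends only on the unlabeled points and the labels it has already seen. This is a minimal illustration under stated assumptions, not an algorithm from the text: the class `LabelOracle`, the function `active_learn_threshold`, the threshold hypothesis class, the noiseless labels, and the target threshold 0.3 are all hypothetical.

```python
import random

class LabelOracle:
    """Hides the labels Y_i and enforces the label budget n."""

    def __init__(self, labels, budget):
        self._labels = labels
        self.budget = budget
        self.requests = []  # indices i_1, i_2, ... requested so far

    def request(self, i):
        if len(self.requests) >= self.budget:
            raise RuntimeError("label budget exhausted")
        self.requests.append(i)
        return self._labels[i]

def active_learn_threshold(xs, oracle):
    """Binary search over the sorted unlabeled points (assumes noiseless
    threshold labels). Returns the classifier h(x) = +1 iff x >= t."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    # Invariant: sorted positions < lo are labeled -1, positions >= hi are labeled +1.
    lo, hi = 0, len(order)
    while lo < hi and len(oracle.requests) < oracle.budget:
        mid = (lo + hi) // 2
        if oracle.request(order[mid]) == +1:
            hi = mid
        else:
            lo = mid + 1
    t = xs[order[hi]] if hi < len(order) else float("inf")
    return lambda x: +1 if x >= t else -1

rng = random.Random(1)
xs = [rng.random() for _ in range(1000)]
ys = [+1 if x >= 0.3 else -1 for x in xs]  # hypothetical noiseless target
oracle = LabelOracle(ys, budget=20)
h_hat = active_learn_threshold(xs, oracle)
print(len(oracle.requests), all(h_hat(x) == y for x, y in zip(xs, ys)))
```

A passive learner in the sense above would instead receive the first n labeled pairs outright, with no say in which indices are labeled.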


2.2 Basic Definitions

The primary focus in the study of active learning is the label complexity, defined formally as follows. A label complexity function $\Lambda$ maps two values $\varepsilon, \delta \in [0, 1]$ and a distribution $P_{XY}$ to a value $\Lambda(\varepsilon, \delta, P_{XY}) \in \mathbb{N} \cup \{\infty\}$.

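Definition 2.1. For any active learning algorithm $\mathcal{A}$, we say $\mathcal{A}$ achieves label complexity $\Lambda$ if, for every $\varepsilon \geq 0$ and $\delta \in [0, 1]$, every distribution $P_{XY}$ over $\mathcal{X} \times \mathcal{Y}$, and every integer $n \geq \Lambda(\varepsilon, \delta, P_{XY})$, if $\hat{h}$ is the classifier produced by running $\mathcal{A}$ with budget $n$, then with probability at least $1 - \delta$, $\mathrm{er}(\hat{h}) \leq \varepsilon$.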

We will be particularly interested in the label complexity of achieving low error rate relative to the best error rate among a fixed set $\mathbb{C}$ of classifiers, known as the hypothesis class. In particular, denoting $\nu = \inf_{h \in \mathbb{C}} \mathrm{er}(h)$ (called the noise rate), we are typically interested in the value of $\Lambda(\nu + \varepsilon, \delta, P_{XY})$ as a function of $\varepsilon$, $\delta$, and $P_{XY}$. For simplicity, we will suppose the infimum $\inf_{h \in \mathbb{C}} \mathrm{er}(h)$ is actually achieved by a classifier $f^\star \in \mathbb{C}$ (i.e., $\mathrm{er}(f^\star) = \nu$); otherwise, we could either let $f^\star \in \mathbb{C}$ be a classifier with $\mathrm{er}(f^\star)$ merely close to $\inf_{h \in \mathbb{C}} \mathrm{er}(h)$ [as done by Hanneke, 2011], or let $f^\star$ be in the closure of $\mathbb{C}$ with $\mathrm{er}(f^\star) = \nu$ [following Hanneke, 2012].

For comparison, we will also discuss the label complexity of certain passive learning algorithms $\mathcal{A}$. We can define this notion by considering a very simple type of active learning algorithm, which given budget $n$, simply requests the labels $Y_1, \ldots, Y_n$, and then returns the classifier produced by $\mathcal{A}(\mathcal{Z}_n)$. We then say $\mathcal{A}$ achieves a label complexity $\Lambda$ under the same conditions specified by Definition 2.1, applied to this simple active learning algorithm.

Following the classic work of Vapnik and Chervonenkis [1971], for any $m \in \mathbb{N}$ and sequence $(x_1, \ldots, x_m) \in \mathcal{X}^m$, we say a set $\mathcal{H}$ of classifiers shatters $(x_1, \ldots, x_m)$ if, for every $(y_1, \ldots, y_m) \in \mathcal{Y}^m$, $\exists h \in \mathcal{H}$ s.t. $\forall i \in \{1, \ldots, m\}$, $h(x_i) = y_i$; in other words, $\mathcal{H}$ shatters $(x_1, \ldots, x_m)$ if all $2^m$ possible classifications of $(x_1, \ldots, x_m)$ can be realized by classifiers in $\mathcal{H}$. For convenience, define $\mathcal{X}^0 = \{()\}$ (where $()$ is the empty sequence), and say a set $\mathcal{H}$ shatters the empty sequence $()$ if and only if $\mathcal{H}$ is nonempty. The Vapnik-Chervonenkis (VC) dimension of a non-empty set $\mathcal{H}$, denoted $\mathrm{vc}(\mathcal{H})$, is defined as the largest integer $m$ such that $\exists S \in \mathcal{X}^m$ shattered by $\mathcal{H}$, or as $\infty$ if no such value exists. We denote $d = \mathrm{vc}(\mathbb{C})$, and for simplicity, for the vast majority of the article, we will suppose $d < \infty$; in particular, many of the results below are stated in terms of $d$. We discuss other interesting scenarios, where $d$ may be infinite, in Section 8.8.
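The shattering condition translates directly into a brute-force check over all labelings. The sketch below is illustrative only: it works with finite lists of classifiers and a finite pool of points, so `vc_dimension_on` computes the largest shattered subset of the pool, which is only a lower bound on the VC dimension of the full class. The function names and the threshold and interval classes on a small grid are assumptions for the example.

```python
from itertools import combinations, product

def shatters(classifiers, points):
    """True iff every labeling in {-1, +1}^m of `points` is realized
    by some classifier. The empty sequence is shattered iff the set
    of classifiers is nonempty, matching the convention above."""
    realized = {tuple(h(x) for x in points) for h in classifiers}
    return all(labels in realized
               for labels in product((-1, +1), repeat=len(points)))

def vc_dimension_on(classifiers, pool):
    """Largest m such that some m points of `pool` are shattered.
    Assumes `classifiers` is a nonempty list. Stopping at the first
    failing m is valid because subsets of shattered sets are shattered."""
    best = 0
    for m in range(1, len(pool) + 1):
        if any(shatters(classifiers, s) for s in combinations(pool, m)):
            best = m
        else:
            break
    return best

grid = [0.0, 0.25, 0.5, 0.75, 1.0]
pool = [0.1, 0.4, 0.7]

# Hypothetical classes: thresholds h_t(x) = +1 iff x >= t,
# and intervals h_{a,b}(x) = +1 iff a <= x <= b.
thresholds = [lambda x, t=t: +1 if x >= t else -1 for t in grid]
intervals = [lambda x, a=a, b=b: +1 if a <= x <= b else -1
             for a in grid for b in grid if a <= b]

print(vc_dimension_on(thresholds, pool))  # 1
print(vc_dimension_on(intervals, pool))   # 2
```

Thresholds cannot label a smaller point +1 and a larger point -1, and intervals cannot realize the pattern (+1, -1, +1), which is why the values are 1 and 2 respectively.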

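For any set $A$, let $\mathbb{1}_A$ denote the indicator function for $A$: that is, $\mathbb{1}_A(x) = 1$ if $x \in A$, and $\mathbb{1}_A(x) = 0$ otherwise. We will also sometimes use the notation $\mathbb{1}[L]$, where $L$ is a logical expression (e.g., "$f(x) \neq y$"), defining $\mathbb{1}[L] = 1$ if $L$ is true, and $\mathbb{1}[L] = 0$ if $L$ is false. Additionally, define the signed indicator function of $A$ as $\mathbb{1}^{\pm}_A = 2\,\mathbb{1}_A - 1$. For a classifier $h$ and a sequence of labeled data points $\mathcal{L} \in \bigcup_{m \in \mathbb{N}} (\mathcal{X} \times \mathcal{Y})^m$, define the empirical error rate of $h$ with respect to $\mathcal{L}$ as $\mathrm{er}_{\mathcal{L}}(h) = \frac{1}{|\mathcal{L}|} \sum_{(x,y) \in \mathcal{L}} \mathbb{1}[h(x) \neq y]$, representing the fraction of points in $\mathcal{L}$ on which $h$ makes mistakes. For completeness, also define $\mathrm{er}_{\emptyset}(h) = 0$. Also, when $\mathcal{L} = \mathcal{Z}_m$, the first $m$ labeled data points, for any $m \in \mathbb{N} \cup \{0\}$, abbreviate $\mathrm{er}_m(h) = \mathrm{er}_{\mathcal{Z}_m}(h)$; also denote $V^\star_m = \{h \in \mathbb{C} : \forall i \leq m, h(X_i) = f^\star(X_i)\}$, called the version space induced by $\{X_1, \ldots, X_m\}$.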
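The empirical error rate and the version space are both easy to compute when the hypothesis class is finite. The following Python sketch assumes a small hypothetical class of threshold classifiers; the `Threshold` class, the function names, and the sample data are illustrative choices rather than anything specified in the text.

```python
class Threshold:
    """Classifier h_t(x) = +1 if x >= t, else -1 (hypothetical example class)."""

    def __init__(self, t):
        self.t = t

    def __call__(self, x):
        return +1 if x >= self.t else -1

def empirical_error_rate(h, labeled):
    """Fraction of (x, y) pairs in `labeled` with h(x) != y; 0 for the empty sequence."""
    if not labeled:
        return 0.0
    return sum(1 for x, y in labeled if h(x) != y) / len(labeled)

def version_space(hypotheses, f_star, xs):
    """Classifiers that agree with f_star on every point in xs."""
    return [h for h in hypotheses if all(h(x) == f_star(x) for x in xs)]

hypotheses = [Threshold(k / 5) for k in range(6)]  # t = 0.0, 0.2, ..., 1.0
f_star = hypotheses[2]                             # t = 0.4

# One label below is deliberately flipped relative to f_star.
labeled = [(0.1, -1), (0.5, +1), (0.9, -1)]
print(empirical_error_rate(f_star, labeled))       # one mistake out of three

V = version_space(hypotheses, f_star, [0.1, 0.5, 0.9])
print([h.t for h in V])                            # [0.2, 0.4]
```

Only thresholds between 0.1 and 0.5 label all three points the way f_star does, so the version space shrinks to the two grid values in that range.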

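For any set of classifiers $\mathcal{H}$, and any $\varepsilon \in [0, 1]$, define the $\varepsilon$-minimal set as $\mathcal{H}(\varepsilon) = \{h \in \mathcal{H} : \mathrm{er}(h) - \inf_{g \in \mathcal{H}} \mathrm{er}(g) \leq \varepsilon\}$; also, for any classifier $h$, define the $\varepsilon$-ball centered at $h$ as $\mathrm{B}_{\mathcal{H},P}(h, \varepsilon) = \{g \in \mathcal{H} : P(x : g(x) \neq h(x)) \leq \varepsilon\}$; when $\mathcal{H} = \mathbb{C}$, the hypothesis class, abbreviate $\mathrm{B}_P(h, \varepsilon) = \mathrm{B}_{\mathbb{C},P}(h, \varepsilon)$, and when $P$ is clear from the context, abbreviate $\mathrm{B}_{\mathcal{H}}(h, \varepsilon) = \mathrm{B}_{\mathcal{H},P}(h, \varepsilon)$, and $\mathrm{B}(h, \varepsilon) = \mathrm{B}_{\mathbb{C},P}(h, \varepsilon)$. Additionally, define the radius of the set $\mathcal{H}$ as $\mathrm{radius}(\mathcal{H}) = \sup_{h \in \mathcal{H}} P(x : h(x) \neq f^\star(x))$, which is the smallest $\varepsilon$ for which $\mathcal{H} = \mathrm{B}_{\mathcal{H}}(f^\star, \varepsilon)$. Finally, define the region of disagreement of $\mathcal{H}$ as

$$\mathrm{DIS}(\mathcal{H}) = \{x \in \mathcal{X} : \exists h, g \in \mathcal{H} \text{ s.t. } h(x) \neq g(x)\},$$

the set of points for which there is some disagreement among classifiers in $\mathcal{H}$ regarding their predicted label.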
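The ball, the radius, and the region of disagreement can be computed exactly when both the class and the instance space are finite. The Python sketch below assumes a pool of ten equally weighted points standing in for the marginal distribution, and a small class of thresholds; the `Threshold` class and all function names are hypothetical.

```python
class Threshold:
    """Classifier h_t(x) = +1 if x >= t, else -1 (hypothetical example class)."""

    def __init__(self, t):
        self.t = t

    def __call__(self, x):
        return +1 if x >= self.t else -1

def disagreement_mass(g, h, pool):
    """P(x : g(x) != h(x)) under the uniform distribution on `pool`."""
    return sum(1 for x in pool if g(x) != h(x)) / len(pool)

def ball(hypotheses, h, eps, pool):
    """Classifiers within disagreement mass eps of h."""
    return [g for g in hypotheses if disagreement_mass(g, h, pool) <= eps]

def radius(hypotheses, f_star, pool):
    """Largest disagreement mass between f_star and any classifier in the set."""
    return max(disagreement_mass(h, f_star, pool) for h in hypotheses)

def region_of_disagreement(hypotheses, pool):
    """Points of `pool` on which at least two classifiers predict differently."""
    return [x for x in pool if len({h(x) for h in hypotheses}) > 1]

pool = [(2 * k + 1) / 20 for k in range(10)]       # 0.05, 0.15, ..., 0.95
hypotheses = [Threshold(k / 5) for k in range(6)]  # t = 0.0, 0.2, ..., 1.0
f_star = hypotheses[2]                             # t = 0.4

near = ball(hypotheses, f_star, 0.2, pool)
dis = region_of_disagreement(near, pool)
print([g.t for g in near])       # thresholds 0.2, 0.4, 0.6
print(len(dis) / len(pool))      # mass of the region of disagreement
print(radius(near, f_star, pool))
```

In this toy case the ball of radius 0.2 has a region of disagreement of mass 0.4, twice the radius, because thresholds on either side of f_star each contribute their own strip of disputed points.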

Below, we will study a certain family of active learning algorithms, based on a general strategy known as disagreement-based active learning [Cohn, Atlas, and Ladner, 1994, Balcan, Beygelzimer, and Langford, 2006]. This strategy involves maintaining a set $V$ of candidate classifiers (one of which will be returned in the end), processing the unlabeled samples in sequence, and requesting the labels $Y_i$ of samples $X_i$ in $\mathrm{DIS}(V)$. This ensures that we request the label of any sample for which there is some uncertainty about the classification the returned classifier will assign to it. The set $V$ is then periodically updated by removing classifiers with relatively poor performance on the queried samples. We will discuss this strategy in more detail in Chapter 5, but even from this rough description, it should be clear that analysis of its label complexity will necessarily involve characterizing properties of the regions $\mathrm{DIS}(V)$, for the sets $V$ obtained in the course of the execution. In particular, since this strategy only requests the labels of samples in $\mathrm{DIS}(V)$, it will be important to characterize the probability that a random sample $X_i$ is in $\mathrm{DIS}(V)$: that is, $P(\mathrm{DIS}(V))$.
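The rough description above can be turned into a short Python sketch under strong simplifying assumptions that are not part of the text: the class is finite, the labels are noiseless and produced by a classifier in the class, and the candidate set is updated after every query (rather than periodically) by discarding every classifier that disagrees with the observed label. The names `Threshold`, `disagreement_based_learner`, and `label_of` are hypothetical, and this is not the algorithm analyzed in Chapter 5.

```python
import random

class Threshold:
    """Classifier h_t(x) = +1 if x >= t, else -1 (hypothetical example class)."""

    def __init__(self, t):
        self.t = t

    def __call__(self, x):
        return +1 if x >= self.t else -1

def disagreement_based_learner(hypotheses, xs, label_of, budget):
    """Simplified noiseless sketch: query a label only when x lies in DIS(V),
    then keep the classifiers in V that agree with the observed label."""
    V = list(hypotheses)
    queries = 0
    for i, x in enumerate(xs):
        if len({h(x) for h in V}) > 1:       # x is in DIS(V)
            if queries == budget:            # a query is needed but none remain
                break
            y = label_of(i)
            queries += 1
            V = [h for h in V if h(x) == y]  # drop classifiers that erred on x
    return V[0], queries                     # V is nonempty in the noiseless case

rng = random.Random(2)
hypotheses = [Threshold(k / 50) for k in range(51)]
target = hypotheses[15]                      # hypothetical target, t = 0.3
xs = [rng.random() for _ in range(2000)]
ys = [target(x) for x in xs]

h_hat, queries = disagreement_based_learner(
    hypotheses, xs, lambda i: ys[i], budget=50)
print(queries, all(h_hat(x) == y for x, y in zip(xs, ys)))
```

Every query removes at least one classifier from V, so with 51 candidates at most 50 labels are ever requested, while points outside DIS(V) are classified identically by every remaining candidate and cost nothing.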

As we will see below, it is often straightforward to express a concise bound on $\mathrm{radius}(V)$, for the sets $V$ obtained in these algorithms. For this reason, in the interest of obtaining concise bounds on the label complexity, it will often be convenient to bound $P(\mathrm{DIS}(V))$ by a homogeneous linear function of a bound on $\mathrm{radius}(V)$. In the context of active learning, the coefficient in this linear function is typically referred to as the disagreement coefficient [following Hanneke, 2007b, 2009b]. A nearly-identical quantity has also appeared in the literature on ratio-type empirical processes [Alexander, 1987, Giné and Koltchinskii, 2006], there typically referred to as Alexander's capacity function. In both of these contexts, it is essentially used to describe the rate of collapse of $P(\mathrm{DIS}(\mathrm{B}(f^\star, \varepsilon)))$ as $\varepsilon \to 0$. It is formally defined as follows.

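Definition 2.2. For any $r_0 \geq 0$ and classifier $h$, define the disagreement coefficient of $h$ with respect to $\mathbb{C}$ under $P$ as

$$\theta_h(r_0) = \sup_{r > r_0} \frac{P(\mathrm{DIS}(\mathrm{B}(h, r)))}{r} \vee 1.$$

When $h = f^\star$, abbreviate this as $\theta(r_0) = \theta_{f^\star}(r_0)$, called the disagreement coefficient of the class $\mathbb{C}$ with respect to $P_{XY}$.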
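As a worked illustration of this definition, consider a standard toy case that is assumed here for the example and not taken from the text above: the instance space is $[0, 1]$, $P$ is uniform, and the class consists of thresholds $h_t(x) = +1$ iff $x \geq t$ for $t \in [0, 1]$. The symbols $h_t$, $s$, and $t$ are introduced only for this sketch.

```latex
% Assumptions (hypothetical example): X = [0,1], P uniform on [0,1],
% C = { h_t : t in [0,1] } with h_t(x) = +1 iff x >= t.
\begin{align*}
P(x : h_s(x) \neq h_t(x)) &= |s - t|, \\
\mathrm{B}(h_t, r) &= \{ h_s : s \in [0,1],\ |s - t| \leq r \}, \\
\mathrm{DIS}(\mathrm{B}(h_t, r)) &= [\max\{t - r, 0\},\ \min\{t + r, 1\}), \\
P(\mathrm{DIS}(\mathrm{B}(h_t, r))) &= \min\{t + r, 1\} - \max\{t - r, 0\} \leq 2r, \\
\theta_{h_t}(r_0) &= \sup_{r > r_0} \frac{P(\mathrm{DIS}(\mathrm{B}(h_t, r)))}{r} \vee 1 \leq 2.
\end{align*}
```

The bound holds with equality whenever $r_0 < \min\{t, 1 - t\}$, since the ratio is exactly 2 for any $r$ with $r_0 < r \leq \min\{t, 1 - t\}$. In this example the mass of the region of disagreement therefore shrinks linearly with the radius of the ball.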

Recalling the motivating discussion above, note that, for any $V \subseteq \mathbb{C}$ and $r \geq \max\{\mathrm{radius}(V), r_0\}$, we have $P(\mathrm{DIS}(V)) \leq \theta(r_0)\, r$, so that the disagreement coefficient can indeed be used to relate $P(\mathrm{DIS}(V))$ to
