Xiaojin Zhu's 2008 Semi-Supervised Learning Survey: Key Methods and Advances


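Semi-supervised learning is an important research direction in machine learning. It studies how to use unlabeled data to improve a model's performance and generalization when only part of the dataset is labeled. The Semi-Supervised Learning Literature Survey was written by Xiaojin Zhu in 2008. It covers the state of research and the main methodologies of that time, and aims to give readers a framework for understanding the field.

The survey opens with frequently asked questions (FAQ): why learning can still work when labels are scarce, and what the strengths and limits of the approach are. The central question is how unlabeled data can compensate for the sparsity of labeled data.

It then discusses generative models, including identifiability, model correctness, and EM local maxima. These methods model the probability distribution of the data, for example a mixture with one latent component per class, and use unlabeled data to help estimate it. The Fisher kernel connects generative models to discriminative learning: each example is converted into a Fisher score vector derived from the generative model, and these vectors are fed to a standard discriminative classifier such as an SVM.

Self-training is a commonly used semi-supervised strategy. An initial model classifies the unlabeled data, and its most confident predictions are added to the training set as new labeled examples; the model is then retrained and the procedure repeats.

Co-training and multiview learning have several learners teach one another from different views of the same data. Co-training trains two classifiers on two feature subsets that are assumed to be conditionally independent given the class, while multiview learning explores different representations of the data more broadly.

The part on avoiding changes in dense regions covers methods that keep the decision boundary out of high-density regions of the data. It includes transductive SVMs (S3VMs), Gaussian processes, information regularization, and entropy minimization, together with their connection to graph-based methods.

Graph-based methods form another major branch. They treat the data as a graph whose edges encode similarity between examples. The survey covers regularization on graphs (mincut, Markov random fields, Gaussian random fields and harmonic functions), local and global consistency, Tikhonov regularization, manifold regularization, graph kernels from the spectrum of the Laplacian, and the spectral graph transducer.

In short, the survey examines the core methods of semi-supervised learning and shows how unlabeled data can be used for training when labeled data is scarce. It is a useful reference for both researchers and practitioners.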

[Figure 3 panels: (a) Horizontal class separation (Class 1, Class 2); (b) High probability; (c) Low probability]

Figure 3: If the model is wrong, higher likelihood may lead to lower classification accuracy. For example, (a) is clearly not generated from two Gaussians. If we insist that each class is a single Gaussian, (b) will have higher probability than (c). But (b) has around 50% accuracy, while (c)'s is much better.

2.3 EM Local Maxima

Even if the mixture model assumption is correct, in practice the mixture components are identified by the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). EM is prone to local maxima. If a local maximum is far from the global maximum, unlabeled data may again hurt learning. Remedies include a smart choice of starting point by active learning (Nigam, 2001).
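To make the role of the starting point concrete, here is a minimal sketch of semi-supervised EM, assuming a 1-D mixture of two unit-variance Gaussians with one component per class. The function names, the toy data, and the fixed variance are illustrative assumptions, not taken from the survey. The sketch does not implement the active-learning remedy cited above; it only exposes `init_means` so that runs from different starting points can be compared by their final log likelihood.

```python
import math

def normal_pdf(x, mean):
    """Unit-variance Gaussian density (variance fixed to keep the sketch short)."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def log_likelihood(labeled, unlabeled, means, weights):
    """Joint log likelihood of labeled (x, y) pairs and unlabeled x values."""
    ll = sum(math.log(weights[y] * normal_pdf(x, means[y])) for x, y in labeled)
    for x in unlabeled:
        ll += math.log(sum(w * normal_pdf(x, m) for w, m in zip(weights, means)))
    return ll

def semi_supervised_em(labeled, unlabeled, init_means, iters=30):
    means, weights = list(init_means), [0.5, 0.5]
    xs = [x for x, _ in labeled] + list(unlabeled)
    history = [log_likelihood(labeled, unlabeled, means, weights)]
    for _ in range(iters):
        # E-step: labeled points keep their known class; unlabeled points
        # get soft responsibilities under the current parameters.
        resp = [[1.0 if y == k else 0.0 for k in range(2)] for _, y in labeled]
        for x in unlabeled:
            joint = [w * normal_pdf(x, m) for w, m in zip(weights, means)]
            resp.append([j / sum(joint) for j in joint])
        # M-step: responsibility-weighted means and mixing weights.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / mass
            weights[k] = mass / len(xs)
        history.append(log_likelihood(labeled, unlabeled, means, weights))
    return means, weights, history

# Toy data (hypothetical): class labels are 0 and 1.
labeled = [(-1.0, 0), (5.0, 1)]
unlabeled = [-1.5, -0.5, 0.0, 0.5, 4.0, 4.5, 5.5, 6.0]
means, weights, history = semi_supervised_em(labeled, unlabeled, init_means=(0.0, 1.0))
```

Each EM iteration cannot decrease the log likelihood, so `history` only tells you that a run has converged, not that it reached the global maximum. Calling `semi_supervised_em` again with a different `init_means` and comparing `history[-1]` is one simple way to see whether two starts end in different optima.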

2.4 Cluster-and-Label

We shall also mention that instead of using a probabilistic generative mixture model, some approaches employ various clustering algorithms to cluster the whole dataset, then label each cluster with labeled data, e.g. (Demiriz et al., 1999; Dara et al., 2002). Although they can perform well if the particular clustering algorithms match the true data distribution, these approaches are hard to analyze due to their algorithmic nature.
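A minimal sketch of cluster-and-label follows, assuming 1-D data, plain k-means as the clustering algorithm, and a majority vote of the labeled points inside each cluster. These choices, the function names, and the toy data are illustrative assumptions; the cited papers use other clustering algorithms.

```python
from collections import Counter

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda k: abs(p - centers[k]))

def kmeans_1d(points, centers, iters=20):
    """Plain Lloyd's k-means on 1-D points, started from the given centers."""
    centers = list(centers)
    for _ in range(iters):
        assign = [nearest(p, centers) for p in points]
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:  # leave an empty cluster's center where it is
                centers[k] = sum(members) / len(members)
    return [nearest(p, centers) for p in points], centers

def cluster_and_label(points, labels, centers):
    """labels[i] is a class label, or None if point i is unlabeled."""
    # Step 1: cluster the whole dataset, labeled and unlabeled together.
    assign, _ = kmeans_1d(points, centers)
    # Step 2: each cluster takes the majority label of its labeled members.
    votes = {}
    for a, y in zip(assign, labels):
        if y is not None:
            votes.setdefault(a, Counter())[y] += 1
    cluster_label = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    # A cluster with no labeled member stays unlabeled (None).
    return [cluster_label.get(a) for a in assign]

points = [0.0, 0.2, 0.4, 0.1, 5.0, 5.2, 4.8, 5.1]
labels = ["A", None, None, None, "B", None, None, None]
predicted = cluster_and_label(points, labels, centers=[0.0, 5.0])
```

On this toy data the two clusters coincide with the two classes, so one labeled point per cluster is enough. When the clusters do not match the class structure, the same vote propagates wrong labels, which is the failure mode the paragraph above warns about.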

2.5 Fisher kernel for discriminative learning

Another approach for semi-supervised learning with generative models is to convert data into a feature representation determined by the generative model. The new feature representation is then fed into a standard discriminative classifier. Holub et al. (2005) used this approach for image categorization. First a generative mixture model is trained, one component per class. At this stage the unlabeled data can be incorporated via EM, which is the same as in previous subsections. However, instead of directly using the generative model for classification, each labeled example is converted into a fixed-length Fisher score vector, i.e. the derivatives of log likelihood w.r.t. model parameters, for all component models (Jaakkola & Haussler, 1998). These Fisher score vectors are then used in a discriminative classifier like an SVM, which empirically has high accuracy.
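As a small illustration of a Fisher score vector, the sketch below assumes each class component is a 1-D Gaussian with parameters (mean, variance) and concatenates the derivatives of each component's log likelihood with respect to those two parameters. The component values and function names are hypothetical, and this is not the image model used by Holub et al. The final step of training an SVM on the resulting vectors is omitted.

```python
def gaussian_score(x, mean, var):
    """Derivatives of log N(x | mean, var) with respect to mean and var."""
    d_mean = (x - mean) / var
    d_var = ((x - mean) ** 2 - var) / (2 * var ** 2)
    return [d_mean, d_var]

def fisher_vector(x, components):
    """Concatenate the score of x under every component model."""
    vec = []
    for mean, var in components:
        vec.extend(gaussian_score(x, mean, var))
    return vec

# Two class components, assumed to be already fitted (e.g. by EM).
components = [(0.0, 1.0), (4.0, 1.0)]
phi = fisher_vector(1.0, components)
```

The vector has a fixed length (here two entries per component) regardless of the input, which is what allows it to be passed to a standard discriminative classifier.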

3 Self-Training

Self-training is a commonly used technique for semi-supervised learning. In self-training a classifier is first trained with the small amount of labeled data. The classifier is then used to classify the unlabeled data. Typically the most confident unlabeled points, together with their predicted labels, are added to the training set. The classifier is re-trained and the procedure repeated. Note the classifier uses its own predictions to teach itself. The procedure is also called self-teaching or bootstrapping (not to be confused with the statistical procedure with the same name). The generative model and EM approach of section 2 can be viewed as a special case of 'soft' self-training. One can imagine that a classification mistake can reinforce itself. Some algorithms try to avoid this by 'unlearning' unlabeled points if the prediction confidence drops below a threshold.
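The loop described above can be sketched as follows. The base learner here is a 1-D nearest-centroid classifier, and confidence is taken to be the gap between the distances to the two nearest centroids; both are assumptions chosen to keep the example short, since self-training wraps any classifier that can report a confidence. The 'unlearning' step is not included.

```python
def fit_centroids(xs, ys):
    """Nearest-centroid base learner: one mean per class (1-D inputs)."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}

def predict_with_confidence(centroids, x):
    """Return (label, confidence). Needs at least two classes."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - centroids[second]) - abs(x - centroids[best])

def self_train(labeled_x, labeled_y, unlabeled_x, per_round=2):
    xs, ys, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    while pool:
        # Train on everything labeled so far, including earlier self-labels.
        model = fit_centroids(xs, ys)
        # Rank the remaining unlabeled points by confidence, highest first.
        ranked = sorted(pool, key=lambda x: predict_with_confidence(model, x)[1],
                        reverse=True)
        # Move the most confident points, with predicted labels, into the training set.
        for x in ranked[:per_round]:
            xs.append(x)
            ys.append(predict_with_confidence(model, x)[0])
            pool.remove(x)
    return fit_centroids(xs, ys)

model = self_train([0.0, 6.0], ["neg", "pos"], [0.5, 1.0, 2.5, 3.5, 5.0, 5.5])
```

On this toy data the points nearest the two labeled examples are absorbed first, and the ambiguous points 2.5 and 3.5 are labeled last, once the centroids have moved. Nothing in the loop revisits an earlier self-assigned label, so a wrong early prediction would stay in the training set and reinforce itself.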

Self-training has been applied to several natural language processing tasks. Yarowsky (1995) uses self-training for word sense disambiguation, e.g. deciding whether the word 'plant' means a living organism or a factory in a given context. Riloff et al. (2003) use it to identify subjective nouns. Maeireizo et al. (2004) classify dialogues as 'emotional' or 'non-emotional' with a procedure involving two classifiers. Self-training has also been applied to parsing and machine translation. Rosenberg et al. (2005) apply self-training to object detection systems from images, and show the semi-supervised technique compares favorably with a state-of-the-art detector.

Self-training is a wrapper algorithm, and is hard to analyze in general. However, for specific base learners, there have been some analyses of convergence. See e.g. (Haffari & Sarkar, 2007; Culp & Michailidis, 2007).

4 Co-Training and Multiview Learning

4.1 Co-Training

Co-training (Blum & Mitchell, 1998; Mitchell, 1999) assumes that (i) features can be split into two sets; (ii) each sub-feature set is sufficient to train a good classifier; (iii) the two sets are conditionally independent given the class. Initially two separate classifiers are trained with the labeled data, on the two sub-feature sets respectively. Each classifier then classifies the unlabeled data, and 'teaches' the other classifier with the few unlabeled examples (and the predicted labels) it feels most confident about. Each classifier is retrained with the additional training examples given by the other classifier, and the process repeats.

[Figure 4 panels: (a) x1 view; (b) x2 view]

Figure 4: Co-Training: conditional independence assumption on the feature split. With this assumption the highly confident data points in the x1 view, represented by circled labels, will be randomly scattered in the x2 view. This is advantageous if they are to be used to teach the classifier in the x2 view.
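A minimal sketch of the co-training loop in this section, assuming each example is a pair (x1, x2) of 1-D views and each view's classifier is a nearest-centroid learner. The learner, the confidence measure, the one-example-per-round schedule, and the toy data are illustrative assumptions rather than the setup of Blum & Mitchell.

```python
def fit(xs, ys):
    """Nearest-centroid learner on a single 1-D view."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {y: sum(v) / len(v) for y, v in groups.items()}

def predict(model, x):
    """Return (label, confidence); confidence is the distance gap between classes."""
    ranked = sorted(model, key=lambda y: abs(x - model[y]))
    return ranked[0], abs(x - model[ranked[1]]) - abs(x - model[ranked[0]])

def fit_view(train, v):
    """Fit the classifier of view v on its own training set."""
    return fit([x[v] for x, _ in train], [y for _, y in train])

def co_train(labeled, unlabeled, rounds=10):
    """labeled: list of ((x1, x2), y); unlabeled: list of (x1, x2)."""
    train = [list(labeled), list(labeled)]  # one training set per view
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        models = [fit_view(train[v], v) for v in (0, 1)]
        for v in (0, 1):
            if not pool:
                break
            # Classifier v picks the pool example it is most confident about ...
            pick = max(pool, key=lambda x: predict(models[v], x[v])[1])
            label = predict(models[v], pick[v])[0]
            # ... and hands it, with its predicted label, to the other classifier.
            train[1 - v].append((pick, label))
            pool.remove(pick)
    return [fit_view(train[v], v) for v in (0, 1)]

labeled = [((0.0, 0.0), "neg"), ((4.0, 4.0), "pos")]
unlabeled = [(0.5, 1.8), (1.9, 0.2), (3.6, 2.2), (2.1, 3.9)]
models = co_train(labeled, unlabeled)
```

In the toy data every unlabeled example is clear in one view and ambiguous in the other, so each classifier receives exactly the examples it could not have labeled confidently on its own. This mirrors the point of Figure 4: confident points in one view are informative for the other view only if the two views are not redundant.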

In co-training, unlabeled data helps by reducing the version space size. In other words, the two classifiers (or hypotheses) must agree on the much larger unlabeled data as well as the labeled data.

We need the assumption that sub-features are sufficiently good, so that we can trust the labels assigned by each learner on the unlabeled data U. We need the sub-features to be conditionally independent so that one classifier's highly confident data points are iid samples for the other classifier. Figure 4 visualizes the assumption.

Nigam and Ghani (2000) perform extensive empirical experiments to compare co-training with generative mixture models and EM. Their results show that co-training performs well if the conditional independence assumption indeed holds. In addition, it is better to probabilistically label the entire U, instead of a few most confident data points. They name this paradigm co-EM. Finally, if there is no natural feature split, the authors create an artificial split by randomly breaking the feature set into two subsets. They show co-training with an artificial feature split still helps, though not as much as before. Collins and Singer (1999) and Jones (2005) used co-training, co-EM and other related methods for information extraction from text. Balcan and Blum (2006) show that co-training can be quite effective, to the point that in the extreme case only one labeled point is needed to learn the classifier. Zhou et al. (2007) give a co-training algorithm using Canonical Correlation Analysis which also needs only one labeled point. Dasgupta et al. (2001) provide a PAC-style theoretical analysis.

Co-training makes strong assumptions on the splitting of features. One might wonder if these conditions can be relaxed. Goldman and Zhou (2000) use two learners of different types, both of which take the whole feature set, and essentially use
