Extracting Privileged Information from Untagged Data to Improve Classifier Performance


"提取特权信息以增强分类器学习" 在机器学习领域，数据的质量和数量对模型的准确性至关重要。然而，实际情况中，训练数据往往受限于量的不足或质量的低下。特权信息（Privileged Information, PI）作为一种额外的、有助于提升分类器性能的信息，如属性、标签或特性，通常被用来改善学习过程。例如，手动标注的属性可以提供更丰富的上下文信息，帮助模型更好地理解数据。但是，手动标注的过程既耗时又费力，而且受限于个人知识，可能会导致特权信息不够全面。针对这些问题，本文提出了一种从未标记数据中自动探索特权信息来增强分类器学习的方法，旨在减少对人工标注数据的依赖并获取更为丰富的信息。具体来说，研究者将每个选取的特权信息视为一个子类别，并为每个子类别独立学习一个分类器。这些子类别的分类器随后会集成在一起，形成一个更为强大的类别分类器。这种方法的核心思想是利用未标记数据中的潜在结构和模式，以无监督或弱监督的方式挖掘出有价值的特权信息。在论文中，作者Yazhou Yao、Fumin Shen、Jian Zhang等人探讨了如何有效地从未经标注的语料库中提取这些信息。他们可能采用了某种形式的半监督学习或者自监督学习策略，通过分析数据内在的分布和关系，推断出代表性的特征。这种特性使得模型能够在没有大量人工标注的情况下，依然能够学习到数据的深层次特征，从而提高整体的分类性能。此外，文章可能还涉及了如何评估和验证这种方法的有效性，可能包括在各种基准数据集上的实验，比较有无特权信息情况下分类器的性能差异，以及与现有方法的对比。这通常涉及到精确率、召回率、F1分数等指标的计算，以量化模型的分类效果。这篇研究论文提出了一个创新的解决方案，以自动化的方式从海量未标记数据中挖掘特权信息，减轻了对人工标注的依赖，增强了分类器的学习能力，对于提升机器学习模型在有限或低质量训练数据条件下的表现具有重要的理论和实际意义。

438 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 28, NO. 1, JANUARY 2019

Fig. 2. System overview. Our proposed approach mainly consists of three major steps: discovering PI, purifying PI, and learning the integrated classifier.

proposed in [12]. However, both [7] and [12] only use the visual features. In our work, we associate the “bags” with textual privileged information and propose a new MIL model to select images and learn the optimal classifiers.

Our work is primarily inspired by the following work. A visual concept learning system was recently proposed in [15] and achieved impressive accuracy for object detection. It discovers an exhaustive vocabulary explaining all the appearance variations from Google Books Corpora, and trains full-fledged detection models for it. Our work differs in two aspects. First, we adopt a different approach to filter out the noisy privileged information. Second, we leverage a different strategy to purify the collected web images. Compared to the method in [15], which prunes noisy images with an iterative mechanism, our method formulates noisy image removal as an instance-level multi-instance learning problem. In this way, images from different distributions can be kept while noise is filtered out.

III. FRAMEWORK AND METHODS

We seek to automate the process of learning robust classifiers directly from the web data without human intervention. As shown in Fig. 2, our proposed approach mainly consists of three major steps: discovering privileged information, purifying privileged information, and learning the integrated classifier. Next, we give the details of each step.

For ease of presentation, we denote each instance as x with its label y and each bag G with the label Y. A matrix/vector is denoted by an uppercase/lowercase letter in boldface. The transpose of a vector or matrix is represented by […]. Moreover, we denote the indicator function as λ(a=b), in which λ(a=b)=0 if a=b, and λ(a=b)=1 otherwise.

A. Discovering Privileged Information

Inspired by recent works [15], [21], we can use Google Books Corpora [27] to discover an exhaustive vocabulary explaining all the appearance variations for the given category. Compared to manually labeled WordNet [44] and ConceptNet [49], it is not only much richer but also more general and exhaustive.

Following [27] (see Section 4.3), we specifically treat the dependency gram data with parts-of-speech (POS) as the privileged information. For example, given a category (e.g., “horse”) and its corresponding POS tag (e.g., ‘jumping, VERB’), we find all its occurrences annotated with the POS tag within the dependency gram data. Of all the n-gram dependencies retrieved for the given category, we choose those whose modifiers are NOUN, VERB, ADJECTIVE, and ADVERB as the discovered privileged information.
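
The POS-based selection just described can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the `(head, modifier, modifier_pos)` tuple layout is a hypothetical stand-in for the real dependency gram files, and the function name and sample records are made up for illustration.

```python
# POS tags of modifiers that are kept, as listed in the text above.
KEPT_POS = {"NOUN", "VERB", "ADJECTIVE", "ADVERB"}

def discover_privileged_info(category, dependencies):
    """Select candidate PI phrases for one category.

    dependencies: iterable of (head, modifier, modifier_pos) tuples.
    This record layout is an assumption made for this sketch.
    """
    found = []
    for head, modifier, pos in dependencies:
        # Keep only dependencies on the given category with an allowed POS tag.
        if head == category and pos in KEPT_POS:
            candidate = f"{modifier} {head}"
            if candidate not in found:  # skip repeated occurrences
                found.append(candidate)
    return found

# Made-up sample records.
records = [
    ("horse", "jumping", "VERB"),
    ("horse", "the", "DET"),
    ("horse", "white", "ADJECTIVE"),
    ("dog", "running", "VERB"),
    ("horse", "jumping", "VERB"),
]
print(discover_privileged_info("horse", records))
# ['jumping horse', 'white horse']
```

The determiner is dropped because its tag is not in the kept set, and the repeated "jumping" record yields a single candidate.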

B. Purifying Privileged Information

Not all the discovered privileged information is useful, and some noise may also be included. In this second, PI-purifying step, “noise” refers to text PI from the untagged corpora (e.g., the bold privileged information in Table I). Using noisy privileged information to enhance classifier learning will hurt both accuracy and robustness. We therefore need to separate useful privileged information from noise before learning classifiers.

Our basic idea is to filter out the noisy privileged information from the perspective of relevance. Specifically, we represent the semantic distance of all discovered privileged information by a graph in which the given category (e.g., “dog”) is the center y. Each other piece of discovered privileged information has a score S corresponding to the Normalized Google Distance (NGD) [2]
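
For reference, the Normalized Google Distance mentioned here is commonly defined from search hit counts as NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). The Python sketch below computes that standard formula. How the paper turns it into the score S is not shown in this excerpt, so the function name, the handling of zero counts, and the example counts are assumptions for illustration only.

```python
import math

def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from hit counts.

    fx, fy: number of pages containing each term alone
    fxy:    number of pages containing both terms
    n:      total number of indexed pages (normalizing constant)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Assumption for this sketch: terms that never co-occur are infinitely far apart.
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Made-up counts: a term compared with itself has distance 0.
print(normalized_google_distance(1000, 1000, 1000, 10**6))  # 0.0
```

A smaller value means the two terms tend to appear together, which is why a distance of this kind can serve as a relevance signal between a category and a candidate piece of privileged information.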

