
Self-taught Learning: Transfer Learning from Unlabeled Data
Rajat Raina rajatr@cs.stanford.edu
Alexis Battle ajbattle@cs.stanford.edu
Honglak Lee hllee@cs.stanford.edu
Benjamin Packer bpacker@cs.stanford.edu
Andrew Y. Ng ang@cs.stanford.edu
Computer Science Department, Stanford University, CA 94305 USA
Abstract
We present a new machine learning framework called “self-taught learning” for using unlabeled data in supervised classification tasks. We do not assume that the unlabeled data follows the same class labels or generative distribution as the labeled data. Thus, we would like to use a large number of unlabeled images (or audio samples, or text documents) randomly downloaded from the Internet to improve performance on a given image (or audio, or text) classification task. Such unlabeled data is significantly easier to obtain than in typical semi-supervised or transfer learning settings, making self-taught learning widely applicable to many practical learning problems. We describe an approach to self-taught learning that uses sparse coding to construct higher-level features using the unlabeled data. These features form a succinct input representation and significantly improve classification performance. When using an SVM for classification, we further show how a Fisher kernel can be learned for this representation.
1. Introduction
Labeled data for machine learning is often very difficult and expensive to obtain, and thus the ability to use unlabeled data holds significant promise in terms of vastly expanding the applicability of learning methods. In this paper, we study a novel use of unlabeled data for improving performance on supervised learning tasks. To motivate our discussion, consider as a running example the computer vision task of classifying images of elephants and rhinos. For this task, it is difficult to obtain many labeled examples of elephants and rhinos; indeed, it is difficult even to obtain many unlabeled examples of elephants and rhinos. (In fact, we find it difficult to envision a process for collecting such unlabeled images that does not immediately also provide the class labels.)
This makes the classification task quite hard with existing algorithms for using labeled and unlabeled data, including most semi-supervised learning algorithms such as the one by Nigam et al. (2000). In this paper, we ask how unlabeled images from other object classes—which are much easier to obtain than images specifically of elephants and rhinos—can be used. For example, given unlimited access to unlabeled, randomly chosen images downloaded from the Internet (probably none of which contain elephants or rhinos), can we do better on the given supervised classification task?
Our approach is motivated by the observation that even many randomly downloaded images will contain basic visual patterns (such as edges) that are similar to those in images of elephants and rhinos. If, therefore, we can learn to recognize such patterns from the unlabeled data, these patterns can be used for the supervised learning task of interest, such as recognizing elephants and rhinos. Concretely, our approach learns a succinct, higher-level feature representation of the inputs using unlabeled data; this representation makes the classification task of interest easier.
Although we use computer vision as a running example, the problem that we pose to the machine learning community is more general. Formally, we consider solving a supervised learning task given labeled and unlabeled data, where the unlabeled data does not share the class labels or the generative distribution of the labeled data. For example, given unlimited access to natural sounds (audio), can we perform better speaker identification? Given unlimited access to news articles (text), can we perform better email foldering of “ICML reviewing” vs. “NIPS reviewing” emails?
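For concreteness, the setting can be summarized as follows (the notation below is ours, added for illustration):

  Given:  a labeled set T = {(x_l^(1), y^(1)), ..., (x_l^(m), y^(m))}, x_l^(i) ∈ R^n, drawn from a distribution D;
          an unlabeled set U = {x_u^(1), ..., x_u^(k)}, x_u^(j) ∈ R^n, not assumed to be drawn from D or to admit the task's labels.
  Goal:   a classifier h that attains low error on new examples drawn from D.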
Like semi-supervised learning (Nigam et al., 2000), our algorithms will therefore use labeled and unlabeled data. But unlike semi-supervised learning as it is typically studied in the literature, we do not assume that the unlabeled data can be assigned to the supervised learning task’s class labels. To thus distinguish our formalism from such forms of semi-supervised learning, we will call our task self-taught learning.
There is no prior general, principled framework for incorporating such unlabeled data into a supervised learning algorithm.