the missing steps of the training procedure (forwards) as
needed during the backward pass. However, this would re-
quire too much memory to be practical for large neural nets
trained for thousands of minibatches.
3. Experiments
In typical machine learning applications, only a few hyper-
parameters (fewer than 20) are optimized. Since each ex-
periment only yields a single number (the validation loss),
the search rapidly becomes more difficult as the dimen-
sion of the hyperparameter vector increases. In contrast,
when hypergradients are available, the amount of informa-
tion gained from each training run grows along with the
number of hyperparameters, allowing us to optimize thou-
sands of hyperparameters. How can we take advantage of
this new ability?
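As a rough illustration of this point, the sketch below differentiates through a small unrolled training run with JAX's automatic differentiation (standing in for the paper's reversible-memory scheme); the toy least-squares objective and all names are our own, not the paper's setup. A single reverse pass returns one gradient entry per hyperparameter, rather than the single scalar a black-box evaluation provides.

```python
import jax
import jax.numpy as jnp

def final_train_loss(lrs, w0, x, y):
    """Unroll SGD on a toy least-squares problem and return the final
    training loss; `lrs` holds one learning rate per step, and these
    per-step rates are the hyperparameters being differentiated."""
    loss = lambda w: jnp.mean((x @ w - y) ** 2)
    w = w0
    for t in range(lrs.shape[0]):               # unrolled elementary training loop
        w = w - lrs[t] * jax.grad(loss)(w)      # one SGD step
    return loss(w)

x = jax.random.normal(jax.random.PRNGKey(0), (20, 5))
y = jax.random.normal(jax.random.PRNGKey(1), (20,))
w0 = jnp.zeros(5)
lrs = jnp.full(10, 0.05)                        # 10 hyperparameters

# One reverse pass yields d(final loss)/d(learning rate) for every step:
# as many numbers per run as there are hyperparameters.
hypergrad = jax.grad(final_train_loss)(lrs, w0, x, y)
print(hypergrad.shape)                          # (10,)
```

Here ten numbers come out of a single run; the same idea, scaled up, is what makes the richly parameterized schedules below tractable.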
This section shows several proof-of-concept experiments in
which we can more richly parameterize training and regu-
larization schemes in ways that would have been previously
impractical to optimize.
3.1. Gradient-based optimization of gradient-based
optimization
Modern neural net training procedures often employ var-
ious heuristics to set learning rate schedules, or fix their
shape using one or two hyperparameters chosen by cross-
validation (Dahl et al., 2014; Sutskever et al., 2013). These
schedule choices are supported by a mixture of intuition,
arguments about the shape of the objective function, and
empirical tuning.
To more directly shed light on good learning rate schedules,
we jointly optimized a separate learning rate for every single
iteration of training of a deep neural network, with separate
rates for the weights and the biases in each layer.
Each meta-iteration trained a network for 100 iterations of
SGD, meaning that the learning rate schedules were spec-
ified by 800 hyperparameters (100 iterations × 4 layers ×
2 types of parameters). To avoid learning an optimization
schedule that depended on the quirks of a particular random
initialization, each evaluation of hypergradients used a dif-
ferent random seed. These random seeds were used both to
initialize network weights and to choose minibatches. The
network was trained on 10,000 examples of MNIST, and
had 4 layers, of sizes 784, 50, 50, and 50.
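One plausible way to lay out these 800 hyperparameters is a single array indexed by iteration, layer, and parameter type, consumed one slice per elementary SGD step. The array shape, names, and log-space parameterization below are our own illustrative choices, not taken from the paper.

```python
import jax.numpy as jnp

# Hypothetical layout for the schedules described above: one learning
# rate per elementary SGD iteration, per layer, and per parameter type
# (index 0 = weights, 1 = biases), i.e. 100 * 4 * 2 = 800 hyperparameters.
num_iters, num_layers = 100, 4
log_lr_schedule = jnp.full((num_iters, num_layers, 2), jnp.log(0.1))

def sgd_step(params, grads, log_lrs_t):
    """One elementary SGD step. `params` and `grads` are lists of
    (weights, biases) pairs, one per layer; `log_lrs_t` is the slice of
    the schedule for the current iteration. Storing the rates in log
    space (our choice, not necessarily the paper's) keeps them positive
    no matter what the meta-optimizer does."""
    new_params = []
    for layer, ((W, b), (dW, db)) in enumerate(zip(params, grads)):
        lr_w = jnp.exp(log_lrs_t[layer, 0])   # weight learning rate
        lr_b = jnp.exp(log_lrs_t[layer, 1])   # bias learning rate
        new_params.append((W - lr_w * dW, b - lr_b * db))
    return new_params
```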
Because learning schedules can implicitly regularize net-
works (Erhan et al., 2010), for example by enforcing early
stopping, for this experiment we optimized the learning rate
schedules on the training error rather than on the validation
set error. Figure 2 shows the results of optimizing learning
rate schedules separately for each layer of a deep neural
network. When Bayesian optimization was used to choose a
fixed learning rate for all layers and iterations, it chose a
learning rate of 2.4.
Figure 2. A learning-rate training schedule for the weights in each
layer of a neural network, optimized by hypergradient descent.
The optimized schedule starts by taking large steps only in the
topmost layer, then takes larger steps in the first layer. All layers
take smaller step sizes in the last 10 iterations. Not shown are
the schedules for the biases or the momentum, which showed
less structure.
Figure 3. Elementary and meta-learning curves. The meta-learning
curve shows the training loss at the end of each elementary
iteration.
Meta-optimization strategies. We experimented with
several standard stochastic optimization methods for meta-
optimization, including SGD, RMSprop (Tieleman & Hin-
ton, 2012), and minibatch conjugate gradients. The results
in this section used Adam (Kingma & Ba, 2014), a variant
of RMSprop that includes momentum. We typically ran for
50 meta-iterations, and used a meta-step size of 0.04. Fig-
ure 3 shows the elementary and meta-learning curves that
generated the hyperparameters shown in Figure 2.
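A minimal sketch of this meta-loop is given below, assuming a toy stand-in `training_loss` for the elementary training run. The least-squares objective, the log-space schedule, and the collapse to a single rate per iteration are simplifications of ours; the Adam meta-step size of 0.04 and the 50 meta-iterations match the text, and the fresh seed per meta-iteration mirrors the setup described above.

```python
import jax
import jax.numpy as jnp
import optax

def training_loss(log_lrs, seed):
    """Toy stand-in for the elementary training run: unroll SGD on a
    small least-squares problem, drawing the initialization and data
    from `seed`, and return the final training loss."""
    key_w, key_x, key_y = jax.random.split(jax.random.PRNGKey(seed), 3)
    x = jax.random.normal(key_x, (50, 10))
    y = jax.random.normal(key_y, (50,))
    w = 0.1 * jax.random.normal(key_w, (10,))
    loss = lambda w: jnp.mean((x @ w - y) ** 2)
    for t in range(log_lrs.shape[0]):            # elementary iterations
        w = w - jnp.exp(log_lrs[t]) * jax.grad(loss)(w)
    return loss(w)

schedule = jnp.full(100, jnp.log(0.1))           # one rate per iteration (log space)
meta_opt = optax.adam(learning_rate=0.04)        # meta-step size 0.04
opt_state = meta_opt.init(schedule)

for meta_iter in range(50):                      # 50 meta-iterations
    # A fresh seed each meta-iteration, used for initialization (and, in
    # the real experiment, for choosing minibatches as well).
    hypergrad = jax.grad(training_loss)(schedule, meta_iter)
    updates, opt_state = meta_opt.update(hypergrad, opt_state)
    schedule = optax.apply_updates(schedule, updates)
```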
How smooth are hypergradients? To demonstrate that
the hypergradients are smooth with respect to time steps
in the training schedule, Figure 4 shows the hypergradient
with respect to the step-size schedule at the beginning of
training, averaged over 100 random seeds.
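Such an average can be computed along the following lines, reusing the hypothetical `training_loss` and `schedule` from the previous sketch (illustrative only).

```python
import jax
import jax.numpy as jnp

# Average the hypergradient of the initial schedule over many random
# seeds (the paper averages over 100), as in the smoothness check
# described above.
avg_hypergrad = jnp.mean(
    jnp.stack([jax.grad(training_loss)(schedule, seed) for seed in range(100)]),
    axis=0,
)
```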