Figure 1: A deep fully-connected neural network with N+2 layers and ReLU nonlinearities. With this generic
fully connected network, we prove that, with a single step of gradient descent, the model can approximate any
function of the dataset and test input.
4 UNIVERSALITY OF THE ONE-SHOT GRADIENT-BASED LEARNER
We first introduce a proof of the universality of gradient-based meta-learning for the special case with only one training point, corresponding to one-shot learning. We denote the training datapoint as $(x, y)$, and the test input as $x^\star$. A universal learning algorithm approximator corresponds to the ability of a meta-learner to represent any function $f_{\text{target}}(x, y, x^\star)$ up to arbitrary precision.
We will proceed by construction, showing that there exists a neural network function $\hat{f}(\cdot; \theta)$ such that $\hat{f}(x^\star; \theta')$ approximates $f_{\text{target}}(x, y, x^\star)$ up to arbitrary precision, where $\theta' = \theta - \alpha \nabla_\theta \ell(y, f(x))$ and $\alpha$ is the non-zero learning rate. The proof holds for a standard multi-layer ReLU network, provided that it has sufficient depth. As we discuss in Section 6, the loss function $\ell$ cannot be any loss function, but the standard cross-entropy and mean-squared error objectives are both suitable.
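As a concrete reference point, the sketch below spells out this one-step adaptation followed by prediction on the test input for a toy linear model with a squared-error loss; the model, loss, and dimensions are illustrative assumptions, not the ReLU construction used in the proof.

```python
# Minimal sketch of the one-shot gradient-based learner: take one gradient
# step on the single training pair (x, y), then predict on the test input
# x_star.  The linear model f and squared-error loss are hypothetical
# stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta = rng.normal(size=(d, d))          # parameters of f(.; theta)
x, y = rng.normal(size=d), rng.normal(size=d)
x_star = rng.normal(size=d)              # test input
alpha = 0.1                              # non-zero learning rate

def f(x, theta):
    return theta @ x

# Gradient of l(y, f(x)) = 0.5 * ||f(x) - y||^2 with respect to theta.
grad_theta = np.outer(f(x, theta) - y, x)

theta_prime = theta - alpha * grad_theta     # theta' = theta - alpha * grad
prediction = f(x_star, theta_prime)          # post-update prediction on x_star
print(prediction)
```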
In this proof, we will start by presenting the form of $\hat{f}$ and deriving its value after one gradient step. Then, to show universality, we will construct a setting of the weight matrices that enables independent control of the information flow coming forward from $x$ and $x^\star$, and backward from $y$.
We will start by constructing $\hat{f}$, which, as shown in Figure 1, is a generic deep network with $N+2$ layers and ReLU nonlinearities. Note that, for a particular weight matrix $W_i$ at layer $i$, a single gradient step $W_i - \alpha \nabla_{W_i} \ell$ can only represent a rank-1 update to the matrix $W_i$. That is because the gradient of $W_i$ is the outer product of two vectors, $\nabla_{W_i} \ell = a_i b_{i-1}^T$, where $a_i$ is the error gradient with respect to the pre-synaptic activations at layer $i$, and $b_{i-1}$ is the forward post-synaptic activations at layer $i-1$. The expressive power of a single gradient update to a single weight matrix is therefore quite limited. However, if we sequence $N$ weight matrices as $\prod_{i=1}^{N} W_i$, corresponding to multiple linear layers, it is possible to acquire a rank-$N$ update to the linear function represented by $W = \prod_{i=1}^{N} W_i$. Note that deep ReLU networks act like deep linear networks when the input and pre-synaptic activations are non-negative. Motivated by this reasoning, we will construct $\hat{f}(\cdot; \theta)$ as a deep ReLU network where a number of the intermediate layers act as linear layers, which we ensure by showing that the input and pre-synaptic activations of these layers are non-negative. This allows us to simplify the analysis. The simplified form of the model is as follows:
$$\hat{f}(\cdot; \theta) = f_{\text{out}}\left(\left(\prod_{i=1}^{N} W_i\right) \phi(\cdot; \theta_{\text{ft}}, \theta_b); \theta_{\text{out}}\right),$$
where $\phi(\cdot; \theta_{\text{ft}}, \theta_b)$ represents an input feature extractor with parameters $\theta_{\text{ft}}$ and a scalar bias transformation variable $\theta_b$, $\prod_{i=1}^{N} W_i$ is a product of square linear weight matrices, $f_{\text{out}}(\cdot; \theta_{\text{out}})$ is a function at the output, and the learned parameters are $\theta := \{\theta_{\text{ft}}, \theta_b, \{W_i\}, \theta_{\text{out}}\}$. The input feature extractor and output function can be represented with fully connected neural networks with one or more hidden layers, which we know are universal function approximators, while $\prod_{i=1}^{N} W_i$ corresponds to a set of linear layers with non-negative input and activations.
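To make the decomposition concrete, here is a minimal NumPy sketch of this simplified form, with a small MLP for $\phi$, a chain of square matrices for $\prod_i W_i$, and a small MLP for $f_{\text{out}}$; the layer sizes, the way $\theta_b$ enters $\phi$, and the omission of the non-negativity constraint on the intermediate activations are simplifying assumptions made purely for illustration.

```python
# Illustrative sketch of f_hat(.; theta) = f_out((prod_i W_i) phi(.); theta_out).
# Sizes and the two-layer MLPs for phi and f_out are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
d_in, d, d_out, N = 3, 8, 1, 4

relu = lambda v: np.maximum(v, 0.0)

# phi(.; theta_ft, theta_b): feature extractor plus a scalar bias transformation.
A1, A2 = rng.normal(size=(d, d_in)), rng.normal(size=(d, d))
theta_b = np.zeros(1)
def phi(x):
    return relu(A2 @ relu(A1 @ x)) + theta_b

# Chain of square weight matrices W_1, ..., W_N.
Ws = [rng.normal(size=(d, d)) for _ in range(N)]

# f_out(.; theta_out): output function, here another small MLP.
B1, B2 = rng.normal(size=(d, d)), rng.normal(size=(d_out, d))
def f_out(h):
    return B2 @ relu(B1 @ h)

def f_hat(x):
    h = phi(x)
    for W in Ws:                       # corresponds to (prod_i W_i) phi(x)
        h = W @ h
    return f_out(h)

print(f_hat(rng.normal(size=d_in)))
```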
Next, we derive the form of the post-update prediction $\hat{f}(x^\star; \theta')$. Let $z = \left(\prod_{i=1}^{N} W_i\right) \phi(x; \theta_{\text{ft}}, \theta_b)$, and the error gradient $\nabla_z \ell = e(x, y)$. Then, the gradient with respect to each weight matrix $W_i$ is:
$$\nabla_{W_i} \ell(y, \hat{f}(x; \theta)) = \left(\prod_{j=1}^{i-1} W_j\right)^T e(x, y)\, \phi(x; \theta_{\text{ft}}, \theta_b)^T \left(\prod_{j=i+1}^{N} W_j\right)^T.$$
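As a quick numerical check of this expression (and of the rank-1 observation above), the sketch below compares it to a finite-difference gradient, assuming a squared-error loss $\ell(y, z) = \tfrac{1}{2}\|z - y\|^2$ so that $e(x, y) = z - y$; the dimensions are arbitrary and the check is purely illustrative.

```python
# Compare the analytic gradient above to a finite-difference estimate,
# assuming l(y, z) = 0.5 * ||z - y||^2 (so e(x, y) = z - y).
import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 3, 4, 1e-6
Ws = [rng.normal(size=(d, d)) for _ in range(N)]
phi = rng.normal(size=d)                 # stands in for phi(x; theta_ft, theta_b)
y = rng.normal(size=d)

def prod(mats):
    out = np.eye(d)
    for W in mats:                       # left-to-right product W_1 W_2 ...
        out = out @ W
    return out

def loss(mats):
    z = prod(mats) @ phi
    return 0.5 * float(np.sum((z - y) ** 2))

e = prod(Ws) @ phi - y                   # error gradient e(x, y) at z

for i in range(N):
    # Analytic gradient: (prod_{j<i} W_j)^T e(x, y) phi^T (prod_{j>i} W_j)^T,
    # an outer product of two vectors and hence rank 1.
    analytic = np.outer(prod(Ws[:i]).T @ e, prod(Ws[i+1:]) @ phi)
    assert np.linalg.matrix_rank(analytic) == 1

    numeric = np.zeros((d, d))           # central-difference gradient w.r.t. W_i
    for r in range(d):
        for c in range(d):
            plus = [W.copy() for W in Ws]
            minus = [W.copy() for W in Ws]
            plus[i][r, c] += eps
            minus[i][r, c] -= eps
            numeric[r, c] = (loss(plus) - loss(minus)) / (2 * eps)
    assert np.allclose(analytic, numeric, atol=1e-5)
```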
Therefore, the post-update value of $\prod_{i=1}^{N} W_i' = \prod_{i=1}^{N} \left(W_i - \alpha \nabla_{W_i} \ell\right)$ is given by
$$\prod_{i=1}^{N} W_i - \alpha \sum_{i=1}^{N} \left(\prod_{j=1}^{i-1} W_j\right) \left(\prod_{j=1}^{i-1} W_j\right)^T e(x, y)\, \phi(x; \theta_{\text{ft}}, \theta_b)^T \left(\prod_{j=i+1}^{N} W_j\right)^T \left(\prod_{j=i+1}^{N} W_j\right) - O(\alpha^2),$$
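To sanity-check this expansion, the short NumPy sketch below compares the exact product of the updated matrices against the expression above with the $O(\alpha^2)$ term dropped, again assuming a squared-error loss so that $e(x, y) = z - y$, and confirms that the residual shrinks quadratically as $\alpha$ is reduced; all sizes and values are arbitrary illustrative choices.

```python
# Numerical sanity check of the first-order expansion of prod_i (W_i - alpha * grad_i),
# assuming a squared-error loss so that e(x, y) = z - y.  Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 4
Ws = [rng.normal(size=(d, d)) for _ in range(N)]
phi = rng.normal(size=d)                 # stands in for phi(x; theta_ft, theta_b)
y = rng.normal(size=d)

def prod(mats):
    out = np.eye(d)
    for W in mats:                       # left-to-right product W_1 W_2 ...
        out = out @ W
    return out

e = prod(Ws) @ phi - y                   # error gradient e(x, y) at z

# Gradient with respect to each W_i (same expression as derived above).
grads = [np.outer(prod(Ws[:i]).T @ e, prod(Ws[i+1:]) @ phi) for i in range(N)]

# The alpha-linear correction term appearing in the expansion above.
correction = sum(
    prod(Ws[:i]) @ prod(Ws[:i]).T @ np.outer(e, phi) @ prod(Ws[i+1:]).T @ prod(Ws[i+1:])
    for i in range(N))

def residual(alpha):
    exact = prod([W - alpha * g for W, g in zip(Ws, grads)])
    return np.max(np.abs(exact - (prod(Ws) - alpha * correction)))

# The residual scales as alpha^2: the ratio is roughly (1e-4 / 1e-5)**2 = 100.
print(residual(1e-4) / residual(1e-5))
```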