One-shot Learning with Memory-Augmented Neural Networks
Adam Santoro ADAMSANTORO@GOOGLE.COM
Google DeepMind
Sergey Bartunov SBOS@SBOS.IN
Google DeepMind, National Research University Higher School of Economics (HSE)
Matthew Botvinick BOTVINICK@GOOGLE.COM
Daan Wierstra WIERSTRA@GOOGLE.COM
Timothy Lillicrap COUNTZERO@GOOGLE.COM
Google DeepMind
Abstract
Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of “one-shot learning.” Traditional gradient-based networks require a lot of data to learn, often through extensive iterative training. When new data is encountered, the models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference. Architectures with augmented memory capacities, such as Neural Turing Machines (NTMs), offer the ability to quickly encode and retrieve new information, and hence can potentially obviate the downsides of conventional models. Here, we demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.
1. Introduction
The current success of deep learning hinges on the ability to apply gradient-based optimization to high-capacity models. This approach has achieved impressive results on many large-scale supervised tasks with raw sensory input, such as image classification (He et al., 2015), speech recognition (Yu & Deng, 2012), and games (Mnih et al., 2015; Silver et al., 2016). Notably, performance in such tasks is typically evaluated after extensive, incremental training on large data sets. In contrast, many problems of interest require rapid inference from small quantities of data. In the limit of “one-shot learning,” single observations should result in abrupt shifts in behavior.
This kind of flexible adaptation is a celebrated aspect of human learning (Jankowski et al., 2011), manifesting in settings ranging from motor control (Braun et al., 2009) to the acquisition of abstract concepts (Lake et al., 2015). Generating novel behavior based on inference from a few scraps of information – e.g., inferring the full range of applicability for a new word, heard in only one or two contexts – is something that has remained stubbornly beyond the reach of contemporary machine intelligence. It appears to present a particularly daunting challenge for deep learning. In situations when only a few training examples are presented one-by-one, a straightforward gradient-based solution is to completely re-learn the parameters from the data available at the moment. Such a strategy is prone to poor learning, and/or catastrophic interference. In view of these hazards, non-parametric methods are often considered to be better suited.
However, previous work does suggest one potential strategy for attaining rapid learning from sparse data, one that hinges on the notion of meta-learning (Thrun, 1998; Vilalta & Drissi, 2002). Although the term has been used in numerous senses (Schmidhuber et al., 1997; Caruana, 1997; Schweighofer & Doya, 2003; Brazdil et al., 2003), meta-learning generally refers to a scenario in which an agent learns at two levels, each associated with different time scales. Rapid learning occurs within a task, for example, when learning to accurately classify within a particular dataset. This learning is guided by knowledge accrued more gradually across tasks, which captures the way in which task structure varies across target domains (Giraud-Carrier et al., 2004; Rendell et al., 1987; Thrun, 1998). Given its two-tiered organization, this form of meta-learning is often described as “learning to learn.”
It has been proposed that neural networks with memory capacities could prove quite capable of meta-learning (Hochreiter et al., 2001). These networks shift their bias through weight updates, but also modulate their output by learning to rapidly cache representations in memory stores (Hochreiter & Schmidhuber, 1997). For example, LSTMs trained to meta-learn can quickly learn never-before-seen quadratic functions with a low number of data samples (Hochreiter et al., 2001).
Neural networks with a memory capacity provide a promising approach to meta-learning in deep networks. However, the specific strategy of using the memory inherent in unstructured recurrent architectures is unlikely to extend to settings where each new task requires significant amounts of new information to be rapidly encoded. A scalable solution has a few necessary requirements: First, information must be stored in memory in a representation that is both stable (so that it can be reliably accessed when needed) and element-wise addressable (so that relevant pieces of information can be accessed selectively). Second, the number of parameters should not be tied to the size of the memory. These two characteristics do not arise naturally within standard memory architectures, such as LSTMs. However, recent architectures, such as Neural Turing Machines (NTMs) (Graves et al., 2014) and memory networks (Weston et al., 2014), meet the requisite criteria. And so, in this paper we revisit the meta-learning problem and setup from the perspective of a highly capable memory-augmented neural network (MANN) (note: here on, the term MANN will refer to the class of external-memory equipped networks, and not other “internal” memory-based architectures, such as LSTMs).
We demonstrate that MANNs are capable of meta-learning in tasks that carry significant short- and long-term memory demands. This manifests as successful classification of never-before-seen Omniglot classes at human-like accuracy after only a few presentations, and principled function estimation based on a small number of samples. Additionally, we outline a memory access module that emphasizes memory access by content, and not additionally on memory location, as in original implementations of the NTM (Graves et al., 2014). Our approach combines the best of two worlds: the ability to slowly learn an abstract method for obtaining useful representations of raw data, via gradient descent, and the ability to rapidly bind never-before-seen information after a single presentation, via an external memory module. The combination supports robust meta-learning, extending the range of problems to which deep learning can be effectively applied.
2. Meta-Learning Task Methodology
Usually, we try to choose parameters $\theta$ to minimize a learning cost $\mathcal{L}$ across some dataset $D$. However, for meta-learning, we choose parameters to reduce the expected learning cost across a distribution of datasets $p(D)$:

$$\theta^{*} = \operatorname{argmin}_{\theta} \, \mathbb{E}_{D \sim p(D)}\big[\mathcal{L}(D; \theta)\big]. \tag{1}$$
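To make the objective concrete, here is a minimal sketch of the corresponding outer training loop: rather than minimizing the loss on one fixed dataset, we repeatedly sample a dataset from $p(D)$ and take a stochastic gradient step on its loss, so $\theta$ improves in expectation over the distribution. The helpers `sample_dataset`, `episode_loss`, and `grad` are hypothetical stand-ins, not part of the paper.

```python
def meta_train(theta, sample_dataset, episode_loss, grad,
               num_episodes=100_000, lr=1e-4):
    """Stochastic minimization of E_{D ~ p(D)}[L(D; theta)] (Eq. 1).
    `sample_dataset`, `episode_loss`, and `grad` are hypothetical
    stand-ins for an episode sampler, a loss, and its gradient."""
    for _ in range(num_episodes):
        D = sample_dataset()               # D ~ p(D)
        g = grad(episode_loss, theta, D)   # dL(D; theta)/dtheta
        theta = theta - lr * g             # gradient step on this episode
    return theta
```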
To accomplish this, proper task setup is critical (Hochreiter et al., 2001). In our setup, a task, or episode, involves the presentation of some dataset $D = \{d_t\}_{t=1}^{T} = \{(x_t, y_t)\}_{t=1}^{T}$. For classification, $y_t$ is the class label for an image $x_t$, and for regression, $y_t$ is the value of a hidden function for a vector with real-valued elements $x_t$, or simply a real-valued number $x_t$ (here on, for consistency, $x_t$ will be used). In this setup, $y_t$ is both a target, and is presented as input along with $x_t$, in a temporally offset manner; that is, the network sees the input sequence $(x_1, \text{null}), (x_2, y_1), \ldots, (x_T, y_{T-1})$. And so, at time $t$ the correct label for the previous data sample ($y_{t-1}$) is provided as input along with a new query $x_t$ (see Figure 1(a)). The network is tasked to output the appropriate label for $x_t$ (i.e., $y_t$) at the given timestep. Importantly, labels are shuffled from dataset to dataset. This prevents the network from slowly learning sample-class bindings in its weights. Instead, it must learn to hold data samples in memory until the appropriate labels are presented at the next time step, after which sample-class information can be bound and stored for later use (see Figure 1(b)). Thus, for a given episode, ideal performance involves a random guess for the first presentation of a class (since the appropriate label cannot be inferred from previous episodes, due to label shuffling), and the use of memory to achieve perfect accuracy thereafter. Ultimately, the system aims at modelling the predictive distribution $p(y_t \mid x_t, D_{1:t-1}; \theta)$, inducing a corresponding loss at each time step.
This task structure incorporates exploitable meta-knowledge: a model that meta-learns would learn to bind data representations to their appropriate labels regardless of the actual content of the data representation or label, and would employ a general scheme to map these bound representations to appropriate classes or function values for prediction.
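As an illustration of this episode structure, the sketch below builds one classification episode with per-episode label shuffling and temporally offset labels. The `class_images` container and the sampling details are illustrative assumptions, not the paper's exact data pipeline.

```python
import numpy as np

def make_episode(class_images, n_classes=5, seq_len=50, rng=np.random):
    """Sketch of one episode: sample classes, shuffle their labels,
    and build the temporally offset input stream (x_t, y_{t-1}).
    `class_images` maps a class id to an array of its images."""
    classes = rng.choice(len(class_images), size=n_classes, replace=False)
    # Labels are re-shuffled every episode, so sample-class bindings
    # cannot be memorized in the network weights.
    shuffled_labels = rng.permutation(n_classes)

    xs, ys = [], []
    for _ in range(seq_len):
        c = rng.randint(n_classes)                      # pick an episode class
        imgs = class_images[classes[c]]
        xs.append(imgs[rng.randint(len(imgs))].ravel()) # a sample of that class
        ys.append(shuffled_labels[c])                   # its episode-local label

    xs, ys = np.stack(xs), np.array(ys)
    # Offset labels by one step: the network sees (x_1, null), (x_2, y_1), ...
    prev_y = np.roll(ys, 1)
    prev_y[0] = -1  # null label for the first step
    inputs = list(zip(xs, prev_y))
    targets = ys    # the network must predict y_t at time t
    return inputs, targets
```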
3. Memory-Augmented Model
3.1. Neural Turing Machines
The Neural Turing Machine is a fully differentiable implementation of a MANN. It consists of a controller, such as a feed-forward network or LSTM, which interacts with an external memory module using a number of read and write heads (Graves et al., 2014). Memory encoding and retrieval in a NTM external memory module is rapid, with vector representations being placed into or taken out of memory potentially every time-step.
Figure 1. Task structure. (a) Task setup: Omniglot images (or x-values for regression), $x_t$, are presented with time-offset labels (or function values), $y_{t-1}$, to prevent the network from simply mapping the class labels to the output. From episode to episode, the classes to be presented in the episode, their associated labels, and the specific samples are all shuffled. (b) Network strategy: A successful strategy would involve the use of an external memory to store bound sample representation-class label information, which can then be retrieved at a later point for successful classification when a sample from an already-seen class is presented. Specifically, sample data $x_t$ from a particular time step should be bound to the appropriate class label $y_t$, which is presented in the subsequent time step. Later, when a sample from this same class is seen, it should retrieve this bound information from the external memory to make a prediction. Backpropagated error signals from this prediction step will then shape the weight updates from the earlier steps in order to promote this binding strategy.
This ability makes the NTM a perfect candidate for meta-learning and low-shot prediction, as it is capable of both long-term storage via slow updates of its weights, and short-term storage via its external memory module. Thus, if a NTM can learn a general strategy for the types of representations it should place into memory and how it should later use these representations for predictions, then it may be able to use its speed to make accurate predictions of data that it has only seen once.
The controllers employed in our model are either LSTMs or feed-forward networks. The controller interacts with an external memory module using read and write heads, which act to retrieve representations from memory or place them into memory, respectively. Given some input, $x_t$, the controller produces a key, $k_t$, which is then either stored in a row of a memory matrix $M_t$, or used to retrieve a particular memory, $i$, from a row; i.e., $M_t(i)$.
When retrieving a memory, $M_t$ is addressed using the cosine similarity measure,

$$K\big(k_t, M_t(i)\big) = \frac{k_t \cdot M_t(i)}{\|k_t\|\,\|M_t(i)\|}, \tag{2}$$

which is used to produce a read-weight vector, $w_t^r$, with elements computed according to a softmax:

$$w_t^r(i) \leftarrow \frac{\exp\!\big(K(k_t, M_t(i))\big)}{\sum_j \exp\!\big(K(k_t, M_t(j))\big)}. \tag{3}$$

A memory, $r_t$, is retrieved using this weight vector:

$$r_t \leftarrow \sum_i w_t^r(i)\, M_t(i). \tag{4}$$

This memory is used by the controller as the input to a classifier, such as a softmax output layer, and as an additional input for the next controller state.
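The read path of Equations (2)-(4) is straightforward to express directly. The following is a minimal NumPy sketch of this content-based read; the function name and the small `eps` stabilizer are our own, not from the paper.

```python
import numpy as np

def content_read(key, memory, eps=1e-8):
    """Content-based read per Eqs. (2)-(4): cosine similarity between
    the key k_t and each memory row M_t(i), a softmax over the
    similarities, and a weighted sum of rows.
    `key` has shape (d,), `memory` has shape (n_rows, d)."""
    # Eq. (2): cosine similarity K(k_t, M_t(i)) for every row i.
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    # Eq. (3): read weights w_t^r via a softmax over similarities.
    e = np.exp(sims - sims.max())   # shifted for numerical stability
    w_r = e / e.sum()
    # Eq. (4): retrieved memory r_t as a weighted sum of memory rows.
    r = w_r @ memory
    return r, w_r
```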
3.2. Least Recently Used Access
In previous instantiations of the NTM (Graves et al., 2014), memories were addressed by both content and location. Location-based addressing was used to promote iterative steps, akin to running along a tape, as well as long-distance jumps across memory. This method was advantageous for sequence-based prediction tasks. However, this type of access is not optimal for tasks that emphasize a conjunctive coding of information independent of sequence. As such, writing to memory in our model involves the use of a newly designed access module called the Least Recently Used Access (LRUA) module.
The LRUA module is a pure content-based memory writer that writes memories to either the least used memory location or the most recently used memory location. This module emphasizes accurate encoding of relevant (i.e., recent) information, and pure content-based retrieval. New information is written into rarely-used locations, preserving recently encoded information, or it is written to the last used location, which can function as an update of the memory with newer, possibly more relevant information. The distinction between these two options is accomplished with an interpolation between the previous read weights and weights scaled according to usage weights $w_t^u$. These usage weights are updated at each time-step by decaying the previous usage weights and adding the current read and write weights:

$$w_t^u \leftarrow \gamma\, w_{t-1}^u + w_t^r + w_t^w, \tag{5}$$

where $\gamma$ is a decay parameter.
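Putting this subsection together, below is a minimal sketch of an LRUA-style write step. The gated interpolation between the previous read weights and the least-used weights, and the least-used threshold, follow the description above, but the exact `alpha` parameterization and function names here are illustrative assumptions (the excerpt ends mid-section, so details beyond Eq. (5) may differ from the full paper).

```python
import numpy as np

def lrua_write(memory, key, w_r_prev, w_u_prev, alpha,
               gamma=0.95, n_reads=1):
    """Sketch of an LRUA-style write. `alpha` is assumed to be a
    learned scalar gate in (0, 1) that interpolates between writing
    to the most recently read location and the least-used one."""
    # Least-used locations: usage at or below the n-th smallest value.
    threshold = np.sort(w_u_prev)[n_reads - 1]
    w_lu = (w_u_prev <= threshold).astype(float)
    # Interpolate between previous read weights and least-used weights.
    w_w = alpha * w_r_prev + (1.0 - alpha) * w_lu
    # Write the key into memory, weighted by the write weights.
    memory = memory + np.outer(w_w, key)
    # Eq. (5): decay previous usage, add current read and write weights.
    w_u = gamma * w_u_prev + w_r_prev + w_w
    return memory, w_w, w_u
```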