Meta-learning for Few-shot Natural Language Processing: A Survey
Wenpeng Yin
Salesforce Research
wyin@salesforce.com
Abstract
Few-shot natural language processing (NLP) refers to NLP tasks that are accompanied by only a handful of labeled examples. This is a real-world challenge that an AI system must learn to handle. The usual remedies are to collect more auxiliary information or to develop a more efficient learning algorithm. However, general gradient-based optimization in high-capacity models, when training from scratch, requires many parameter-update steps over a large number of labeled examples to perform well (Snell et al., 2017).
If the target task itself cannot provide more information, how about collecting more tasks equipped with rich annotations to help the model learn? The goal of meta-learning is to train a model on a variety of richly annotated tasks such that it can solve a new task using only a few labeled samples. The key idea is to train the model's initial parameters such that the model reaches maximal performance on a new task after those parameters have been updated through zero or a couple of gradient steps.
There are already some surveys of meta-learning, such as (Vilalta and Drissi, 2002; Vanschoren, 2018; Hospedales et al., 2020). Nevertheless, this paper focuses on the NLP domain, especially few-shot applications. We try to provide clearer definitions, a summary of progress, and the common datasets for applying meta-learning to few-shot NLP.
1 What is meta-learning?
To solve a new task that has only a few examples, meta-learning aims to build efficient algorithms (e.g., ones that need only a few or even no task-specific fine-tuning steps) that can learn the new task quickly. Conventionally, we train a task-specific model by iterating over the task-specific labeled examples; in text classification, for instance, each input sentence is one training example. In contrast, the meta-learning framework treats tasks as training examples: to solve a new task, we first collect many tasks, treat each as a training example, and train a model to adapt to all of those training tasks; this model is then expected to work well on the new task.
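The contrast can be made concrete with a short sketch. Below, the `Task` container and the sampling helpers are illustrative names rather than part of any established library; the point is only that meta-learning samples whole tasks where conventional learning samples labeled sentences.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Example = Tuple[str, int]  # a (sentence, label) pair

@dataclass
class Task:
    """One task plays the role that one labeled example
    plays in conventional supervised learning."""
    train_examples: List[Example]
    test_examples: List[Example]

def sample_example(dataset: List[Example]) -> Example:
    """Conventional learning: draw a labeled sentence from one task."""
    return random.choice(dataset)

def sample_task(task_pool: List[Task]) -> Task:
    """Meta-learning: draw a whole task from a pool of tasks."""
    return random.choice(task_pool)
```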
In regular text classification tasks, we usually assume that the training sentences and the test sentences come from the same distribution. Similarly, meta-learning assumes that the training tasks and the new task are from the same distribution of tasks $p(\mathcal{T})$. During meta-training, a task $\mathcal{T}_i$ is sampled from $p(\mathcal{T})$, the model is trained with $K$ samples, and then tested on the test set of $\mathcal{T}_i$. The test error on the sampled task $\mathcal{T}_i$ serves as the training error of the meta-learning process at the current $i$-th iteration$^1$. After meta-training, the new task, also sampled from $p(\mathcal{T})$, measures the model's performance after learning from $K$ samples.
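In code, this procedure resembles the following first-order, MAML-style sketch (the full algorithm of Finn et al. (2017) backpropagates through the inner update; the first-order variant below avoids second-order gradients for brevity). The `sample_task` function, its `support`/`query` accessors, the loss, and the learning rates are all illustrative assumptions, not a reference implementation.

```python
import copy
import torch
import torch.nn.functional as F

def meta_train(model, sample_task, n_iterations=1000,
               inner_lr=1e-2, outer_lr=1e-3, K=5):
    """First-order sketch of meta-training: the test error on each
    sampled task T_i is used as the meta-level training loss."""
    outer_opt = torch.optim.Adam(model.parameters(), lr=outer_lr)
    for _ in range(n_iterations):
        task = sample_task()                    # T_i ~ p(T)
        support_x, support_y = task.support(K)  # K labeled samples
        query_x, query_y = task.query()         # the task's test set

        # Inner loop: adapt a copy of the model on the support set.
        learner = copy.deepcopy(model)
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        support_loss = F.cross_entropy(learner(support_x), support_y)
        inner_opt.zero_grad()
        support_loss.backward()
        inner_opt.step()

        # Outer loop: the adapted model's "test error" on T_i
        # becomes the training signal for the initial parameters.
        learner.zero_grad()
        test_loss = F.cross_entropy(learner(query_x), query_y)
        test_loss.backward()
        outer_opt.zero_grad()
        for p, lp in zip(model.parameters(), learner.parameters()):
            # First-order approximation: reuse the adapted model's
            # gradients for the original parameters.
            p.grad = lp.grad.clone()
        outer_opt.step()
    return model
```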
Since the new task only has $K$ labeled examples and a large set of unlabeled test instances, each training task also keeps merely $K$ labeled examples during training. This ensures that the training examples (i.e., the training tasks here) have the same distribution as the test example (i.e., the new task here). Usually, the $K$ labeled examples are called the “support set”.
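As an illustration, the following sketch builds one N-way K-shot training episode in which each class contributes exactly $K$ support examples, mirroring what the new task will provide. The `label_to_sentences` mapping and the default sizes are hypothetical.

```python
import random
from typing import Dict, List, Tuple

def make_episode(label_to_sentences: Dict[str, List[str]],
                 N: int = 5, K: int = 2, n_query: int = 3):
    """Build an N-way K-shot episode: a support set with K labeled
    sentences per class, plus a disjoint query (test) set."""
    classes = random.sample(sorted(label_to_sentences), N)
    support: List[Tuple[str, str]] = []
    query: List[Tuple[str, str]] = []
    for label in classes:
        sents = random.sample(label_to_sentences[label], K + n_query)
        support += [(s, label) for s in sents[:K]]
        query += [(s, label) for s in sents[K:]]
    return support, query
```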
To describe meta-learning at a higher level: meta-learning does not learn how to solve a specific task. It successively learns to solve many tasks. Each time it learns a new task, it becomes better at learning new tasks: it learns to learn if “its performance at each task improves with experience and with the number of tasks” (Thrun and Pratt, 1998).
Meta-learning vs. Transfer learning. Conventionally, transfer learning uses past experience of a
$^1$ Here the “test error” is the training loss, because what we really care about is the test performance on the target task.