a source model to be robust to such domain-shift without
further adaptation. Many knowledge transfer methods have
been studied [34], [58] to boost target domain performance.
However, as with TL, vanilla DA and DG do not use a meta-objective to optimize ‘how to learn’ across domains. Meanwhile, meta-learning methods can be used to perform both DA [59] and DG [42] (see Sec. 5.8).
Continual learning (CL) Continual or lifelong learning
[60]–[62] refers to the ability to learn on a sequence of tasks drawn from a potentially non-stationary distribution, and in particular seeks to do so while accelerating the learning of new tasks and without forgetting old ones. Similarly to meta-learning, a task distribution is considered, and the goal is partly to accelerate learning of a target task. However, most continual learning methodologies are not meta-learning methodologies, since this meta-objective is not solved for explicitly. Nevertheless, meta-learning provides a potential
framework to advance continual learning, and a few recent
studies have begun to do so by developing meta-objectives
that encode continual learning performance [63]–[65].
Multi-Task Learning (MTL) aims to jointly learn sev-
eral related tasks, to benefit from regularization due to
parameter sharing and the diversity of the resulting shared
representation [66]–[68], as well as compute/memory sav-
ings. Like TL, DA, and CL, conventional MTL is a single-
level optimization without a meta-objective. Furthermore,
the goal of MTL is to solve a fixed number of known tasks,
whereas the point of meta-learning is often to solve unseen
future tasks. Nonetheless, meta-learning can be brought in
to benefit MTL, e.g. by learning the relatedness between
tasks [69], or how to prioritise among multiple tasks [70].
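To make the contrast with meta-learning concrete, the following sketch (in PyTorch, with illustrative shapes, task count, and random data, all assumptions of ours) shows conventional hard parameter sharing: two task heads on a shared trunk trained with a fixed summed loss, i.e. a single-level optimization with no meta-objective.

```python
import torch
import torch.nn as nn

class HardSharedMTL(nn.Module):
    """Hard parameter sharing: one shared trunk, one head per task."""
    def __init__(self, in_dim=32, hidden=64, n_classes=(10, 5)):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, c) for c in n_classes])

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))

model = HardSharedMTL()
loss_fn = nn.CrossEntropyLoss()
x0, y0 = torch.randn(8, 32), torch.randint(0, 10, (8,))   # task 0 batch
x1, y1 = torch.randn(8, 32), torch.randint(0, 5, (8,))    # task 1 batch
# Single-level objective over a fixed set of known tasks; nothing
# here learns 'how to learn' or how the tasks should share capacity.
loss = loss_fn(model(x0, 0), y0) + loss_fn(model(x1, 1), y1)
loss.backward()
```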
Hyperparameter Optimization (HO) is within the remit
of meta-learning, in that hyperparameters like learning rate
or regularization strength describe ‘how to learn’. Here we
include HO tasks that define a meta-objective that is trained
end-to-end with neural networks, such as gradient-based
hyperparameter learning [69], [71] and neural architecture
search [18]. But we exclude other approaches like random
search [72] and Bayesian Hyperparameter Optimization
[73], which are rarely considered to be meta-learning.
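As a concrete illustration of the former, the sketch below treats a learning rate as a differentiable hyperparameter ω and updates it by backpropagating a validation loss through an inner SGD step. The linear model, data, and step counts are illustrative assumptions, not a reference implementation of [69], [71].

```python
import torch

torch.manual_seed(0)
w = torch.randn(5, requires_grad=True)              # model parameters θ
log_lr = torch.zeros(1, requires_grad=True)         # hyperparameter ω = log α
hyper_opt = torch.optim.Adam([log_lr], lr=1e-2)

x_tr, y_tr = torch.randn(16, 5), torch.randn(16)    # training split
x_val, y_val = torch.randn(16, 5), torch.randn(16)  # validation split

for _ in range(100):
    # Inner step: θ' = θ - α ∇θ L_train(θ), kept in the autograd graph
    train_loss = ((x_tr @ w - y_tr) ** 2).mean()
    g, = torch.autograd.grad(train_loss, w, create_graph=True)
    w_new = w - log_lr.exp() * g
    # Outer (meta-)objective: validation loss of the updated θ'
    val_loss = ((x_val @ w_new - y_val) ** 2).mean()
    hyper_opt.zero_grad()
    val_loss.backward()                              # gradient w.r.t. log α
    hyper_opt.step()
    w = w_new.detach().requires_grad_(True)          # commit the inner step
```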
Hierarchical Bayesian Models (HBM) involve Bayesian
learning of parameters θ under a prior p(θ|ω). The prior
is written as a conditional density on some other variable
ω which has its own prior p(ω). Hierarchical Bayesian
models feature strongly as models for grouped data D =
{D_i | i = 1, 2, . . . , M}, where each group i has its own θ_i. The full model is [∏_{i=1}^{M} p(D_i|θ_i) p(θ_i|ω)] p(ω). The levels of hierarchy can be increased further; in particular ω can itself be parameterized, and hence p(ω) can be learnt.
Learning is usually full-pipeline, but using some form of Bayesian marginalisation to compute the posterior over ω: P(ω|D) ∝ p(ω) ∏_{i=1}^{M} ∫ dθ_i p(D_i|θ_i) p(θ_i|ω). The ease of
doing the marginalisation depends on the model: in some
(e.g. Latent Dirichlet Allocation [74]) the marginalisation is
exact due to the choice of conjugate exponential models,
in others (see e.g. [75]), a stochastic variational approach is
used to calculate an approximate posterior, from which a
lower bound to the marginal likelihood is computed.
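The following toy sketch estimates this posterior by Monte Carlo for an assumed Gaussian model (θ_i ∼ N(ω, 1), observations x ∼ N(θ_i, 1), and a N(0, 1) prior on ω); the data and all distributional choices are illustrative assumptions.

```python
import numpy as np

# Monte Carlo sketch of the hierarchical marginalisation
#   P(ω|D) ∝ p(ω) ∏_{i=1}^M ∫ dθ_i p(D_i|θ_i) p(θ_i|ω)
# for a toy Gaussian model (assumption): θ_i ~ N(ω, 1), x ~ N(θ_i, 1).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=20) for m in (1.8, 2.2, 2.0)]

def log_posterior(omega, n_samples=5000):
    """Estimate log p(ω) + Σ_i log ∫ dθ_i p(D_i|θ_i) p(θ_i|ω), up to a constant."""
    total = -0.5 * omega ** 2                        # log p(ω) for a N(0,1) prior
    for D_i in groups:
        theta = rng.normal(omega, 1.0, size=n_samples)   # θ_i ~ p(θ_i|ω)
        # Per-sample log-likelihood log p(D_i|θ), up to an additive constant
        ll = -0.5 * ((D_i[None, :] - theta[:, None]) ** 2).sum(axis=1)
        # log-mean-exp: a numerically stable estimate of the group integral
        total += np.logaddexp.reduce(ll) - np.log(n_samples)
    return total

# Grid search over ω: the posterior concentrates near the group means (~2.0).
grid = np.linspace(0.0, 4.0, 41)
print("argmax_ω ≈", grid[np.argmax([log_posterior(w) for w in grid])])
```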
Bayesian hierarchical models provide a valuable view-
point for meta-learning, by providing a modeling rather
than an algorithmic framework for understanding the meta-
learning process. In practice, prior work in HBMs has typically focused on learning simple tractable models θ, while
most meta-learning work considers complex inner-loop
learning processes, involving many iterations. Nonetheless,
some meta-learning methods like MAML [16] can be under-
stood through the lens of HBMs [76].
AutoML: AutoML [31]–[33] is a rather broad umbrella
for approaches aiming to automate parts of the machine
learning process that are typically manual, such as data
preparation, algorithm selection, hyperparameter tuning,
and architecture search. AutoML often makes use of numer-
ous heuristics outside the scope of meta-learning as defined
here, and focuses on tasks such as data cleaning that are
less central to meta-learning. However, AutoML sometimes
makes use of end-to-end optimization of a meta-objective,
so meta-learning can be seen as a specialization of AutoML.
3 TAXONOMY
3.1 Previous Taxonomies
Previous categorizations [77], [78] of meta-learning methods have tended to produce a three-way taxonomy across
optimization-based methods, model-based (or black box)
methods, and metric-based (or non-parametric) methods.
Optimization Optimization-based methods include those
where the inner-level task (Eq. 6) is literally solved as
an optimization problem, and focus on extracting meta-
knowledge ω required to improve optimization perfor-
mance. A famous example is MAML [16], which aims to
learn the initialization ω = θ_0, such that a small number
of inner steps produces a classifier that performs well on
validation data. This is also performed by gradient descent,
differentiating through the updates of the base model. More
elaborate alternatives also learn step sizes [79], [80] or
train recurrent networks to predict steps from gradients
[19], [39], [81]. Meta-optimization by gradient descent over long inner optimizations leads to several compute and memory challenges, which are discussed in Section 6. A unified view of gradient-based meta-learning, expressing many existing methods as special cases of a generalized inner-loop meta-learning framework, has been proposed [82].
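As a minimal illustration of the optimization-based family, the sketch below meta-learns an initialization in the MAML style on an assumed toy 1-D regression task family, using a single inner gradient step and differentiating through it; the model, task distribution, and step counts are illustrative assumptions, not the setup of [16].

```python
import torch

torch.manual_seed(0)
theta0 = torch.zeros(1, requires_grad=True)          # meta-parameter ω = θ_0
meta_opt = torch.optim.SGD([theta0], lr=1e-2)
inner_lr = 0.1

def sample_task():
    """Toy 1-D regression task y = a*x with a random slope a (assumption)."""
    a = torch.rand(1) * 4.0 + 1.0
    x_tr, x_val = torch.randn(10, 1), torch.randn(10, 1)
    return (x_tr, a * x_tr), (x_val, a * x_val)

for step in range(500):
    meta_opt.zero_grad()
    for _ in range(4):                               # a meta-batch of tasks
        (x_tr, y_tr), (x_val, y_val) = sample_task()
        # Inner loop: one gradient step from the shared initialization,
        # keeping the graph so the meta-gradient can flow through it.
        train_loss = ((x_tr * theta0 - y_tr) ** 2).mean()
        g, = torch.autograd.grad(train_loss, theta0, create_graph=True)
        theta_task = theta0 - inner_lr * g
        # Outer loop: validation loss of the adapted parameters,
        # accumulated into θ_0.grad across the meta-batch.
        val_loss = ((x_val * theta_task - y_val) ** 2).mean()
        val_loss.backward()
    meta_opt.step()
```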
Black Box / Model-based In model-based (or black-box) methods, the inner learning step (Eq. 6, Eq. 4) is wrapped up
in the feed-forward pass of a single model, as illustrated
in Eq. 7. The model embeds the current dataset D into
activation state, with predictions for test data being made
based on this state. Typical architectures include recurrent
networks [39], [51], convolutional networks [38] or hyper-
networks [83], [84] that embed training instances and labels
of a given task to define a predictor for test samples. In this
case all the inner-level learning is contained in the activation
states of the model and is entirely feed-forward. Outer-
level learning is performed with ω containing the CNN,
RNN or hypernetwork parameters. The outer and inner-
level optimizations are tightly coupled as ω and D directly
specify θ. Memory-augmented neural networks [85] use an
explicit storage buffer and can be seen as a model-based