Figure 1: A deep fully-connected neural network with N+2 layers and ReLU nonlinearities. With this generic
fully connected network, we prove that, with a single step of gradient descent, the model can approximate any
function of the dataset and test input.
4 UNIVERSALITY OF THE ONE-SHOT GRADIENT-BASED LEARNER
We first introduce a proof of the universality of gradient-based meta-learning for the special case with only one training point, corresponding to one-shot learning. We denote the training datapoint as $(x, y)$, and the test input as $x^\star$. A universal learning algorithm approximator corresponds to the ability of a meta-learner to represent any function $f_{\text{target}}(x, y, x^\star)$ up to arbitrary precision.
We will proceed by construction, showing that there exists a neural network function $\hat{f}(\cdot; \theta)$ such that $\hat{f}(x^\star; \theta')$ approximates $f_{\text{target}}(x, y, x^\star)$ up to arbitrary precision, where $\theta' = \theta - \alpha \nabla_\theta \ell(y, f(x))$ and $\alpha$ is the non-zero learning rate. The proof holds for a standard multi-layer ReLU network, provided that it has sufficient depth. As we discuss in Section 6, the loss function $\ell$ cannot be any loss function, but the standard cross-entropy and mean-squared error objectives are both suitable.
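As a concrete reference point, the sketch below spells out this one-step adaptation followed by prediction on the test input for a toy linear model with a squared-error loss; the model, loss, and dimensions are illustrative assumptions, not the ReLU construction used in the proof.

```python
# Minimal sketch of the one-shot gradient-based learner: take one gradient
# step on the single training pair (x, y), then predict on the test input
# x_star.  The linear model f and squared-error loss are hypothetical
# stand-ins for illustration only.
import numpy as np

rng = np.random.default_rng(0)
d = 3
theta = rng.normal(size=(d, d))          # parameters of f(.; theta)
x, y = rng.normal(size=d), rng.normal(size=d)
x_star = rng.normal(size=d)              # test input
alpha = 0.1                              # non-zero learning rate

def f(x, theta):
    return theta @ x

# Gradient of l(y, f(x)) = 0.5 * ||f(x) - y||^2 with respect to theta.
grad_theta = np.outer(f(x, theta) - y, x)

theta_prime = theta - alpha * grad_theta     # theta' = theta - alpha * grad
prediction = f(x_star, theta_prime)          # post-update prediction on x_star
print(prediction)
```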
In this proof, we will start by presenting the form of $\hat{f}$ and deriving its value after one gradient step. Then, to show universality, we will construct a setting of the weight matrices that enables independent control of the information flow coming forward from $x$ and $x^\star$, and backward from $y$.
We will start by constructing $\hat{f}$, which, as shown in Figure 1, is a generic deep network with $N+2$ layers and ReLU nonlinearities. Note that, for a particular weight matrix $W_i$ at layer $i$, a single gradient step $W_i - \alpha \nabla_{W_i} \ell$ can only represent a rank-1 update to the matrix $W_i$. That is because the gradient of $W_i$ is the outer product of two vectors, $\nabla_{W_i} \ell = a_i b_{i-1}^T$, where $a_i$ is the error gradient with respect to the pre-synaptic activations at layer $i$, and $b_{i-1}$ is the forward post-synaptic activations at layer $i-1$. The expressive power of a single gradient update to a single weight matrix is therefore quite limited. However, if we sequence $N$ weight matrices as $\prod_{i=1}^{N} W_i$, corresponding to multiple linear layers, it is possible to acquire a rank-$N$ update to the linear function represented by $W = \prod_{i=1}^{N} W_i$. Note that deep ReLU networks act like deep linear networks when the input and pre-synaptic activations are non-negative. Motivated by this reasoning, we will construct $\hat{f}(\cdot; \theta)$ as a deep ReLU network where a number of the intermediate layers act as linear layers, which we ensure by showing that the input and pre-synaptic activations of these layers are non-negative. This allows us to simplify the analysis. The simplified form of the model is as follows:
$$\hat{f}(\cdot; \theta) = f_{\text{out}}\left(\left(\prod_{i=1}^{N} W_i\right) \phi(\cdot; \theta_{\text{ft}}, \theta_b); \theta_{\text{out}}\right),$$
where $\phi(\cdot; \theta_{\text{ft}}, \theta_b)$ represents an input feature extractor with parameters $\theta_{\text{ft}}$ and a scalar bias transformation variable $\theta_b$, $\prod_{i=1}^{N} W_i$ is a product of square linear weight matrices, $f_{\text{out}}(\cdot; \theta_{\text{out}})$ is a function at the output, and the learned parameters are $\theta := \{\theta_{\text{ft}}, \theta_b, \{W_i\}, \theta_{\text{out}}\}$. The input feature extractor and output function can be represented with fully connected neural networks with one or more hidden layers, which we know are universal function approximators, while $\prod_{i=1}^{N} W_i$ corresponds to a set of linear layers with non-negative input and activations.
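To make the decomposition concrete, here is a minimal NumPy sketch of this simplified form, with a small MLP for $\phi$, a chain of square matrices for $\prod_i W_i$, and a small MLP for $f_{\text{out}}$; the layer sizes, the way $\theta_b$ enters $\phi$, and the omission of the non-negativity constraint on the intermediate activations are simplifying assumptions made purely for illustration.

```python
# Illustrative sketch of f_hat(.; theta) = f_out((prod_i W_i) phi(.); theta_out).
# Sizes and the two-layer MLPs for phi and f_out are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
d_in, d, d_out, N = 3, 8, 1, 4

relu = lambda v: np.maximum(v, 0.0)

# phi(.; theta_ft, theta_b): feature extractor plus a scalar bias transformation.
A1, A2 = rng.normal(size=(d, d_in)), rng.normal(size=(d, d))
theta_b = np.zeros(1)
def phi(x):
    return relu(A2 @ relu(A1 @ x)) + theta_b

# Chain of square weight matrices W_1, ..., W_N.
Ws = [rng.normal(size=(d, d)) for _ in range(N)]

# f_out(.; theta_out): output function, here another small MLP.
B1, B2 = rng.normal(size=(d, d)), rng.normal(size=(d_out, d))
def f_out(h):
    return B2 @ relu(B1 @ h)

def f_hat(x):
    h = phi(x)
    for W in Ws:                       # corresponds to (prod_i W_i) phi(x)
        h = W @ h
    return f_out(h)

print(f_hat(rng.normal(size=d_in)))
```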
Next, we derive the form of the post-update prediction $\hat{f}(x^\star; \theta')$. Let $z = \left(\prod_{i=1}^{N} W_i\right) \phi(x; \theta_{\text{ft}}, \theta_b)$, and the error gradient $\nabla_z \ell = e(x, y)$. Then, the gradient with respect to each weight matrix $W_i$ is:
$$\nabla_{W_i} \ell(y, \hat{f}(x; \theta)) = \left(\prod_{j=1}^{i-1} W_j\right)^T e(x, y)\, \phi(x; \theta_{\text{ft}}, \theta_b)^T \left(\prod_{j=i+1}^{N} W_j\right)^T.$$
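As a quick numerical check of this expression (and of the rank-1 observation above), the sketch below compares it to a finite-difference gradient, assuming a squared-error loss $\ell(y, z) = \tfrac{1}{2}\|z - y\|^2$ so that $e(x, y) = z - y$; the dimensions are arbitrary and the check is purely illustrative.

```python
# Compare the analytic gradient above to a finite-difference estimate,
# assuming l(y, z) = 0.5 * ||z - y||^2 (so e(x, y) = z - y).
import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 3, 4, 1e-6
Ws = [rng.normal(size=(d, d)) for _ in range(N)]
phi = rng.normal(size=d)                 # stands in for phi(x; theta_ft, theta_b)
y = rng.normal(size=d)

def prod(mats):
    out = np.eye(d)
    for W in mats:                       # left-to-right product W_1 W_2 ...
        out = out @ W
    return out

def loss(mats):
    z = prod(mats) @ phi
    return 0.5 * float(np.sum((z - y) ** 2))

e = prod(Ws) @ phi - y                   # error gradient e(x, y) at z

for i in range(N):
    # Analytic gradient: (prod_{j<i} W_j)^T e(x, y) phi^T (prod_{j>i} W_j)^T,
    # an outer product of two vectors and hence rank 1.
    analytic = np.outer(prod(Ws[:i]).T @ e, prod(Ws[i+1:]) @ phi)
    assert np.linalg.matrix_rank(analytic) == 1

    numeric = np.zeros((d, d))           # central-difference gradient w.r.t. W_i
    for r in range(d):
        for c in range(d):
            plus = [W.copy() for W in Ws]
            minus = [W.copy() for W in Ws]
            plus[i][r, c] += eps
            minus[i][r, c] -= eps
            numeric[r, c] = (loss(plus) - loss(minus)) / (2 * eps)
    assert np.allclose(analytic, numeric, atol=1e-5)
```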
Therefore, the post-update value of $\prod_{i=1}^{N} W_i' = \prod_{i=1}^{N} \left(W_i - \alpha \nabla_{W_i} \ell\right)$ is given by
$$\prod_{i=1}^{N} W_i - \alpha \sum_{i=1}^{N} \left(\prod_{j=1}^{i-1} W_j\right) \left(\prod_{j=1}^{i-1} W_j\right)^T e(x, y)\, \phi(x; \theta_{\text{ft}}, \theta_b)^T \left(\prod_{j=i+1}^{N} W_j\right)^T \left(\prod_{j=i+1}^{N} W_j\right) - O(\alpha^2),$$
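To sanity-check this expansion, the short NumPy sketch below compares the exact product of the updated matrices against the expression above with the $O(\alpha^2)$ term dropped, again assuming a squared-error loss so that $e(x, y) = z - y$, and confirms that the residual shrinks quadratically as $\alpha$ is reduced; all sizes and values are arbitrary illustrative choices.

```python
# Numerical sanity check of the first-order expansion of prod_i (W_i - alpha * grad_i),
# assuming a squared-error loss so that e(x, y) = z - y.  Sizes are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 4
Ws = [rng.normal(size=(d, d)) for _ in range(N)]
phi = rng.normal(size=d)                 # stands in for phi(x; theta_ft, theta_b)
y = rng.normal(size=d)

def prod(mats):
    out = np.eye(d)
    for W in mats:                       # left-to-right product W_1 W_2 ...
        out = out @ W
    return out

e = prod(Ws) @ phi - y                   # error gradient e(x, y) at z

# Gradient with respect to each W_i (same expression as derived above).
grads = [np.outer(prod(Ws[:i]).T @ e, prod(Ws[i+1:]) @ phi) for i in range(N)]

# The alpha-linear correction term appearing in the expansion above.
correction = sum(
    prod(Ws[:i]) @ prod(Ws[:i]).T @ np.outer(e, phi) @ prod(Ws[i+1:]).T @ prod(Ws[i+1:])
    for i in range(N))

def residual(alpha):
    exact = prod([W - alpha * g for W, g in zip(Ws, grads)])
    return np.max(np.abs(exact - (prod(Ws) - alpha * correction)))

# The residual scales as alpha^2: the ratio is roughly (1e-4 / 1e-5)**2 = 100.
print(residual(1e-4) / residual(1e-5))
```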