to maximize $I(\hat y^*; \mathcal{D} \mid z^*, \theta)$ to prevent memorization. We can bound this mutual information by
\begin{align}
I(\hat y^*; \mathcal{D} \mid z^*, \theta)
&\ge I(x^*; \hat y^* \mid \theta, z^*) = I(x^*; \hat y^* \mid \theta) - I(x^*; z^* \mid \theta) + I(x^*; z^* \mid \hat y^*, \theta) \nonumber \\
&\ge I(x^*; \hat y^* \mid \theta) - I(x^*; z^* \mid \theta) \nonumber \\
&= I(x^*; \hat y^* \mid \theta) - \mathbb{E}_{p(x^*)\, q(z^* \mid x^*, \theta)}\left[\log \frac{q(z^* \mid x^*, \theta)}{q(z^* \mid \theta)}\right] \nonumber \\
&\ge I(x^*; \hat y^* \mid \theta) - \mathbb{E}\left[\log \frac{q(z^* \mid x^*, \theta)}{r(z^*)}\right] \nonumber \\
&= I(x^*; \hat y^* \mid \theta) - \mathbb{E}\left[D_{\mathrm{KL}}\!\left(q(z^* \mid x^*, \theta) \,\|\, r(z^*)\right)\right] \tag{2}
\end{align}
where $r(z^*)$ is a variational approximation to the marginal, and the first inequality follows from the statistical dependencies in our model (see Figure 4 and Appendix A.2 for the proof). By simultaneously minimizing $\mathbb{E}\left[D_{\mathrm{KL}}\!\left(q(z^* \mid x^*, \theta) \,\|\, r(z^*)\right)\right]$ and maximizing the mutual information $I(x^*; \hat y^* \mid \theta)$, we can implicitly encourage the model to use the task training data $\mathcal{D}$.
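When $q(z^* \mid x^*, \theta)$ is parameterized as a diagonal Gaussian with mean $\mu(x^*)$ and variance $\sigma^2(x^*)$ (a standard choice, assumed here for illustration) and $r(z^*) = \mathcal{N}(z^*; 0, I)$ as below, the penalty has the familiar closed form per test input

$$D_{\mathrm{KL}}\!\left(q(z^* \mid x^*, \theta) \,\|\, r(z^*)\right) = \frac{1}{2} \sum_j \left( \mu_j(x^*)^2 + \sigma_j(x^*)^2 - \log \sigma_j(x^*)^2 - 1 \right),$$

which is minimized when the encoder ignores $x^*$ and matches the prior.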
For non-mutually-exclusive problems, the true label $y^*$ depends on $x^*$. If the model suffers from the memorization problem and $I(x^*; \hat y^* \mid \theta) = 0$, then $q(\hat y^* \mid x^*, \theta, \mathcal{D}) = q(\hat y^* \mid x^*, \theta) = q(\hat y^* \mid \theta)$, which means the model predictions depend on neither $x^*$ nor $\mathcal{D}$. Hence, in practical problems, the predictions generated by such a model will have low accuracy.
This suggests that minimizing the training loss in Eq. (1) can increase $I(\hat y^*; \mathcal{D} \mid x^*, \theta)$ or $I(x^*; \hat y^* \mid \theta)$. Replacing the maximization of $I(x^*; \hat y^* \mid \theta)$ in Eq. (2) with minimization of the training loss results in the following regularized training objective
\begin{equation}
\frac{1}{N} \sum_i \mathbb{E}_{q(\theta \mid \mathcal{M})\, q(\phi \mid \mathcal{D}_i, \theta)}\left[ -\frac{1}{K} \sum_{(x^*, y^*) \in \mathcal{D}_i^*} \log q(\hat y^* = y^* \mid x^*, \phi, \theta) + \beta\, D_{\mathrm{KL}}\!\left(q(z^* \mid x^*, \theta) \,\|\, r(z^*)\right) \right] \tag{3}
\end{equation}
where $\log q(\hat y^* \mid x^*, \phi, \theta)$ is estimated by $\log q(\hat y^* \mid z^*, \phi, \theta)$ with $z^* \sim q(z^* \mid x^*, \theta)$, $\beta$ modulates the regularizer, and $r(z^*)$ can be set to $\mathcal{N}(z^*; 0, I)$. We refer to this regularizer as meta-regularization (MR) on the activations.
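As a concrete illustration, the following PyTorch sketch computes the bracketed per-task loss of Eq. (3); it assumes a diagonal-Gaussian encoder, and the `encoder` and `predictor` modules (the latter standing in for the adapted parameters $\phi$) are hypothetical, so this is a minimal sketch rather than the reference implementation.

import torch
import torch.nn.functional as F

def mr_activations_loss(encoder, predictor, x_test, y_test, beta):
    # q(z*|x*, theta): hypothetical encoder returns the mean and
    # log-variance of a diagonal Gaussian over the bottleneck z*.
    mu, logvar = encoder(x_test)
    # Reparameterized sample z* ~ q(z*|x*, theta).
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    # -1/K sum log q(y-hat* = y*|z*, phi, theta), with `predictor`
    # playing the role of the task-adapted model.
    nll = F.cross_entropy(predictor(z), y_test)
    # Closed-form KL(q(z*|x*, theta) || N(0, I)), averaged over the
    # K test points of the task.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1).mean()
    return nll + beta * kl

In the full objective, this quantity would additionally be averaged over tasks and over samples of $\theta$ and $\phi$.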
As we demonstrate in Section 6, we find that this regularizer performs well, but in some cases it fails to prevent the memorization problem. Our hypothesis is that in these cases the network can sidestep the information constraint by storing the prediction of $y^*$ in a part of $z^*$, which incurs only a small penalty in Eq. (3) and yields only a small lower bound in Eq. (2).
4.2 META REGULARIZATION ON WEIGHTS
Alternatively, we can penalize the task information stored in the meta-parameters $\theta$. Here, we provide an informal argument; the complete argument is in Appendix A.3. Analogous to the supervised setting (Achille & Soatto, 2018), given the meta-training dataset $\mathcal{M}$, we consider $\theta$ as a random variable, where the randomness can be introduced by training stochasticity. We model the stochasticity over $\theta$ with a Gaussian distribution $\mathcal{N}(\theta; \theta_\mu, \theta_\sigma)$ with learned mean and variance parameters per dimension (Blundell et al., 2015; Achille & Soatto, 2018). By penalizing $I(y^*_{1:N}, \mathcal{D}_{1:N}; \theta \mid x^*_{1:N})$, we can limit the information about the training tasks stored in the meta-parameters $\theta$ and thus require the network to use the task training data to make accurate predictions. We can tractably upper bound this quantity by
\begin{equation}
I(y^*_{1:N}, \mathcal{D}_{1:N}; \theta \mid x^*_{1:N}) = \mathbb{E}\left[\log \frac{q(\theta \mid \mathcal{M})}{q(\theta \mid x^*_{1:N})}\right] \le \mathbb{E}\left[D_{\mathrm{KL}}\!\left(q(\theta \mid \mathcal{M}) \,\|\, r(\theta)\right)\right], \tag{4}
\end{equation}
where $r(\theta)$ is a variational approximation to the marginal, which we set to $\mathcal{N}(\theta; 0, I)$; the inequality holds because the gap between the two sides is $\mathbb{E}\left[D_{\mathrm{KL}}\!\left(q(\theta \mid x^*_{1:N}) \,\|\, r(\theta)\right)\right] \ge 0$. In practice, we apply meta-regularization to the meta-parameters $\theta$ that are not used to adapt to the task training data and denote the other parameters as $\tilde\theta$. In this way, we control the complexity of the network that can predict the test labels without using the task training data, but we do not limit the complexity of the network that processes the task training data. Our final meta-regularized objective can be written as
\begin{equation}
\frac{1}{N} \sum_i \mathbb{E}_{q(\theta;\, \theta_\mu, \theta_\sigma)\, q(\phi \mid \mathcal{D}_i, \tilde\theta)}\left[ -\frac{1}{K} \sum_{(x^*, y^*) \in \mathcal{D}_i^*} \log q(\hat y^* = y^* \mid x^*, \phi, \theta, \tilde\theta) + \beta\, D_{\mathrm{KL}}\!\left(q(\theta;\, \theta_\mu, \theta_\sigma) \,\|\, r(\theta)\right) \right] \tag{5}
\end{equation}
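A minimal PyTorch sketch of this weight-level regularizer, under the same caveats as before: the `predict` function is hypothetical (a closure over the adapted parameters $\phi$ and the unregularized parameters $\tilde\theta$), and `theta_mu`/`theta_logvar` parameterize the Gaussian over the non-adapted meta-parameters $\theta$.

import torch
import torch.nn.functional as F

def mr_weights_loss(theta_mu, theta_logvar, predict, x_test, y_test, beta):
    # Sample theta ~ q(theta; theta_mu, theta_sigma) with the
    # reparameterization trick.
    theta = theta_mu + torch.randn_like(theta_mu) * (0.5 * theta_logvar).exp()
    # -1/K sum log q(y-hat* = y*|x*, phi, theta, theta-tilde);
    # `predict` is a hypothetical function mapping (theta, x*) to logits.
    nll = F.cross_entropy(predict(theta, x_test), y_test)
    # Closed-form KL(q(theta; theta_mu, theta_sigma) || N(theta; 0, I)).
    kl = 0.5 * (theta_mu.pow(2) + theta_logvar.exp() - theta_logvar - 1).sum()
    return nll + beta * kl

Because the KL term does not depend on the data, it can be computed once per gradient step and shared across the tasks in the batch.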