queried each of the shadow models with its training data
(members), as well as unseen data (non-members) to re-
trieve the prediction scores of the shadow models. Multiple
binary classifiers were then trained, one per class label, to
predict the membership status.
Salem et al. [25] also exploited prediction scores and
trained a single class-agnostic neural network to distinguish
between members and non-members. In contrast to Shokri
et al. [26], their approach relies on a single shadow model.
The input of their attack model h consists of the k highest
prediction scores in descending order.
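The following sketch illustrates such a score-based attack (assuming NumPy and scikit-learn; the choice of k = 3 and of an MLP attack classifier are illustrative assumptions, not the exact setup of [25]):

import numpy as np
from sklearn.neural_network import MLPClassifier

def top_k_features(scores, k=3):
    # Keep the k highest prediction scores of each sample, sorted descending.
    return np.sort(scores, axis=1)[:, ::-1][:, :k]

def fit_attack_model(member_scores, nonmember_scores, k=3):
    # member_scores / nonmember_scores: softmax outputs of the shadow model
    # on its own training data (members) and on unseen data (non-members).
    X = np.vstack([top_k_features(member_scores, k),
                   top_k_features(nonmember_scores, k)])
    y = np.concatenate([np.ones(len(member_scores)),       # 1 = member
                        np.zeros(len(nonmember_scores))])  # 0 = non-member
    attack_model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    attack_model.fit(X, y)
    return attack_model

At attack time, the target model's scores for a known input are reduced to the same top-k representation and passed to the fitted classifier.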
Instead of focusing solely on the scores, Yeom et al. [33]
took advantage of the fact that the loss of a model is typically
lower on members than on non-members and fit a threshold to
the loss values. More recent approaches [3, 16] focused
on label-only attacks where only the predicted label for a
known input is observed.
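A minimal sketch of such a loss-threshold attack (the cross-entropy loss and the concrete threshold, e.g., the shadow model's average training loss, are assumptions for illustration):

import numpy as np

def cross_entropy(scores, labels, eps=1e-12):
    # Per-sample cross-entropy loss from softmax scores and true integer labels.
    return -np.log(scores[np.arange(len(labels)), labels] + eps)

def loss_threshold_attack(scores, labels, threshold):
    # Predict membership (1) whenever the per-sample loss falls below the threshold.
    return (cross_entropy(scores, labels) < threshold).astype(int)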
Most defense strategies either try to decrease the in-
formative value of prediction scores or reduce overfitting.
The informative value can be decreased by applying a high
temperature to the softmax function to increase the entropy of
the scores [26], by adding carefully crafted noise to the predictions [9],
or by outputting only the predicted label without any score [26].
Various regularization techniques were proposed to reduce
overfitting and thus the accuracy gap, e.g., L2 regularization
[26] and dropout [26, 25].
3 Overconfidence of Neural Networks
Neural networks usually output prediction scores, e.g.,
by applying a softmax function. To account for model uncertainty,
the prediction scores should ideally reflect the true probability
of a correct prediction, which is usually not the case. Aligning
the scores with these probabilities is referred to as model
calibration. Guo et al. [5] demonstrated that modern networks
tend to be overconfident in their predictions.
Generally, as Hein et al. [7] noted, many cases have been
reported in which neural networks produce high prediction
scores far away from the training data, e.g., on fooling images,
for out-of-distribution (OOD) images, in a medical diagnosis
task, but also on the original task. Hein et al. then proved
that ReLU networks are overconfident even on samples far away
from the training data.
Scaling the inputs of a ReLU network in fact allows one
to produce arbitrarily high prediction scores. Existing ap-
proaches to mitigate overconfidence can be grouped into
two categories: post-processing methods applied on top of
trained models and regularization methods modifying the
training process.
As a post-processing method, Guo et al. [5] proposed
temperature scaling using a single temperature parameter T
for scaling down the pre-softmax logits for all classes. The
larger T is, the more the resulting scores approach a uniform
distribution and the higher their entropy becomes.
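A minimal sketch of this post-processing step (assuming PyTorch; fitting T on held-out data is only indicated in the comment):

import torch
import torch.nn.functional as F

def temperature_scale(logits: torch.Tensor, T: float) -> torch.Tensor:
    # Divide the pre-softmax logits of all classes by a single temperature T
    # before applying the softmax; T > 1 softens the resulting scores.
    return F.softmax(logits / T, dim=1)

# In practice, T is chosen by minimizing the negative log-likelihood on a
# held-out validation set while keeping the network weights fixed.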
Kristiadi et al. [13] proposed a Bayesian approach. They
fixed the weights for all layers of a trained network except
the last one and used a Kronecker-factored Laplace approx-
imation (LA) on the weights of the final layer. Müller et al.
[17] demonstrated that label smoothing regularization [28]
not only improves the generalization of a model but also
implicitly leads to better model calibration. It reduces the
difference between the highest and the other logit values,
thus reducing overconfident predictions. The calibration of
a model can be measured by the expected calibration error
(ECE) [18]. It partitions the test predictions into confidence bins
and computes a weighted average of the absolute difference between
the accuracy and the average prediction score within each bin.
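Written out (following the binning formulation of [18], with the bin notation chosen here for illustration), the $n$ test predictions are partitioned into $M$ equally spaced confidence bins $B_m$, and the error is

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \bigl| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \bigr|,$$

where $\mathrm{acc}(B_m)$ denotes the accuracy and $\mathrm{conf}(B_m)$ the average prediction score within bin $B_m$.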
4 Do Not Trust Prediction Scores for MIAs
In this section, we will show that prediction scores cannot be
trusted for MIAs because score-based MIAs make membership
decisions mainly based on the maximum prediction score. As a
first step, we introduce our proposition and then verify our
claims empirically.
Formally, a neural network $f(x)$ using ReLU activations
decomposes the unrestricted input space $\mathbb{R}^m$ into a
finite set of polytopes (linear regions). We can then interpret
$f(x)$ as a piecewise affine function that is affine within each
polytope. Due to the limited number of polytopes, the outer
polytopes extend to infinity, which allows one to arbitrarily
increase the prediction scores by scaling inputs with a large
constant $\delta$ [7]. Applying these findings to MIAs results
in the following proposition:
Proposition 1. Given a ReLU-classifier, we can force al-
most any non-member input to be classified as a member
by score-based MIAs simply through scaling it by a large
constant.
Proof. Let $f \colon \mathbb{R}^m \to \mathbb{R}^d$ be a piecewise
affine ReLU-classifier. We define a score-based MIA inference model
$h \colon \mathbb{R}^d \to \{0, 1\}$, with $1$ indicating a
classification as member. For almost any input $x \in \mathbb{R}^m$
and a sufficiently small $\epsilon > 0$, it follows from
$\max_{i=1,\dots,d} f(x)_i \geq 1 - \epsilon$ that $h(f(x)) = 1$.
Since $\lim_{\delta \to \infty} \max_{i=1,\dots,d} f(\delta x)_i = 1$,
it already holds that $\lim_{\delta \to \infty} h(f(\delta x)) = 1$.
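The scaling effect behind the proof can be reproduced with a small numerical sketch (assuming PyTorch; the architecture and the randomly initialized weights are illustrative and not the models evaluated later):

import torch
import torch.nn as nn

# A small ReLU classifier; even untrained weights suffice to observe the effect.
f = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                  nn.Linear(64, 64), nn.ReLU(),
                  nn.Linear(64, 10))

x = torch.randn(1, 32)  # an arbitrary (non-member) input
with torch.no_grad():
    for delta in [1, 10, 100, 1000]:
        scores = torch.softmax(f(delta * x), dim=1)
        # The maximum prediction score approaches 1 as delta grows, so a
        # score-based attack model h would classify delta * x as a member.
        print(delta, scores.max().item())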
By scaling the whole non-member dataset, one can force the
false-positive rate (FPR) to be close to 100%. However, the
proposition holds only for ReLU networks and unbounded inputs,
i.e., inputs not restricted to the range $[0, 1]^m$. Next, we
empirically show that one cannot trust prediction scores for
MIAs in more general settings, without requiring input scaling
and with other activation functions.