Proposition 5. When $p_n(y|x)$ is a uniform distribution, Eq. (9) equals $p_d(y|x)$.
Proof. This is described in Appendix C of the supplemental material.
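As a quick sanity check of Proposition 5 (ours, not part of the paper), the following sketch evaluates the NS objective distribution for a toy label set, assuming Eq. (9) has the form $p_d(y|x)\big/\bigl(p_n(y|x)\sum_{y_i \in Y} p_d(y_i|x)/p_n(y_i|x)\bigr)$ implied by the definition of $T_{x,y}$ in Section 4.1; with a uniform $p_n(y|x)$ it recovers $p_d(y|x)$ exactly. The function name and the toy numbers are ours.

```python
import numpy as np

def ns_objective(p_d, p_n):
    # Assumed Eq. (9) form of the NS objective distribution for one context x:
    # p(y|x) = p_d(y|x) / (p_n(y|x) * sum_i p_d(y_i|x) / p_n(y_i|x)).
    return p_d / (p_n * np.sum(p_d / p_n))

p_d = np.array([0.5, 0.3, 0.15, 0.05])           # toy data distribution, |Y| = 4
p_n_uniform = np.full_like(p_d, 1.0 / len(p_d))  # uniform noise distribution

print(ns_objective(p_d, p_n_uniform))            # [0.5 0.3 0.15 0.05] = p_d
```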
Dyer (2014) indicated that NS is equivalent to NCE when $\nu = |Y|$ and $p_n(y|x)$ is uniform. However, as we showed, the value of $\nu$ has no effect on the objective distribution, because Eq. (9) is independent of $\nu$.
3.1.2 NS with Frequency-based Noise
In the original setting of NS (Mikolov et al., 2013), the authors chose as $p_n(y|x)$ a unigram distribution of $y$, which is independent of $x$. Such a frequency-based distribution is computed from frequencies in a corpus and is independent of the model parameters $\theta$. In this case, unlike the case of a uniform distribution, $p_n(y|x)$ remains on the right-hand side of Eq. (9), so $p_\theta(y|x)$ decreases when $p_n(y|x)$ increases. Thus, we can interpret frequency-based noise as a type of smoothing of $p_d(y|x)$. The smoothing of NS w/ Freq decreases the importance of high-frequency labels in the training data, which helps in learning more general vector representations that can be used for various tasks as pre-trained vectors. Since we can expect pre-trained vectors to work as a prior (Erhan et al., 2010) that prevents models from overfitting, we tried NS w/ Freq for pre-training KGE models in our experiments.
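To illustrate the smoothing effect numerically (our sketch, not from the paper), the snippet below evaluates the same assumed Eq. (9) objective under a unigram-style $p_n(y|x)$: labels with high noise probability receive a smaller share of the objective than they have under $p_d(y|x)$. All numbers are illustrative.

```python
import numpy as np

def ns_objective(p_d, p_n):
    # Assumed Eq. (9) form of the NS objective distribution for one context x.
    return p_d / (p_n * np.sum(p_d / p_n))

p_d = np.array([0.5, 0.3, 0.15, 0.05])       # data distribution for one x
p_freq = np.array([0.6, 0.25, 0.1, 0.05])    # unigram (frequency-based) noise

print(ns_objective(p_d, p_freq))
# -> [0.184 0.265 0.331 0.221]: the most frequent label drops from 0.5 to ~0.18,
#    i.e. frequency-based noise acts as a smoothing of p_d(y|x).
```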
3.1.3 Self-Adversarial NS
Sun et al. (2019) recently proposed SANS, which uses $p_\theta(y|x)$ for generating negative samples. By replacing $p_n(y|x)$ with $p_\theta(y|x)$, the objective distribution when using SANS is as follows:

$p_\theta(y|x) = \dfrac{p_d(y|x)/p_{\hat{\theta}}(y|x)}{\sum_{y_i \in Y} p_d(y_i|x)/p_{\hat{\theta}}(y_i|x)}$, (10)
where $\hat{\theta}$ is the parameter set updated in the previous iteration. Because both the left-hand and right-hand sides of Eq. (10) include $p_\theta(y|x)$, we cannot obtain an analytical solution of $p_\theta(y|x)$ from this equation. However, we can consider special cases of $p_\theta(y|x)$ to gain an understanding of Eq. (10). At the beginning of training, $p_\theta(y|x)$ follows a discrete uniform distribution $u\{1,|Y|\}$ because $\theta$ is randomly initialized. In this situation, when we set $p_{\hat{\theta}}(y|x)$ in Eq. (10) to a discrete uniform distribution $u\{1,|Y|\}$, $p_\theta(y|x)$ becomes

$p_\theta(y|x) = p_d(y|x)$. (11)
Next, when we set $p_{\hat{\theta}}(y|x)$ in Eq. (10) to $p_d(y|x)$, $p_\theta(y|x)$ becomes

$p_\theta(y|x) = u\{1,|Y|\}$. (12)
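The two special cases in Eqs. (11) and (12) can be checked with a small numerical sketch (ours; it assumes Eq. (10) is the normalized ratio $p_d(y|x)/p_{\hat{\theta}}(y|x)$ as written above, and uses toy values):

```python
import numpy as np

def sans_objective(p_d, p_prev):
    # Assumed Eq. (10): objective distribution proportional to
    # p_d(y|x) / p_theta_hat(y|x), normalized over Y, for one context x.
    ratio = p_d / p_prev
    return ratio / ratio.sum()

p_d = np.array([0.5, 0.3, 0.15, 0.05])
uniform = np.full_like(p_d, 1.0 / len(p_d))

print(sans_objective(p_d, uniform))  # = p_d          (Eq. (11))
print(sans_objective(p_d, p_d))      # = u{1, |Y|}    (Eq. (12))
```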
In actual mini-batch training, $\theta$ is iteratively updated for every batch of data. Because $p_\theta(y|x)$ converges to $u\{1,|Y|\}$ when $p_{\hat{\theta}}(y|x)$ is close to $p_d(y|x)$, and $p_\theta(y|x)$ converges to $p_d(y|x)$ when $p_{\hat{\theta}}(y|x)$ is close to $u\{1,|Y|\}$, we can approximately regard the objective distribution of SANS as a mixture of $p_d$ and $u\{1,|Y|\}$. Thus, we can represent the objective distribution of $p_\theta(y|x)$ as

$p_\theta(y|x) \approx (1-\lambda)\,p_d(y|x) + \lambda\, u\{1,|Y|\}$, (13)
where $\lambda$ is a hyper-parameter that determines whether $p_\theta(y|x)$ is close to $p_d(y|x)$ or to $u\{1,|Y|\}$. Assuming that $p_\theta(y|x)$ starts from $u\{1,|Y|\}$, $\lambda$ should start from $0$ and gradually increase during training. Note that $\lambda$ corresponds to a temperature $\alpha$ for $p_{\hat{\theta}}(y|x)$ in SANS, defined as

$p_{\hat{\theta}}(y|x) = \dfrac{\exp(\alpha f_\theta(x,y))}{\sum_{y' \in Y}\exp(\alpha f_\theta(x,y'))}$, (14)

where $\alpha$ also adjusts $p_{\hat{\theta}}(y|x)$ to be close to $p_d(y|x)$ or to $u\{1,|Y|\}$.
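The role of the temperature $\alpha$ can be seen in a short sketch (ours, with toy scores): $\alpha = 0$ makes $p_{\hat{\theta}}(y|x)$ in Eq. (14) exactly uniform, which by Eq. (11) keeps the objective near $p_d(y|x)$, while larger $\alpha$ sharpens $p_{\hat{\theta}}(y|x)$ and, in the approximation of Eq. (13), corresponds to a larger $\lambda$.

```python
import numpy as np

def p_theta_hat(scores, alpha):
    # Temperature-scaled softmax over scores f_theta(x, y), as in Eq. (14).
    z = alpha * scores
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])   # toy f_theta(x, y) values for one x

print(p_theta_hat(scores, alpha=0.0))  # uniform u{1,|Y|}: all negatives equal
print(p_theta_hat(scores, alpha=1.0))  # sharper: hard negatives weighted more
```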
4 Theoretical Relationships among Loss Functions
4.1 Corresponding SCE Form to NS with Frequency-based Noise
We derive a corresponding cross-entropy loss from the objective distribution of NS with frequency-based noise. We set $T_{x,y} = p_n(y|x)\sum_{y_i \in Y}\frac{p_d(y_i|x)}{p_n(y_i|x)}$, $q(y|x) = T_{x,y}^{-1}\,p_d(y|x)$, and $\Psi(z) = \sum_{i=1}^{\mathrm{len}(z)} z_i \log z_i$.
Under these conditions, following the derivation from Eq. (4) to Eq. (5), we can reformulate $B_{\Psi(z)}(q(y|x), p_\theta(y|x))$ as follows:

$B_{\Psi(z)}(q(y|x), p_\theta(y|x))$
$= -\sum_{x,y}\Bigl[\sum_{i=1}^{|Y|} T_{x,y_i}^{-1}\,p_d(y_i|x)\log p_\theta(y_i|x)\Bigr] p_d(x,y)$
$= -\frac{1}{|D|}\sum_{(x,y)\in D} T_{x,y}^{-1}\log p_\theta(y|x)$. (15)
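For concreteness, Eq. (15) can be read as a softmax cross-entropy in which each observed pair $(x, y)$ is re-weighted by $T_{x,y}^{-1}$. The sketch below (ours; the gold-label index and all probabilities are toy values) computes this weighted loss for two contexts:

```python
import numpy as np

def weighted_sce(log_p_theta, p_d, p_n, y=0):
    # Eq. (15): mean over observed (x, y) of -T_{x,y}^{-1} * log p_theta(y|x),
    # with T_{x,y} = p_n(y|x) * sum_i p_d(y_i|x) / p_n(y_i|x) (Section 4.1).
    # Each row of the inputs corresponds to one observed context x; the gold
    # label is assumed (toy setting) to sit at column index y.
    T = p_n[:, y] * np.sum(p_d / p_n, axis=1)
    return -np.mean(log_p_theta[:, y] / T)

# Two toy contexts with |Y| = 3 labels; all values are illustrative only.
p_d = np.array([[0.7, 0.2, 0.1],
                [0.6, 0.3, 0.1]])
p_n = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.4, 0.4]])
log_p_theta = np.log(np.array([[0.6, 0.25, 0.15],
                               [0.5, 0.3, 0.2]]))

print(weighted_sce(log_p_theta, p_d, p_n))  # SCE with per-example weight 1/T_{x,y}
```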