Proposition 5. When $p_n(y|x)$ is a uniform distribution, Eq. (9) equals $p_d(y|x)$.
Proof. This is described in Appendix C of the supplemental material.
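As a quick sanity check of Proposition 5 (ours, not part of the paper), the following sketch evaluates the NS objective distribution for a toy label set, assuming Eq. (9) has the form $p_d(y|x)\big/\bigl(p_n(y|x)\sum_{y_i \in Y} p_d(y_i|x)/p_n(y_i|x)\bigr)$ implied by the definition of $T_{x,y}$ in Section 4.1; with a uniform $p_n(y|x)$ it recovers $p_d(y|x)$ exactly. The function name and the toy numbers are ours.

```python
import numpy as np

def ns_objective(p_d, p_n):
    # Assumed Eq. (9) form of the NS objective distribution for one context x:
    # p(y|x) = p_d(y|x) / (p_n(y|x) * sum_i p_d(y_i|x) / p_n(y_i|x)).
    return p_d / (p_n * np.sum(p_d / p_n))

p_d = np.array([0.5, 0.3, 0.15, 0.05])           # toy data distribution, |Y| = 4
p_n_uniform = np.full_like(p_d, 1.0 / len(p_d))  # uniform noise distribution

print(ns_objective(p_d, p_n_uniform))            # [0.5 0.3 0.15 0.05] = p_d
```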
Dyer (2014) indicated that NS is equivalent to NCE when $\nu = |Y|$ and $p_n(y|x)$ is uniform. However, as we showed, the value of $\nu$ has no effect on the objective distribution, because Eq. (9) is independent of $\nu$.
3.1.2 NS with Frequency-based Noise
In the original setting of NS (Mikolov et al., 2013), the authors chose as $p_n(y|x)$ a unigram distribution of $y$, which is independent of $x$. Such a frequency-based distribution is computed from frequencies in a corpus and is independent of the model parameters $\theta$. In this case, unlike the case of a uniform distribution, $p_n(y|x)$ remains on the right-hand side of Eq. (9), so $p_\theta(y|x)$ decreases when $p_n(y|x)$ increases. Thus, we can interpret frequency-based noise as a type of smoothing of $p_d(y|x)$. The smoothing of NS w/ Freq decreases the importance of high-frequency labels in the training data, which helps in learning more general vector representations that can be used for various tasks as pre-trained vectors. Since we can expect pre-trained vectors to work as a prior (Erhan et al., 2010) that prevents models from overfitting, we tried NS w/ Freq for pre-training KGE models in our experiments.
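To illustrate the smoothing effect numerically (our sketch, not from the paper), the snippet below evaluates the same assumed Eq. (9) objective under a unigram-style $p_n(y|x)$: labels with high noise probability receive a smaller share of the objective than they have under $p_d(y|x)$. All numbers are illustrative.

```python
import numpy as np

def ns_objective(p_d, p_n):
    # Assumed Eq. (9) form of the NS objective distribution for one context x.
    return p_d / (p_n * np.sum(p_d / p_n))

p_d = np.array([0.5, 0.3, 0.15, 0.05])       # data distribution for one x
p_freq = np.array([0.6, 0.25, 0.1, 0.05])    # unigram (frequency-based) noise

print(ns_objective(p_d, p_freq))
# -> [0.184 0.265 0.331 0.221]: the most frequent label drops from 0.5 to ~0.18,
#    i.e. frequency-based noise acts as a smoothing of p_d(y|x).
```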
3.1.3 Self-Adversarial NS
Sun et al. (2019) recently proposed SANS, which uses $p_\theta(y|x)$ for generating negative samples. By replacing $p_n(y|x)$ with $p_\theta(y|x)$, the objective distribution when using SANS is as follows:

$p_\theta(y|x) = \dfrac{p_d(y|x)/p_{\hat{\theta}}(y|x)}{\sum_{y_i \in Y} p_d(y_i|x)/p_{\hat{\theta}}(y_i|x)}$, (10)
where $\hat{\theta}$ is the parameter set updated in the previous iteration. Because both the left-hand and right-hand sides of Eq. (10) include $p_\theta(y|x)$, we cannot obtain an analytical solution of $p_\theta(y|x)$ from this equation. However, we can consider special cases of $p_\theta(y|x)$ to gain an understanding of Eq. (10). At the beginning of training, $p_\theta(y|x)$ follows a discrete uniform distribution $u\{1,|Y|\}$ because $\theta$ is randomly initialized. In this situation, when we set $p_{\hat{\theta}}(y|x)$ in Eq. (10) to a discrete uniform distribution $u\{1,|Y|\}$, $p_\theta(y|x)$ becomes

$p_\theta(y|x) = p_d(y|x)$. (11)
Next, when we set $p_{\hat{\theta}}(y|x)$ in Eq. (10) to $p_d(y|x)$, $p_\theta(y|x)$ becomes

$p_\theta(y|x) = u\{1,|Y|\}$. (12)
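The two special cases in Eqs. (11) and (12) can be checked with a small numerical sketch (ours; it assumes Eq. (10) is the normalized ratio $p_d(y|x)/p_{\hat{\theta}}(y|x)$ as written above, and uses toy values):

```python
import numpy as np

def sans_objective(p_d, p_prev):
    # Assumed Eq. (10): objective distribution proportional to
    # p_d(y|x) / p_theta_hat(y|x), normalized over Y, for one context x.
    ratio = p_d / p_prev
    return ratio / ratio.sum()

p_d = np.array([0.5, 0.3, 0.15, 0.05])
uniform = np.full_like(p_d, 1.0 / len(p_d))

print(sans_objective(p_d, uniform))  # = p_d          (Eq. (11))
print(sans_objective(p_d, p_d))      # = u{1, |Y|}    (Eq. (12))
```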
In actual mini-batch training, $\theta$ is iteratively updated for every batch of data. Because $p_\theta(y|x)$ converges to $u\{1,|Y|\}$ when $p_{\hat{\theta}}(y|x)$ is close to $p_d(y|x)$, and $p_\theta(y|x)$ converges to $p_d(y|x)$ when $p_{\hat{\theta}}(y|x)$ is close to $u\{1,|Y|\}$, we can approximately regard the objective distribution of SANS as a mixture of $p_d$ and $u\{1,|Y|\}$. Thus, we can represent the objective distribution of $p_\theta(y|x)$ as

$p_\theta(y|x) \approx (1-\lambda)\,p_d(y|x) + \lambda\, u\{1,|Y|\}$, (13)
where $\lambda$ is a hyper-parameter that determines whether $p_\theta(y|x)$ is close to $p_d(y|x)$ or to $u\{1,|Y|\}$. Assuming that $p_\theta(y|x)$ starts from $u\{1,|Y|\}$, $\lambda$ should start from $0$ and gradually increase during training. Note that $\lambda$ corresponds to a temperature $\alpha$ for $p_{\hat{\theta}}(y|x)$ in SANS, defined as

$p_{\hat{\theta}}(y|x) = \dfrac{\exp(\alpha f_\theta(x,y))}{\sum_{y' \in Y}\exp(\alpha f_\theta(x,y'))}$, (14)

where $\alpha$ also adjusts $p_{\hat{\theta}}(y|x)$ to be close to $p_d(y|x)$ or to $u\{1,|Y|\}$.
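The role of the temperature $\alpha$ can be seen in a short sketch (ours, with toy scores): $\alpha = 0$ makes $p_{\hat{\theta}}(y|x)$ in Eq. (14) exactly uniform, which by Eq. (11) keeps the objective near $p_d(y|x)$, while larger $\alpha$ sharpens $p_{\hat{\theta}}(y|x)$ and, in the approximation of Eq. (13), corresponds to a larger $\lambda$.

```python
import numpy as np

def p_theta_hat(scores, alpha):
    # Temperature-scaled softmax over scores f_theta(x, y), as in Eq. (14).
    z = alpha * scores
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])   # toy f_theta(x, y) values for one x

print(p_theta_hat(scores, alpha=0.0))  # uniform u{1,|Y|}: all negatives equal
print(p_theta_hat(scores, alpha=1.0))  # sharper: hard negatives weighted more
```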
4 Theoretical Relationships among Loss Functions
4.1 Corresponding SCE Form to NS with Frequency-based Noise
We derive a corresponding cross-entropy loss from the objective distribution of NS with frequency-based noise. We set $T_{x,y} = p_n(y|x)\sum_{y_i \in Y}\frac{p_d(y_i|x)}{p_n(y_i|x)}$, $q(y|x) = T_{x,y}^{-1}\,p_d(y|x)$, and $\Psi(z) = \sum_{i=1}^{\mathrm{len}(z)} z_i \log z_i$.
Under these conditions, following the derivation from Eq. (4) to Eq. (5), we can reformulate $B_{\Psi(z)}(q(y|x), p_\theta(y|x))$ as follows:

$B_{\Psi(z)}(q(y|x), p_\theta(y|x))$
$= -\sum_{x,y}\Bigl[\sum_{i=1}^{|Y|} T_{x,y_i}^{-1}\,p_d(y_i|x)\log p_\theta(y_i|x)\Bigr] p_d(x,y)$
$= -\frac{1}{|D|}\sum_{(x,y)\in D} T_{x,y}^{-1}\log p_\theta(y|x)$. (15)
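For concreteness, Eq. (15) can be read as a softmax cross-entropy in which each observed pair $(x, y)$ is re-weighted by $T_{x,y}^{-1}$. The sketch below (ours; the gold-label index and all probabilities are toy values) computes this weighted loss for two contexts:

```python
import numpy as np

def weighted_sce(log_p_theta, p_d, p_n, y=0):
    # Eq. (15): mean over observed (x, y) of -T_{x,y}^{-1} * log p_theta(y|x),
    # with T_{x,y} = p_n(y|x) * sum_i p_d(y_i|x) / p_n(y_i|x) (Section 4.1).
    # Each row of the inputs corresponds to one observed context x; the gold
    # label is assumed (toy setting) to sit at column index y.
    T = p_n[:, y] * np.sum(p_d / p_n, axis=1)
    return -np.mean(log_p_theta[:, y] / T)

# Two toy contexts with |Y| = 3 labels; all values are illustrative only.
p_d = np.array([[0.7, 0.2, 0.1],
                [0.6, 0.3, 0.1]])
p_n = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.4, 0.4]])
log_p_theta = np.log(np.array([[0.6, 0.25, 0.15],
                               [0.5, 0.3, 0.2]]))

print(weighted_sce(log_p_theta, p_d, p_n))  # SCE with per-example weight 1/T_{x,y}
```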