Improving Word Vector Representations with Subword Information

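"Enriching Word Vectors with Subword Information" (the paper behind fastText) was written by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov, all of Facebook AI Research. The paper studies how subword information can be used to enrich word vectors and thereby improve performance on natural language processing tasks.

Continuous word representations (word embeddings) are vector representations trained on large unlabeled corpora, and they are useful for many NLP tasks. Conventional embedding models, such as the CBOW and skip-gram models of Word2Vec, ignore the morphology of words by assigning each word its own independent vector. This is a limitation for languages with large vocabularies and many rare words.

The authors propose a method based on the skip-gram model in which each word is represented as a bag of character n-grams. Each character n-gram has its own vector, and the vector of a word is the sum of the vectors of its n-grams. This approach is fast, allows models to be trained quickly on large corpora, and can compute vector representations for words that did not appear in the training data.

The word vectors are evaluated on nine languages, on both word similarity and analogy tasks. Compared with recently proposed morphological word representations, the method performs well across these tasks, particularly on rare words and out-of-vocabulary words.

fastText is the practical implementation of this method. It extends Word2Vec with character-level information, so it captures structure inside words and can produce vectors for words that were never seen during training. This makes it more flexible for inputs such as misspellings and rare morphological forms. The paper and the fastText library provide an effective way to account for word morphology and improve word vector quality, which matters most for low-frequency words and multilingual settings.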

One possible choice to define the probability of a context word is the softmax:

$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.$$

However, such a model is not adapted to our case as it implies that, given a word $w_t$, we only predict one context word $w_c$.
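
The softmax above can be illustrated with a minimal sketch. The function name `softmax_context_probs` and the toy scores are hypothetical and not part of the paper; the point is only that the probabilities over the vocabulary sum to one, so the model distributes a single prediction across all candidate context words.

```python
import math

def softmax_context_probs(scores):
    """Turn raw scores s(w_t, j), j = 1..W, into probabilities p(j | w_t)."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a toy vocabulary of W = 4 words.
toy_scores = [2.0, 0.5, -1.0, 0.0]
probs = softmax_context_probs(toy_scores)
```

Because the denominator sums over the whole vocabulary, each evaluation costs time proportional to W, which is one practical reason to prefer the binary formulation described next.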

The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position t we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position c, using the binary logistic loss, we obtain the following negative log-likelihood:

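$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),$$

where $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$, we can re-write the objective as:

$$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell(s(w_t, w_c)) + \sum_{n \in \mathcal{N}_{t,c}} \ell(-s(w_t, n)) \right].$$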
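
As a rough illustration of the objective, the sketch below evaluates the loss for one context position given precomputed scores. The names `logistic_loss` and `position_loss` are assumptions made for this example, and the scores are passed in directly rather than computed from vectors.

```python
import math

def logistic_loss(x):
    """ell(x) = log(1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    # For x < 0: log(1 + exp(-x)) = -x + log(1 + exp(x)).
    return -x + math.log1p(math.exp(x))

def position_loss(pos_score, neg_scores):
    """Loss for one (t, c) pair: the positive term plus one term per negative.

    pos_score  -- s(w_t, w_c) for the observed context word
    neg_scores -- s(w_t, n) for each sampled negative n
    """
    return logistic_loss(pos_score) + sum(logistic_loss(-s) for s in neg_scores)

# Hypothetical scores: one positive and two sampled negatives.
example = position_loss(1.5, [-0.5, 0.2])
```

A high positive score and low negative scores both drive the loss toward zero; summing `position_loss` over all positions t and context positions c gives the full objective.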





A natural parameterization for the scoring function s between a word $w_t$ and a context word $w_c$ is to use word vectors. Let us define for each word w in the vocabulary two vectors $u_w$ and $v_w$ in $\mathbb{R}^d$. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors $u_{w_t}$ and $v_{w_c}$, corresponding, respectively, to words $w_t$ and $w_c$. Then the score can be computed as the scalar product between word and context vectors as $s(w_t, w_c) = u_{w_t}^\top v_{w_c}$. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
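
The scalar-product score can be sketched with two small lookup tables. The tables `input_vectors` and `output_vectors`, their contents, and the dimension d = 3 are made up for illustration only.

```python
def score(u_t, v_c):
    """s(w_t, w_c) = u_{w_t}^T v_{w_c}: dot product of input and output vectors."""
    if len(u_t) != len(v_c):
        raise ValueError("vectors must share the same dimension d")
    return sum(a * b for a, b in zip(u_t, v_c))

# Hypothetical input (u) and output (v) vectors with d = 3.
input_vectors = {"where": [1.0, 2.0, 3.0]}
output_vectors = {"is": [4.0, 5.0, 6.0]}

s_where_is = score(input_vectors["where"], output_vectors["is"])
```

Keeping separate input and output tables means a word's role as a center word and its role as a context word are parameterized independently.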

3.2 Subword model

By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function s, in order to take into account this information.

Each word w is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, which makes it possible to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams:

<wh, whe, her, ere, re>

and the special sequence

<where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater than or equal to 3 and smaller than or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
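
The extraction step described above can be sketched in a few lines. The function name `char_ngrams` and its default arguments are assumptions for this example; it is a plain reading of the description, not the reference implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, plus the whole-word sequence.

    The word is wrapped in the boundary symbols < and > before extraction.
    """
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    # The special sequence for the word itself; skip it if an n-gram of the
    # same length already produced it (this happens for short words).
    if token not in grams:
        grams.append(token)
    return grams

# Reproduces the example from the text with n = 3 only.
where_trigrams = char_ngrams("where", n_min=3, n_max=3)
```

With the default range of 3 to 6, `char_ngrams("her")` contains both the tri-gram `her` and the sequence `<her>`, which are distinct strings, as the text notes.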

Suppose that you are given a dictionary of n-grams of size G. Given a word w, let us denote by $\mathcal{G}_w \subset \{1, \dots, G\}$ the set of n-grams appearing in w. We associate a vector representation $z_g$ to each n-gram g. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c.$$

This simple model allows sharing the representations across words, thus allowing to learn reliable representation for rare words.
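
A small sketch of this scoring function follows. The n-gram table `z`, the index set, and the context vector are toy values chosen for illustration, and the helper names are hypothetical. The example also checks that summing the n-gram vectors first and then taking one dot product gives the same score as summing the per-n-gram dot products.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def word_vector(ngram_ids, z):
    """Sum the vectors z_g over the n-gram indices g that occur in the word."""
    dim = len(next(iter(z.values())))
    total = [0.0] * dim
    for g in ngram_ids:
        for k, value in enumerate(z[g]):
            total[k] += value
    return total

def subword_score(ngram_ids, z, v_c):
    """s(w, c) = sum over g in G_w of z_g^T v_c."""
    return dot(word_vector(ngram_ids, z), v_c)

# Toy n-gram table (G = 3, d = 2) and a toy context vector.
z = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [1.0, 1.0]}
g_w = {0, 2}          # hypothetical n-gram indices appearing in the word
v_c = [3.0, 4.0]

s_wc = subword_score(g_w, z, v_c)
s_termwise = sum(dot(z[g], v_c) for g in g_w)
```

Two words that share an n-gram index share the corresponding row of `z`, which is how a rare word can borrow statistical strength from frequent words with similar spelling.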

In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant). We set $K = 2 \cdot 10^6$ below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
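
The hashing step might look like the sketch below. It assumes the 32-bit FNV-1a variant with its standard offset basis and prime, UTF-8 encoding of the n-gram, and 0-based bucket indices (0 to K - 1, rather than the 1 to K of the text); the text does not specify these details, and the function names are hypothetical.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193         # standard 32-bit FNV prime

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the result in 32 bits
    return h

def ngram_bucket(ngram: str, num_buckets: int = 2_000_000) -> int:
    """Map an n-gram to a bucket index in [0, num_buckets)."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets

bucket = ngram_bucket("<wh")
```

Because K is fixed, distinct n-grams can collide in the same bucket and then share one vector; this is the price paid for bounding the size of the n-gram table.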

4 Experimental setup

4.1 Baseline

In most experiments (except in Sec. 5.3), we

compare our model to the C implementation

http://www.isthe.com/chongo/tech/comp/fnv
