One possible choice to define the probability of a
context word is the softmax:
$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}.$$
However, such a model is not adapted to our case as it implies that, given a word $w_t$, we only predict one context word $w_c$.
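For concreteness, here is a small sketch (in Python, with names of our own choosing) of how such a softmax probability would be computed from the scores $s(w_t, j)$; it is purely illustrative, since, as noted above, this is not the formulation we adopt.

```python
import numpy as np

def softmax_probability(scores, c):
    """Probability of the context word with index c, given
    scores[j] = s(w_t, j) for every word j in the vocabulary."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores[c] / exp_scores.sum()

# toy vocabulary of four words with scores s(w_t, j)
print(softmax_probability([2.0, 0.5, -1.0, 0.0], c=0))
```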
The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:
$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right),$$
where $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$, we can re-write the objective as:
$$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \right].$$
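As a rough illustration (the function and variable names below are ours, the scoring function $s$ is kept abstract and passed in as an argument, and negatives are drawn uniformly here rather than with the frequency-based sampling typically used in practice), this objective could be computed as follows:

```python
import math
import random

def logistic_loss(x):
    """Logistic loss l(x) = log(1 + exp(-x)), computed in a numerically
    stable way for large |x|."""
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)

def skipgram_ns_objective(corpus, contexts, score, vocab, k=5):
    """Negative-sampling objective summed over the whole corpus.

    corpus[t]   -- word at position t
    contexts[t] -- list of context words C_t for position t
    score(w, c) -- the scoring function s, kept abstract here
    vocab       -- list of words to draw negatives from (uniformly here)
    k           -- number of negative samples per (t, c) pair
    """
    total = 0.0
    for t, w_t in enumerate(corpus):
        for w_c in contexts[t]:
            total += logistic_loss(score(w_t, w_c))            # positive pair
            negatives = random.sample(vocab, k)                # the set N_{t,c}
            total += sum(logistic_loss(-score(w_t, n)) for n in negatives)
    return total
```

With the scalar-product parameterization introduced next, `score` would simply be the dot product between the corresponding word and context vectors.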
A natural parameterization for the scoring function $s$ between a word $w_t$ and a context word $w_c$ is to use word vectors. Let us define for each word $w$ in the vocabulary two vectors $u_w$ and $v_w$ in $\mathbb{R}^d$. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors $u_{w_t}$ and $v_{w_c}$, corresponding, respectively, to words $w_t$ and $w_c$. The score is then computed as the scalar product between word and context vectors: $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
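A minimal sketch of this parameterization, assuming hypothetical lookup tables u_vecs and v_vecs from words to their input and output vectors:

```python
import numpy as np

d = 4  # embedding dimension (illustrative)

# hypothetical lookup tables from words to their input (u) and output (v) vectors
rng = np.random.default_rng(0)
u_vecs = {"cat": rng.standard_normal(d), "sat": rng.standard_normal(d)}
v_vecs = {"cat": rng.standard_normal(d), "sat": rng.standard_normal(d)}

def score(w_t, w_c):
    """s(w_t, w_c) = u_{w_t}^T v_{w_c}: scalar product between the input
    vector of the word and the output vector of the context word."""
    return float(np.dot(u_vecs[w_t], v_vecs[w_c]))

print(score("cat", "sat"))
```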
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function $s$ that takes this information into account.
Each word $w$ is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, which allows us to distinguish prefixes and suffixes from other character sequences. We also include the word $w$ itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and n = 3 as an example, it will be represented by the character n-grams:

<wh, whe, her, ere, re>

and the special sequence

<where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for n greater than or equal to 3 and smaller than or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
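The following sketch (the function name char_ngrams is ours) illustrates this extraction, with boundary symbols, the full word included as a special sequence, and n ranging from 3 to 6 by default:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Bag of character n-grams of a word, with boundary symbols added
    and the full word itself included as a special sequence."""
    bounded = "<" + word + ">"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(bounded) - n + 1):
            ngrams.add(bounded[i:i + n])
    ngrams.add(bounded)  # the special sequence, e.g. "<where>"
    return ngrams

# restricted to n = 3, "where" yields <wh, whe, her, ere, re> plus <where>
print(sorted(char_ngrams("where", n_min=3, n_max=3)))
```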
Suppose that you are given a dictionary of n-grams of size $G$. Given a word $w$, let us denote by $\mathcal{G}_w \subset \{1, \dots, G\}$ the set of n-grams appearing in $w$. We associate a vector representation $z_g$ to each n-gram $g$. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c.$$
This simple model allows sharing the representations across words, thus allowing reliable representations to be learned for rare words.
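As an illustrative sketch (assuming hypothetical dictionaries z_vecs and v_vecs mapping n-grams and context words to their vectors), this scoring function could be computed as:

```python
import numpy as np

def subword_score(ngrams_of_w, context, z_vecs, v_vecs):
    """s(w, c) = sum over g in G_w of z_g^T v_c: the word is represented
    by the sum of the vector representations of its n-grams."""
    word_vec = np.sum([z_vecs[g] for g in ngrams_of_w], axis=0)
    return float(np.dot(word_vec, v_vecs[context]))

# toy example in dimension 3
z_vecs = {"<wh": np.ones(3), "whe": np.ones(3)}   # n-gram vectors z_g
v_vecs = {"cat": np.full(3, 0.5)}                 # context vectors v_c
print(subword_score({"<wh", "whe"}, "cat", z_vecs, v_vecs))  # 3.0
```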
In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to K. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant).[1] We set $K = 2 \times 10^6$ below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.
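A sketch of this bucketing step is given below; the constants are the published 32-bit FNV-1a parameters, the bucket indices are 0-based for convenience, and the exact details of the reference implementation may differ:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a byte string."""
    h = 2166136261                       # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF  # multiply by the FNV prime, keep 32 bits
    return h

K = 2_000_000  # number of hash buckets, K = 2 * 10^6

def ngram_bucket(ngram: str) -> int:
    """Map a character n-gram to a bucket index in {0, ..., K-1}."""
    return fnv1a_32(ngram.encode("utf-8")) % K

print(ngram_bucket("<wh"))
```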
4 Experimental setup
4.1 Baseline
In most experiments (except in Sec. 5.3), we
compare our model to the C implementation
[1] http://www.isthe.com/chongo/tech/comp/fnv