FastText: Fast Word Vector Enhancement Based on Character n-grams
"FastText是一种在2016年发布的先进自然语言处理技术,由Facebook AI Research团队的Piotr Bojanowski、Edouard Grave、Armand Joulin和Tomas Mikolov共同提出。该方法旨在解决传统词向量模型忽视词形信息的问题,特别是在词汇庞大且包含许多罕见词汇的语言中。FastText基于Skip-gram模型,但有所创新。 在FastText中,每个单词不再被单独表示为一个向量,而是作为一个字符n-gram(如连续的字符片段)的集合来处理。例如,单词"cat"可能被分解为'n-gram'序列'c', 'ca', 'cat',等。每个字符n-gram都对应一个向量,而单词的向量则由这些n-gram向量的加权和构成。这种设计允许模型学习到单词内部的结构信息,即使单词没有在训练数据中出现也能生成其向量表示。 这种新颖的方法具有快速训练的优势,能够有效地处理大规模未标注语料库,提高了模型的效率。它通过将词义和词形联系起来,增强了词向量的表达能力,这对于诸如词相似度和类比任务的自然语言理解至关重要。在九种不同语言的实验中,FastText展示了其在词语关系理解和近义词检测方面的出色性能,相比其他最近提出的基于形态学的词表征方法,它展现了更为优越的结果。 总结来说,FastText是通过引入字符n-gram信息,解决了词向量模型对词形敏感度不足的问题,不仅提升了模型的泛化能力,还加快了在大规模数据上的训练速度,为多语言自然语言处理任务带来了显著的改进。"
One possible choice to define the probability of a
context word is the softmax:
$$p(w_c \mid w_t) = \frac{e^{s(w_t,\, w_c)}}{\sum_{j=1}^{W} e^{s(w_t,\, j)}}.$$

However, such a model is not adapted to our case as it implies that, given a word $w_t$, we only predict one context word $w_c$.
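As an illustration (not from the paper), here is a minimal Python sketch of this softmax, assuming the scores $s(w_t, j)$ have already been computed for every word $j$ in a vocabulary of size $W$; the function name and setup are ours:

```python
import numpy as np

def softmax_context_prob(scores, c):
    """Probability p(w_c | w_t) given scores[j] = s(w_t, j) for all j in the vocabulary.

    `c` is the index of the context word. Illustrative sketch, not the paper's code.
    """
    exp_scores = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp_scores[c] / exp_scores.sum()
```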
The problem of predicting context words can instead be framed as a set of independent binary classification tasks. Then the goal is to independently predict the presence (or absence) of context words. For the word at position $t$ we consider all context words as positive examples and sample negatives at random from the dictionary. For a chosen context position $c$, using the binary logistic loss, we obtain the following negative log-likelihood:
$$\log\!\left(1 + e^{-s(w_t,\, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\!\left(1 + e^{s(w_t,\, n)}\right),$$
where $\mathcal{N}_{t,c}$ is a set of negative examples sampled from the vocabulary. By denoting the logistic loss function $\ell : x \mapsto \log(1 + e^{-x})$, we can re-write the objective as:

$$\sum_{t=1}^{T} \left[\, \sum_{c \in \mathcal{C}_t} \ell(s(w_t, w_c)) + \sum_{n \in \mathcal{N}_{t,c}} \ell(-s(w_t, n)) \right].$$
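A minimal Python sketch of this objective for a single target position, under the assumption that a scoring function and pre-sampled negatives are available; the helper names (`logistic_loss`, `position_loss`) are ours, not the paper's or the fastText library's:

```python
import numpy as np

def logistic_loss(x):
    """The loss l(x) = log(1 + exp(-x)), written in a numerically stable form."""
    return np.logaddexp(0.0, -x)

def position_loss(score_fn, w_t, context_words, negatives):
    """Negative log-likelihood for one target position t (illustrative sketch).

    `score_fn(w_t, w)` returns s(w_t, w); `context_words` are the positive examples
    C_t, and `negatives[c]` holds the sampled negatives N_{t,c} for context word c.
    """
    loss = 0.0
    for c in context_words:
        loss += logistic_loss(score_fn(w_t, c))        # l(s(w_t, w_c))
        for n in negatives[c]:
            loss += logistic_loss(-score_fn(w_t, n))   # l(-s(w_t, n))
    return loss
```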
A natural parameterization for the scoring function $s$ between a word $w_t$ and a context word $w_c$ is to use word vectors. Let us define for each word $w$ in the vocabulary two vectors $u_w$ and $v_w$ in $\mathbb{R}^d$. These two vectors are sometimes referred to as input and output vectors in the literature. In particular, we have vectors $u_{w_t}$ and $v_{w_c}$, corresponding, respectively, to words $w_t$ and $w_c$. Then the score can be computed as the scalar product between word and context vectors as $s(w_t, w_c) = u_{w_t}^{\top} v_{w_c}$. The model described in this section is the skipgram model with negative sampling, introduced by Mikolov et al. (2013b).
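A hedged sketch of this parameterization: two embedding matrices hold the input vectors $u_w$ and output vectors $v_w$, and the score is their dot product. The sizes and initialization below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative vocabulary size W and dimension d (assumptions, not from the paper).
W, d = 10_000, 100
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(W, d))  # "input" vectors u_w
V = rng.normal(scale=0.1, size=(W, d))  # "output" vectors v_w

def score(t, c):
    """s(w_t, w_c) = u_{w_t}^T v_{w_c}, with t and c word indices."""
    return U[t] @ V[c]
```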
3.2 Subword model
By using a distinct vector representation for each word, the skipgram model ignores the internal structure of words. In this section, we propose a different scoring function $s$, in order to take this information into account.

Each word $w$ is represented as a bag of character n-grams. We add special boundary symbols < and > at the beginning and end of words, allowing us to distinguish prefixes and suffixes from other character sequences. We also include the word $w$ itself in the set of its n-grams, to learn a representation for each word (in addition to character n-grams). Taking the word where and $n = 3$ as an example, it will be represented by the character n-grams:

<wh, whe, her, ere, re>

and the special sequence

<where>.

Note that the sequence <her>, corresponding to the word her, is different from the tri-gram her from the word where. In practice, we extract all the n-grams for $n$ greater or equal to 3 and smaller or equal to 6. This is a very simple approach, and different sets of n-grams could be considered, for example taking all prefixes and suffixes.
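The n-gram extraction described above can be sketched as follows; this is our own illustration (the function name and defaults are assumptions), not the paper's or the fastText library's code:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary symbols, plus the full word itself."""
    token = "<" + word + ">"
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            ngrams.add(token[i:i + n])
    ngrams.add(token)  # the special sequence for the whole word, e.g. "<where>"
    return ngrams

# For n = 3 only, "where" yields <wh, whe, her, ere, re> and <where>:
print(sorted(char_ngrams("where", 3, 3)))
```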
Suppose that you are given a dictionary of n-grams of size $G$. Given a word $w$, let us denote by $\mathcal{G}_w \subset \{1, \ldots, G\}$ the set of n-grams appearing in $w$. We associate a vector representation $z_g$ to each n-gram $g$. We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c.$$
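A minimal sketch of this subword scoring function, assuming each n-gram index $g$ has a row $z_g$ in a matrix `Z` and each context word has an output vector $v_c$; all names and sizes below are illustrative assumptions:

```python
import numpy as np

# Illustrative sizes: G n-gram vectors z_g and W context vectors v_c of dimension d.
G, W, d = 100_000, 10_000, 100
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(G, d))  # n-gram vectors z_g
V = rng.normal(scale=0.1, size=(W, d))  # context ("output") vectors v_c

def subword_score(ngram_ids, c):
    """s(w, c) = sum_{g in G_w} z_g^T v_c, where `ngram_ids` lists the indices G_w
    of the n-grams of w and `c` is the index of the context word."""
    word_vec = Z[ngram_ids].sum(axis=0)  # word represented as the sum of its n-gram vectors
    return word_vec @ V[c]
```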
This simple model allows sharing the representations across words, which makes it possible to learn reliable representations for rare words.

In order to bound the memory requirements of our model, we use a hashing function that maps n-grams to integers in 1 to $K$. We hash character sequences using the Fowler-Noll-Vo hashing function (specifically the FNV-1a variant).¹ We set $K = 2 \cdot 10^6$ below. Ultimately, a word is represented by its index in the word dictionary and the set of hashed n-grams it contains.

¹ http://www.isthe.com/chongo/tech/comp/fnv
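For illustration, a sketch of 32-bit FNV-1a hashing used to bucket n-grams into $\{1, \ldots, K\}$; the exact bucketing (modulo plus one) is our assumption, since the paper only specifies the FNV-1a hash and $K = 2 \cdot 10^6$:

```python
def fnv1a_32(s):
    """32-bit FNV-1a hash of a UTF-8 string."""
    h = 2166136261
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def ngram_bucket(ngram, K=2_000_000):
    """Map an n-gram to a bucket in {1, ..., K} (illustrative bucketing)."""
    return (fnv1a_32(ngram) % K) + 1
```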
4 Experimental setup

4.1 Baseline

In most experiments (except in Sec. 5.3), we compare our model to the C implementation