Deep Contextualized Word Vectors: A Key to Improving NLP Performance
"《NLP:深度上下文化词表示》一文探讨了一种新颖的自然语言处理技术,它引入了深度上下文化词向量,这是一种在预训练双向语言模型(biLM)内部状态的基础上学习得到的词表示方法。与传统的词嵌入模型如word2vec不同,这些词向量不仅捕捉到了单词的基本语法和语义特性,还能够反映出在不同语言上下文中词汇使用的多样性,即处理一词多义问题。 传统的word2vec模型是在独立的语料库上进行训练,词向量通常是静态的,缺乏对特定任务或语境的适应性。而在本文中,作者通过将词向量视为biLM内部状态的函数,使得词向量能根据上下文动态变化,更好地反映了词语在不同情境下的含义。这种深度上下文化的词表示方法允许模型在保持语义信息的同时,适应不同的NLP任务,比如问答、文本蕴含和情感分析等。 研究者展示了如何将这些深度词向量无缝融入现有的模型架构中,并显著提升了在六个挑战性的NLP任务上的性能。这表明,深度语言模型的预训练过程对于获取更丰富的语义信息至关重要。通过揭示预训练网络的深层结构,研究人员能够更好地利用模型的内在能力,从而实现更准确和灵活的语言理解。 总结来说,这篇论文的核心贡献在于提出了一种新型的深度学习方法,通过深度上下文化词向量来捕捉单词的多义性和语境依赖性,极大地提高了自然语言处理任务的性能。同时,它强调了深度语言模型内部机制的理解和利用对于提升NLP应用效果的重要性。"
Given a sequence of N tokens, $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1}).$$
Recent state-of-the-art neural language models (Józefowicz et al., 2016; Melis et al., 2017; Merity et al., 2017) compute a context-independent token representation $x_k^{LM}$ (via token embeddings or a CNN over characters), then pass it through $L$ layers of forward LSTMs. At each position $k$, each LSTM layer outputs a context-dependent representation $\overrightarrow{h}_{k,j}^{LM}$ where $j = 1, \ldots, L$. The top layer LSTM output, $\overrightarrow{h}_{k,L}^{LM}$, is used to predict the next token $t_{k+1}$ with a Softmax layer.
A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N).$$

It can be implemented in an analogous way to a forward LM, with each backward LSTM layer $j$ in an $L$-layer deep model producing representations $\overleftarrow{h}_{k,j}^{LM}$ of $t_k$ given $(t_{k+1}, \ldots, t_N)$.
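One common way to realize this, assumed here rather than prescribed by the paper, is to reuse a forward-style model on the reversed sequence and flip its outputs back to the original order:

```python
import torch

def backward_lm_outputs(forward_style_lm, tokens):
    """Run a forward-style LM over the reversed sequence so that position k is
    conditioned on (t_{k+1}, ..., t_N); flip the outputs back to original order."""
    reversed_tokens = torch.flip(tokens, dims=[1])      # reverse the time axis
    log_probs = forward_style_lm(reversed_tokens)       # (batch, N, vocab)
    return torch.flip(log_probs, dims=[1])
```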
A biLM combines both a forward and backward LM. Our formulation jointly maximizes the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big).$$
We tie the parameters for both the token representation ($\Theta_x$) and Softmax layer ($\Theta_s$) in the forward and backward direction while maintaining separate parameters for the LSTMs in each direction. Overall, this formulation is similar to the approach of Peters et al. (2017), with the exception that we share some weights between directions instead of using completely independent parameters. In the next section, we depart from previous work by introducing a new approach for learning word representations that are a linear combination of the biLM layers.
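A sketch of the joint objective under the same simplified setup, where `fwd_lm` and `bwd_lm` are assumed to be two `ForwardLM`-style models that share their embedding and Softmax modules (function and variable names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def bilm_loss(fwd_lm, bwd_lm, tokens):
    """Joint biLM objective: summed forward and backward log likelihoods.

    fwd_lm and bwd_lm are assumed to share their embedding (Theta_x) and
    Softmax (Theta_s) modules but keep separate LSTM parameters, mirroring
    the weight tying described in the text.
    """
    def next_token_nll(lm, seq):
        log_probs = lm(seq)                                   # (batch, N, vocab)
        vocab = log_probs.size(-1)
        # position k scores token k+1, i.e. log p(t_{k+1} | t_1, ..., t_k)
        return F.nll_loss(log_probs[:, :-1].reshape(-1, vocab),
                          seq[:, 1:].reshape(-1))

    fwd_nll = next_token_nll(fwd_lm, tokens)
    # Reversing the sequence turns "predict the previous token" into ordinary
    # next-token prediction for the backward direction.
    bwd_nll = next_token_nll(bwd_lm, torch.flip(tokens, dims=[1]))
    return fwd_nll + bwd_nll    # minimizing this maximizes the joint log likelihood
```

Tying could be realized, for example, by assigning the same `nn.Embedding` and `nn.Linear` modules to both directions' models before training (e.g. `bwd_lm.embed = fwd_lm.embed`).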
3.2 ELMo
ELMo is a task specific combination of the intermediate layer representations in the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L + 1$ representations

$$R_k = \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L\} = \{h_{k,j}^{LM} \mid j = 0, \ldots, L\},$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$, for each biLSTM layer.
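To make the bookkeeping concrete, the following sketch (a hypothetical helper, assuming the per-layer forward and backward hidden states have already been collected from the two LSTMs) stacks the $2L + 1$ representations for every position:

```python
import torch

def collect_bilm_layers(x_lm, fwd_layers, bwd_layers):
    """Stack the 2L + 1 biLM representations into h_{k,j}^{LM}, j = 0..L.

    x_lm:       (batch, N, 2*d) token layer, here assumed already projected or
                duplicated to match the width of the concatenated LSTM states
    fwd_layers: list of L tensors, each (batch, N, d), forward LSTM outputs
    bwd_layers: list of L tensors, each (batch, N, d), backward LSTM outputs
    Returns:    (L + 1, batch, N, 2*d)
    """
    layers = [x_lm]                                       # j = 0: token layer
    for fwd, bwd in zip(fwd_layers, bwd_layers):
        layers.append(torch.cat([fwd, bwd], dim=-1))      # [h_fwd ; h_bwd]
    return torch.stack(layers, dim=0)
```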
For inclusion in a downstream model, ELMo collapses all layers in $R$ into a single vector, $\mathrm{ELMo}_k = E(R_k; \Theta_e)$. In the simplest case, ELMo just selects the top layer, $E(R_k) = h_{k,L}^{LM}$, as in TagLM (Peters et al., 2017) and CoVe (McCann et al., 2017). More generally, we compute a task specific weighting of all biLM layers:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}. \quad (1)$$
In (1), $s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector. $\gamma$ is of practical importance to aid the optimization process (see supplemental material for details). Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
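Equation (1) reduces to a softmax over $L + 1$ learned scalars followed by a learned scale. Below is a minimal PyTorch sketch of this task-specific mixing, with the optional layer normalization behind a flag; the class and parameter names are mine, not the paper's.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo^{task}_k = gamma * sum_j softmax(w)_j * h_{k,j}^{LM}  (Eq. 1)."""

    def __init__(self, num_layers, dim, do_layer_norm=False):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # -> s^{task}
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^{task}
        self.do_layer_norm = do_layer_norm
        if do_layer_norm:
            # optional per-layer normalization before weighting
            self.layer_norm = nn.LayerNorm(dim)

    def forward(self, layers):
        # layers: (L+1, batch, N, dim) stacked biLM representations
        s = torch.softmax(self.scalar_weights, dim=0)   # softmax-normalized weights
        if self.do_layer_norm:
            layers = self.layer_norm(layers)            # normalizes the last dimension
        mixed = (s.view(-1, 1, 1, 1) * layers).sum(dim=0)
        return self.gamma * mixed
```

A downstream model would create one such mix per task, e.g. `mix = ScalarMix(num_layers=L + 1, dim=2 * d)`, and apply it to the stacked tensor from the previous sketch.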
3.3 Using biLMs for supervised NLP tasks
Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model. We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations, as described below.
First consider the lowest layers of the supervised model without the biLM. Most supervised NLP models share a common architecture at the lowest layers, allowing us to add ELMo in a consistent, unified manner. Given a sequence of tokens $(t_1, \ldots, t_N)$, it is standard to form a context-independent token representation $x_k$ for each token position using pre-trained word embeddings and optionally character-based representations. Then, the model forms a context-sensitive representation $h_k$, typically using either bidirectional RNNs, CNNs, or feed forward networks.

To add ELMo to the supervised model, we first freeze the weights of the biLM and then concatenate the ELMo vector $\mathrm{ELMo}_k^{task}$ with $x_k$ and pass the ELMo enhanced representation $[x_k; \mathrm{ELMo}_k^{task}]$ into the task RNN.
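Putting the pieces together, the recipe above (freeze the biLM, mix its layers, concatenate with $x_k$, and feed the task RNN) might look like the following sketch; `bilm`, `scalar_mix`, `word_embed`, and `task_rnn` are placeholder modules, and the biLM's output format is an assumption.

```python
import torch
import torch.nn as nn

class ELMoEnhancedTagger(nn.Module):
    """Downstream model that concatenates ELMo with its own token representation x_k."""

    def __init__(self, bilm, scalar_mix, word_embed, task_rnn):
        super().__init__()
        self.bilm = bilm
        for p in self.bilm.parameters():           # freeze the pre-trained biLM
            p.requires_grad = False
        self.scalar_mix = scalar_mix               # trainable s^{task}, gamma^{task}
        self.word_embed = word_embed               # produces the task's own x_k
        self.task_rnn = task_rnn                   # e.g. a biLSTM producing h_k

    def forward(self, tokens):
        with torch.no_grad():                      # record all biLM layer activations
            bilm_layers = self.bilm(tokens)        # assumed shape (L+1, batch, N, dim)
        elmo = self.scalar_mix(bilm_layers)        # task-specific layer combination
        x = self.word_embed(tokens)                # context-independent x_k
        enhanced = torch.cat([x, elmo], dim=-1)    # [x_k ; ELMo^{task}_k]
        h, _ = self.task_rnn(enhanced)             # context-sensitive h_k
        return h
```

Only the scalar mixing weights, $\gamma^{task}$, and the task model's own parameters receive gradients here; the biLM activations are treated as fixed features.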