Deep Contextualized Word Vectors: A Key to Improving NLP Performance
"《NLP:深度上下文化词表示》一文探讨了一种新颖的自然语言处理技术,它引入了深度上下文化词向量,这是一种在预训练双向语言模型(biLM)内部状态的基础上学习得到的词表示方法。与传统的词嵌入模型如word2vec不同,这些词向量不仅捕捉到了单词的基本语法和语义特性,还能够反映出在不同语言上下文中词汇使用的多样性,即处理一词多义问题。 传统的word2vec模型是在独立的语料库上进行训练,词向量通常是静态的,缺乏对特定任务或语境的适应性。而在本文中,作者通过将词向量视为biLM内部状态的函数,使得词向量能根据上下文动态变化,更好地反映了词语在不同情境下的含义。这种深度上下文化的词表示方法允许模型在保持语义信息的同时,适应不同的NLP任务,比如问答、文本蕴含和情感分析等。 研究者展示了如何将这些深度词向量无缝融入现有的模型架构中,并显著提升了在六个挑战性的NLP任务上的性能。这表明,深度语言模型的预训练过程对于获取更丰富的语义信息至关重要。通过揭示预训练网络的深层结构,研究人员能够更好地利用模型的内在能力,从而实现更准确和灵活的语言理解。 总结来说,这篇论文的核心贡献在于提出了一种新型的深度学习方法,通过深度上下文化词向量来捕捉单词的多义性和语境依赖性,极大地提高了自然语言处理任务的性能。同时,它强调了深度语言模型内部机制的理解和利用对于提升NLP应用效果的重要性。"
Given a sequence of N tokens, $(t_1, t_2, \ldots, t_N)$, a forward language model computes the probability of the sequence by modeling the probability of token $t_k$ given the history $(t_1, \ldots, t_{k-1})$:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1}).$$
Recent state-of-the-art neural language models (Józefowicz et al., 2016; Melis et al., 2017; Merity et al., 2017) compute a context-independent token representation $x_k^{LM}$ (via token embeddings or a CNN over characters), then pass it through $L$ layers of forward LSTMs. At each position $k$, each LSTM layer outputs a context-dependent representation $\overrightarrow{h}_{k,j}^{LM}$ where $j = 1, \ldots, L$. The top layer LSTM output, $\overrightarrow{h}_{k,L}^{LM}$, is used to predict the next token $t_{k+1}$ with a Softmax layer.
A backward LM is similar to a forward LM, except it runs over the sequence in reverse, predicting the previous token given the future context:

$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N).$$

It can be implemented in an analogous way to a forward LM, with each backward LSTM layer $j$ in an $L$-layer deep model producing representations $\overleftarrow{h}_{k,j}^{LM}$ of $t_k$ given $(t_{k+1}, \ldots, t_N)$.
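One common way to realize this, assumed here rather than prescribed by the paper, is to reuse a forward-style model on the reversed sequence and flip its outputs back to the original order:

```python
import torch

def backward_lm_outputs(forward_style_lm, tokens):
    """Run a forward-style LM over the reversed sequence so that position k is
    conditioned on (t_{k+1}, ..., t_N); flip the outputs back to original order."""
    reversed_tokens = torch.flip(tokens, dims=[1])      # reverse the time axis
    log_probs = forward_style_lm(reversed_tokens)       # (batch, N, vocab)
    return torch.flip(log_probs, dims=[1])
```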
A biLM combines both a forward and backward LM. Our formulation jointly maximizes the log likelihood of the forward and backward directions:

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big).$$
We tie the parameters for both the token representation ($\Theta_x$) and Softmax layer ($\Theta_s$) in the forward and backward direction while maintaining separate parameters for the LSTMs in each direction. Overall, this formulation is similar to the approach of Peters et al. (2017), with the exception that we share some weights between directions instead of using completely independent parameters. In the next section, we depart from previous work by introducing a new approach for learning word representations that are a linear combination of the biLM layers.
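A sketch of the joint objective under the same simplified setup, where `fwd_lm` and `bwd_lm` are assumed to be two `ForwardLM`-style models that share their embedding and Softmax modules (function and variable names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def bilm_loss(fwd_lm, bwd_lm, tokens):
    """Joint biLM objective: summed forward and backward log likelihoods.

    fwd_lm and bwd_lm are assumed to share their embedding (Theta_x) and
    Softmax (Theta_s) modules but keep separate LSTM parameters, mirroring
    the weight tying described in the text.
    """
    def next_token_nll(lm, seq):
        log_probs = lm(seq)                                   # (batch, N, vocab)
        vocab = log_probs.size(-1)
        # position k scores token k+1, i.e. log p(t_{k+1} | t_1, ..., t_k)
        return F.nll_loss(log_probs[:, :-1].reshape(-1, vocab),
                          seq[:, 1:].reshape(-1))

    fwd_nll = next_token_nll(fwd_lm, tokens)
    # Reversing the sequence turns "predict the previous token" into ordinary
    # next-token prediction for the backward direction.
    bwd_nll = next_token_nll(bwd_lm, torch.flip(tokens, dims=[1]))
    return fwd_nll + bwd_nll    # minimizing this maximizes the joint log likelihood
```

Tying could be realized, for example, by assigning the same `nn.Embedding` and `nn.Linear` modules to both directions' models before training (e.g. `bwd_lm.embed = fwd_lm.embed`).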
3.2 ELMo
ELMo is a task specific combination of the intermediate layer representations in the biLM. For each token $t_k$, an $L$-layer biLM computes a set of $2L + 1$ representations

$$R_k = \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L\} = \{h_{k,j}^{LM} \mid j = 0, \ldots, L\},$$

where $h_{k,0}^{LM}$ is the token layer and $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$, for each biLSTM layer.
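To make the bookkeeping concrete, the following sketch (a hypothetical helper, assuming the per-layer forward and backward hidden states have already been collected from the two LSTMs) stacks the $2L + 1$ representations for every position:

```python
import torch

def collect_bilm_layers(x_lm, fwd_layers, bwd_layers):
    """Stack the 2L + 1 biLM representations into h_{k,j}^{LM}, j = 0..L.

    x_lm:       (batch, N, 2*d) token layer, here assumed already projected or
                duplicated to match the width of the concatenated LSTM states
    fwd_layers: list of L tensors, each (batch, N, d), forward LSTM outputs
    bwd_layers: list of L tensors, each (batch, N, d), backward LSTM outputs
    Returns:    (L + 1, batch, N, 2*d)
    """
    layers = [x_lm]                                       # j = 0: token layer
    for fwd, bwd in zip(fwd_layers, bwd_layers):
        layers.append(torch.cat([fwd, bwd], dim=-1))      # [h_fwd ; h_bwd]
    return torch.stack(layers, dim=0)
```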
For inclusion in a downstream model, ELMo collapses all layers in $R$ into a single vector, $\mathrm{ELMo}_k = E(R_k; \Theta_e)$. In the simplest case, ELMo just selects the top layer, $E(R_k) = h_{k,L}^{LM}$, as in TagLM (Peters et al., 2017) and CoVe (McCann et al., 2017). More generally, we compute a task specific weighting of all biLM layers:

$$\mathrm{ELMo}_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}. \quad (1)$$
In (1), $s^{task}$ are softmax-normalized weights and the scalar parameter $\gamma^{task}$ allows the task model to scale the entire ELMo vector. $\gamma$ is of practical importance to aid the optimization process (see supplemental material for details). Considering that the activations of each biLM layer have a different distribution, in some cases it also helped to apply layer normalization (Ba et al., 2016) to each biLM layer before weighting.
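Equation (1) reduces to a softmax over $L + 1$ learned scalars followed by a learned scale. Below is a minimal PyTorch sketch of this task-specific mixing, with the optional layer normalization behind a flag; the class and parameter names are mine, not the paper's.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """ELMo^{task}_k = gamma * sum_j softmax(w)_j * h_{k,j}^{LM}  (Eq. 1)."""

    def __init__(self, num_layers, dim, do_layer_norm=False):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # -> s^{task}
        self.gamma = nn.Parameter(torch.ones(1))                     # gamma^{task}
        self.do_layer_norm = do_layer_norm
        if do_layer_norm:
            # optional per-layer normalization before weighting
            self.layer_norm = nn.LayerNorm(dim)

    def forward(self, layers):
        # layers: (L+1, batch, N, dim) stacked biLM representations
        s = torch.softmax(self.scalar_weights, dim=0)   # softmax-normalized weights
        if self.do_layer_norm:
            layers = self.layer_norm(layers)            # normalizes the last dimension
        mixed = (s.view(-1, 1, 1, 1) * layers).sum(dim=0)
        return self.gamma * mixed
```

A downstream model would create one such mix per task, e.g. `mix = ScalarMix(num_layers=L + 1, dim=2 * d)`, and apply it to the stacked tensor from the previous sketch.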
3.3 Using biLMs for supervised NLP tasks
Given a pre-trained biLM and a supervised architecture for a target NLP task, it is a simple process to use the biLM to improve the task model. We simply run the biLM and record all of the layer representations for each word. Then, we let the end task model learn a linear combination of these representations, as described below.
First consider the lowest layers of the supervised model without the biLM. Most supervised NLP models share a common architecture at the lowest layers, allowing us to add ELMo in a consistent, unified manner. Given a sequence of tokens $(t_1, \ldots, t_N)$, it is standard to form a context-independent token representation $x_k$ for each token position using pre-trained word embeddings and optionally character-based representations. Then, the model forms a context-sensitive representation $h_k$, typically using either bidirectional RNNs, CNNs, or feed forward networks.

To add ELMo to the supervised model, we first freeze the weights of the biLM and then concatenate the ELMo vector $\mathrm{ELMo}_k^{task}$ with $x_k$ and pass the ELMo enhanced representation $[x_k; \mathrm{ELMo}_k^{task}]$ into the task RNN.
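Putting the pieces together, the recipe above (freeze the biLM, mix its layers, concatenate with $x_k$, and feed the task RNN) might look like the following sketch; `bilm`, `scalar_mix`, `word_embed`, and `task_rnn` are placeholder modules, and the biLM's output format is an assumption.

```python
import torch
import torch.nn as nn

class ELMoEnhancedTagger(nn.Module):
    """Downstream model that concatenates ELMo with its own token representation x_k."""

    def __init__(self, bilm, scalar_mix, word_embed, task_rnn):
        super().__init__()
        self.bilm = bilm
        for p in self.bilm.parameters():           # freeze the pre-trained biLM
            p.requires_grad = False
        self.scalar_mix = scalar_mix               # trainable s^{task}, gamma^{task}
        self.word_embed = word_embed               # produces the task's own x_k
        self.task_rnn = task_rnn                   # e.g. a biLSTM producing h_k

    def forward(self, tokens):
        with torch.no_grad():                      # record all biLM layer activations
            bilm_layers = self.bilm(tokens)        # assumed shape (L+1, batch, N, dim)
        elmo = self.scalar_mix(bilm_layers)        # task-specific layer combination
        x = self.word_embed(tokens)                # context-independent x_k
        enhanced = torch.cat([x, elmo], dim=-1)    # [x_k ; ELMo^{task}_k]
        h, _ = self.task_rnn(enhanced)             # context-sensitive h_k
        return h
```

Only the scalar mixing weights, $\gamma^{task}$, and the task model's own parameters receive gradients here; the biLM activations are treated as fixed features.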