of the same word. Specifically, for $N$ different sentences where a word $w$ is present, ELMo generates $N$ different representations of $w$, i.e., $w_1, w_2, \dots, w_N$.
The mechanism of ELMo is based on the representation obtained from a bidirectional language model. A bidirectional language model (biLM) consists of two language models (LM): 1) a forward LM and 2) a backward LM. A forward LM takes an input representation $x_k^{LM}$ for each $k$-th token and passes it through $L$ layers of forward LSTMs to get the representations $\overrightarrow{h}_{k,j}^{LM}$, where $j = 1, \dots, L$. Each of these representations, being a hidden representation of a recurrent neural network, is context dependent. A forward LM can be seen as a method to model the joint probability of a sequence of tokens: $p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})$. At timestep $k-1$, the forward LM predicts the next token $t_k$ given the previously observed tokens $t_1, t_2, \dots, t_{k-1}$. This is typically achieved by placing a softmax layer on top of the final LSTM layer of the forward LM. On the other hand, a backward LM models the same joint probability of the sequence by predicting the previous token given the future tokens: $p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)$. In other words, a backward LM is similar to a forward LM but processes the sequence in reverse order. Training the biLM involves jointly maximizing the log-likelihood of both directions. Finally, the hidden representations from both LMs are concatenated to compose the final token vectors [42].
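To make the two directions concrete, the following is a minimal sketch of a biLM in PyTorch. The class and variable names (BiLM, fwd_lstm, bwd_lstm) are illustrative assumptions for this example; the original ELMo model differs in several details, such as its character-based input representations and the fact that it exposes the hidden states of every LSTM layer rather than only the top one.

```python
# Minimal biLM sketch (illustrative; not the original ELMo implementation).
import torch
import torch.nn as nn

class BiLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # x_k^{LM}
        # Forward LM: reads the sequence left-to-right.
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        # Backward LM: same architecture, fed the reversed sequence.
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.fwd_softmax = nn.Linear(hidden_dim, vocab_size)  # predicts t_k from t_1..t_{k-1}
        self.bwd_softmax = nn.Linear(hidden_dim, vocab_size)  # predicts t_k from t_{k+1}..t_N

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        x = self.embed(tokens)
        h_fwd, _ = self.fwd_lstm(x)                   # context from the left
        h_bwd, _ = self.bwd_lstm(torch.flip(x, [1]))  # context from the right
        h_bwd = torch.flip(h_bwd, [1])                # restore the original token order
        return self.fwd_softmax(h_fwd), self.bwd_softmax(h_bwd)

# Training maximizes the log-likelihood of both directions, e.g. by summing a
# cross-entropy loss between fwd_logits[:, :-1] and tokens[:, 1:] and a
# cross-entropy loss between bwd_logits[:, 1:] and tokens[:, :-1].
```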
For each token, ELMo extracts the intermediate layer representations from the biLM and performs a linear combination based on the given downstream task. An $L$-layer biLM contains a set of $2L + 1$ representations, as shown below:
$$ R_k = \left\{ x_k^{LM},\, \overrightarrow{h}_{k,j}^{LM},\, \overleftarrow{h}_{k,j}^{LM} \;\middle|\; j = 1, \dots, L \right\} = \left\{ h_{k,j}^{LM} \;\middle|\; j = 0, \dots, L \right\} \quad (6) $$
Here, $h_{k,0}^{LM}$ is the token representation at the lowest level. One can use either character or word embeddings to initialize $h_{k,0}^{LM}$. For the other values of $j$,
$$ h_{k,j}^{LM} = \left[ \overrightarrow{h}_{k,j}^{LM};\, \overleftarrow{h}_{k,j}^{LM} \right] \quad \forall j = 1, \dots, L. \quad (7) $$
ELMo flattens all the layers in $R_k$ into a single vector such that
$$ \mathrm{ELMo}_k^{task} = E\left(R_k; \Theta^{task}\right) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM} \quad (8) $$
In Eq. 8, $s_j^{task}$ are the softmax-normalized weights used to combine the representations of the different layers, and $\gamma^{task}$ is a hyperparameter which aids optimization and task-specific scaling of the ELMo representation. ELMo produces different word representations for the same word in different sentences. According to Peters et al. [41], it is always beneficial to combine ELMo word representations with standard global word representations such as GloVe and Word2Vec.
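As a concrete reading of Eq. 8, the sketch below implements the layer mixing in PyTorch. The module name ScalarMix, its parameter names, and the assumed tensor layout are illustrative choices for this example, not a specific existing implementation.

```python
# Sketch of the task-specific combination in Eq. 8 (illustrative only).
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Computes gamma^{task} * sum_j softmax(s^{task})_j * h_{k,j}^{LM}."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # s^{task}, softmax-normalized below
        self.gamma = nn.Parameter(torch.ones(1))        # gamma^{task}, task-specific scaling

    def forward(self, layer_reps):
        # layer_reps: (L + 1, batch, seq_len, dim), i.e. h_{k,0} .. h_{k,L} for every token k
        weights = torch.softmax(self.s, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed

# Usage: elmo_task = ScalarMix(num_layers=L + 1)(stacked_bilm_layers)
```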
Of late, there has been a surge of interest in pre-trained language models for a myriad of natural language tasks [43]. Language modeling is chosen as the pre-training objective as it is widely considered to incorporate multiple traits of natural language understanding and generation. A good language model must learn complex characteristics of language involving syntactic properties as well as semantic coherence. Thus, it is believed that unsupervised training on such objectives would infuse better linguistic knowledge into the networks than random initialization. The generative pre-training and discriminative fine-tuning procedure is also desirable because the pre-training is unsupervised and does not require any manual labeling.
Radford et al. [44] proposed a similar pre-trained model, OpenAI-GPT, by adapting the Transformer (see section IV-E).
Recently, Devlin et al. [45] proposed BERT, which utilizes a transformer network to pre-train a language model for extracting contextual word embeddings. Unlike ELMo and OpenAI-GPT, BERT uses different pre-training tasks for language modeling.
In one of the tasks, BERT randomly masks a percentage of words in the sentences and only predicts those masked words. In
the other task, BERT predicts the next sentence given a sentence. This task in particular tries to model the relationship between two sentences, which is supposedly not captured by traditional bidirectional language models. Consequently, this particular pre-training scheme helps BERT to outperform state-of-the-art techniques by a large margin on key NLP tasks such as QA and Natural Language Inference (NLI), where understanding the relation between two sentences is very important. We discuss the impact
of these proposed models and the performance achieved by them in section VIII-I.
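The masked-word task can be made concrete with a short sketch. The 15% masking rate and the 80/10/10 replacement scheme below follow the BERT paper; the function name, the toy vocabulary, and the use of word-level (rather than WordPiece-level) tokens are simplifying assumptions for illustration.

```python
# Illustrative BERT-style masking for the masked-LM pre-training task.
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat"]  # stand-in for the real vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                           # only masked positions are predicted
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                      # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(TOY_VOCAB)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens("the cat sat on the mat".split()))
```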
The described approaches for contextual word embeddings promise better-quality representations for words. The pre-trained deep language models also provide a head start for downstream tasks in the form of transfer learning. This approach has been extremely popular in computer vision tasks. Whether similar trends will emerge in the NLP community, with researchers and practitioners preferring such models over traditional variants, remains to be seen.
III. CONVOLUTIONAL NEURAL NETWORKS
Following the popularization of word embeddings and their ability to represent words in a distributed space, the need arose for an effective feature function that extracts higher-level features from constituent words or n-grams. These abstract features
would then be used for numerous NLP tasks such as sentiment analysis, summarization, machine translation, and question
answering (QA). CNNs turned out to be the natural choice given their effectiveness in computer vision tasks [46, 47, 48].