End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF
Xuezhe Ma and Eduard Hovy
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
xuezhem@cs.cmu.edu, ehovy@cmu.edu
Abstract
State-of-the-art sequence labeling systems traditionally require large amounts of task-specific knowledge in the form of hand-crafted features and data pre-processing. In this paper, we introduce a novel neural network architecture that benefits from both word- and character-level representations automatically, by using a combination of bidirectional LSTM, CNN and CRF. Our system is truly end-to-end, requiring no feature engineering or data pre-processing, thus making it applicable to a wide range of sequence labeling tasks. We evaluate our system on two datasets for two sequence labeling tasks: the Penn Treebank WSJ corpus for part-of-speech (POS) tagging and the CoNLL 2003 corpus for named entity recognition (NER). We obtain state-of-the-art performance on both datasets: 97.55% accuracy for POS tagging and 91.21% F1 for NER.
1 Introduction
Linguistic sequence labeling, such as part-of-speech (POS) tagging and named entity recognition (NER), is one of the first stages in deep language understanding, and its importance has been well recognized in the natural language processing (NLP) community. NLP systems for tasks like syntactic parsing (Nivre and Scholz, 2004; McDonald et al., 2005; Koo and Collins, 2010; Ma and Zhao, 2012a; Ma and Zhao, 2012b; Chen and Manning, 2014; Ma and Hovy, 2015) and entity coreference resolution (Ng, 2010; Ma et al., 2016) are becoming more sophisticated, in part because they utilize the output of POS tagging or NER systems.
Most traditional high-performance sequence labeling models are linear statistical models, including Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015), which rely heavily on hand-crafted features and task-specific resources. For example, English POS taggers benefit from carefully designed word spelling features, while orthographic features and external resources such as gazetteers are widely used in NER. However, such task-specific knowledge is costly to develop (Ma and Xia, 2014), making sequence labeling models difficult to adapt to new tasks or new domains.
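To make the notion of hand-crafted features concrete, the sketch below shows the kind of spelling, orthographic, and gazetteer features a traditional CRF-based tagger might consume (the feature set and the toy gazetteer are illustrative inventions, not any particular system's templates):

GAZETTEER = {"london", "paris", "pittsburgh"}  # toy location gazetteer

def word_features(word):
    # Spelling and orthographic clues of the kind a linear CRF/HMM
    # tagger would receive as string/binary features for each token.
    return {
        "lower": word.lower(),
        "suffix3": word[-3:],                   # word spelling feature
        "is_capitalized": word[:1].isupper(),   # orthographic feature
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "in_gazetteer": word.lower() in GAZETTEER,  # external resource
    }

print(word_features("Pittsburgh"))
# {'lower': 'pittsburgh', 'suffix3': 'rgh', 'is_capitalized': True, ...}

Designing and porting such feature templates to new languages or domains is precisely the development cost noted above.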
In the past few years, non-linear neural networks with distributed word representations, also known as word embeddings, as input have been broadly applied to NLP problems with great success. Collobert et al. (2011) proposed a simple but effective feed-forward neural network that independently classifies the label of each word using contexts within a fixed-size window.
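As a rough illustration of this window approach, the following is a minimal sketch (written in PyTorch; the framework and all layer sizes are assumptions of this example, not Collobert et al.'s exact model):

import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=50, window=5,
                 hidden_dim=300, num_tags=45):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(window * emb_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_tags),
        )

    def forward(self, windows):
        # windows: (batch, window) word ids centered on the target word
        e = self.embed(windows)        # (batch, window, emb_dim)
        return self.ff(e.flatten(1))   # (batch, num_tags) label scores

tagger = WindowTagger()
scores = tagger(torch.randint(0, 10000, (8, 5)))  # 8 words, 5-word windows
print(scores.shape)  # torch.Size([8, 45])

Each word is classified independently from its window, so no information about neighboring label decisions is shared.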
Recently, recurrent neural networks (RNN) (Goller and Kuchler, 1996), together with variants such as long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Gers et al., 2000) and the gated recurrent unit (GRU) (Cho et al., 2014), have shown great success in modeling sequential data. Several RNN-based neural network models have been proposed to solve sequence labeling tasks such as speech recognition (Graves et al., 2013), POS tagging (Huang et al., 2015) and NER (Chiu and Nichols, 2015; Hu et al., 2016), achieving competitive performance against traditional models. However, even systems that have utilized distributed representations as inputs have used them to augment, rather than replace, hand-crafted features (e.g., word spelling and capitalization patterns), and their performance drops rapidly when the models depend solely on neural embeddings.
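For concreteness, the following is a minimal sketch of the end-to-end architecture family summarized in the abstract: a character-level CNN builds word representations that are concatenated with word embeddings and fed into a bidirectional LSTM. The CRF layer is reduced here to per-token emission scores for brevity, and all dimensions are illustrative assumptions rather than the paper's actual settings:

import torch
import torch.nn as nn

class BiLSTMCNNEncoder(nn.Module):
    def __init__(self, vocab_size=10000, char_size=100, num_tags=17,
                 word_dim=100, char_dim=30, char_filters=30, lstm_dim=200):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.char_embed = nn.Embedding(char_size, char_dim)
        # CNN over the characters of each word, max-pooled to one vector
        self.char_cnn = nn.Conv1d(char_dim, char_filters,
                                  kernel_size=3, padding=1)
        self.lstm = nn.LSTM(word_dim + char_filters, lstm_dim,
                            bidirectional=True, batch_first=True)
        # Per-token emission scores; a CRF would add transition scores
        self.emit = nn.Linear(2 * lstm_dim, num_tags)

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, max_word_len)
        b, t, c = chars.shape
        ce = self.char_embed(chars.view(b * t, c)).transpose(1, 2)
        cw = self.char_cnn(ce).max(dim=2).values.view(b, t, -1)
        x = torch.cat([self.word_embed(words), cw], dim=-1)
        h, _ = self.lstm(x)
        return self.emit(h)  # (batch, seq_len, num_tags)

enc = BiLSTMCNNEncoder()
scores = enc(torch.randint(0, 10000, (2, 6)),
             torch.randint(0, 100, (2, 6, 12)))
print(scores.shape)  # torch.Size([2, 6, 17])

In the full model, a CRF on top of these emission scores adds tag-transition scores and decodes the best label sequence jointly rather than labeling each token independently.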