tried to predict a tagging sequence. Therefore, they
still need to design tagging schemas for different
NER subtasks.
Span-level classification
When applying the sequence labelling method to the nested NER and discontinuous NER subtasks, the tagging becomes complex (Straková et al., 2019; Metke-Jimenez and Karimi, 2016) or multi-level (Ju et al., 2018; Fisher and Vlachos, 2019; Shibuya and Hovy, 2020).
Therefore, the second line of work directly conducts span-level classification. The main difference among publications in this line of work is how they obtain the spans. Finkel and Manning (2009) regarded parsing nodes as spans. Xu et al. (2017); Luan et al. (2019); Yamada et al. (2020); Li et al. (2020b); Yu et al. (2020); Wang et al. (2020a) tried to enumerate all spans. Following Lu and Roth (2015), hypergraph methods, which can effectively represent exponentially many possible nested mentions in a sentence, have been extensively studied for NER tasks (Katiyar and Cardie, 2018; Wang and Lu, 2018; Muis and Lu, 2016).
Combined token-level and span-level classification
To avoid enumerating all possible spans and to incorporate entity boundary information into the model, Wang and Lu (2019); Zheng et al. (2019); Lin et al. (2019); Wang et al. (2020b); Luo and Zhao (2020) proposed combining token-level classification and span-level classification.
2.2 Sequence-to-Sequence Models
The Seq2Seq framework has long been studied and adopted in NLP (Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015; Vaswani et al., 2017; Vinyals et al., 2015). Gillick et al. (2016) proposed a Seq2Seq model that predicts the entity's start position, span length and label for the NER task. Recently, the substantial performance gains achieved by PTMs (pre-trained models) (Qiu et al., 2020; Peters et al., 2018; Devlin et al., 2019; Dai et al., 2021; Yan et al., 2020) have attracted several attempts to pre-train Seq2Seq models (Song et al., 2019; Lewis et al., 2020; Raffel et al., 2020). We mainly focus on the newly proposed BART (Lewis et al., 2020) model because it achieves better performance than MASS (Song et al., 2019). Moreover, the SentencePiece tokenization used in T5 (Raffel et al., 2020) can produce different tokenizations for the same token, which makes it hard to generate pointer indexes for entity extraction.
BART is composed of several transformer encoder and decoder layers, like the transformer model used in machine translation (Vaswani et al., 2017). BART's pre-training task is to recover corrupted text into the original text: the encoder takes the corrupted sentence as input, and the decoder recovers the original sentence. BART has base and large versions. The base version has 6 encoder layers and 6 decoder layers, while the large version has 12 of each. Therefore, the number of parameters is similar to that of its equivalently sized BERT.⁵
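For concreteness, the following minimal sketch (not part of the paper; it assumes the Hugging Face Transformers library and the publicly released facebook/bart-base and facebook/bart-large checkpoints) shows how the two configurations can be loaded and their parameter counts inspected:

```python
# Minimal sketch, assuming the "transformers" package and the public
# "facebook/bart-base" / "facebook/bart-large" checkpoints.
from transformers import BartModel

base = BartModel.from_pretrained("facebook/bart-base")    # 6 encoder + 6 decoder layers
large = BartModel.from_pretrained("facebook/bart-large")  # 12 encoder + 12 decoder layers

# Parameter counts are roughly comparable to the equivalently sized BERT,
# though about 10% larger because of the encoder-decoder cross-attention.
for name, model in [("bart-base", base), ("bart-large", large)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```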
3 Proposed Method
In this section, we first introduce the task formulation; then we describe how we use the Seq2Seq model with the pointer mechanism to generate the entity index sequences; after that, we present the detailed formulation of our model with BART.
3.1 NER Task
The three kinds of NER tasks can all be formulated as follows: given an input sentence of $n$ tokens $X = [x_1, x_2, \ldots, x_n]$, the target sequence is $Y = [s_{11}, e_{11}, \ldots, s_{1j}, e_{1j}, t_1, \ldots, s_{i1}, e_{i1}, \ldots, s_{ik}, e_{ik}, t_i]$, where $s, e$ are the start and end indexes of a span. Since an entity may contain one (for flat and nested NER) or more than one (for discontinuous NER) spans, each entity is represented as $[s_{i1}, e_{i1}, \ldots, s_{ij}, e_{ij}, t_i]$, where $t_i$ is the entity tag index. We use $G = [g_1, \ldots, g_l]$ to denote the entity tag tokens (such as “Person”, “Location”, etc.), where $l$ is the number of entity tags. We make $t_i \in (n, n+l]$; the shift by $n$ ensures that $t_i$ is not confused with the pointer indexes, which lie in the range $[1, n]$.
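As a concrete illustration of this formulation, the following sketch builds the target sequence $Y$ for a hypothetical sentence and entity set; the example sentence, entities, and helper function are ours, not the paper's:

```python
# Hypothetical example of the target-sequence construction described above.
tokens = ["Barack", "Obama", "visited", "New", "York"]   # n = 5
tags = ["Person", "Location"]                            # G = [g_1, ..., g_l], l = 2

def build_target(entities, n, tags):
    """Build Y = [s_11, e_11, ..., t_1, ..., s_ik, e_ik, t_i].

    Each entity is (spans, tag), where spans is a list of 1-based
    inclusive (start, end) pointer index pairs.  Tag indexes are shifted
    by n so they fall in (n, n + l] and never collide with the pointer
    indexes, which lie in [1, n].
    """
    y = []
    for spans, tag in entities:
        for start, end in spans:
            y.extend([start, end])
        y.append(n + 1 + tags.index(tag))
    return y

entities = [([(1, 2)], "Person"),      # "Barack Obama"
            ([(4, 5)], "Location")]    # "New York"
print(build_target(entities, len(tokens), tags))  # -> [1, 2, 6, 4, 5, 7]
```

A discontinuous entity would simply contribute more than one (start, end) pair before its tag index.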
3.2 Seq2Seq for Unified Decoding
Since we formulate the NER task in a generative way, we can view the NER task as the following equation:

$$P(Y \mid X) = \prod_{t=1}^{m} P(y_t \mid X, Y_{<t}) \qquad (1)$$

where $m$ is the length of the target sequence $Y$ and $y_0$ is the special “start of sentence” control token.
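To make the factorization in Eq. (1) concrete, here is a minimal sketch of how the probability of a target index sequence would be accumulated step by step; the model object and its next_token_distribution interface are assumptions for illustration, not the paper's actual API:

```python
# Sketch of Eq. (1): P(Y | X) = prod_{t=1}^{m} P(y_t | X, Y_<t).
# `model.next_token_distribution` is a hypothetical interface returning a
# list of probabilities over the next index (pointer positions and tags).
def sequence_probability(model, x, y, bos_index=0):
    prefix = [bos_index]            # y_0: the "start of sentence" control token
    prob = 1.0
    for y_t in y:                   # y = [y_1, ..., y_m]
        # Distribution over the next index, conditioned on the source
        # sentence X and the already-decoded prefix Y_<t.
        step_probs = model.next_token_distribution(x, prefix)
        prob *= step_probs[y_t]
        prefix.append(y_t)
    return prob
```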
We use the Seq2Seq framework with the pointer
mechanism to tackle this task. Therefore, our
model consists of two components:
⁵Because of the cross-attention between the encoder and decoder, the number of parameters of BART is about 10% larger than that of its equivalently sized BERT (Lewis et al., 2020).