MarkBERT: Marking Word Boundaries Improves Chinese BERT
Linyang Li^2*, Yong Dai^1, Duyu Tang^1†, Zhangyin Feng^1, Cong Zhou^1,
Xipeng Qiu^2, Zenglin Xu^3, Shuming Shi^1
^1 Tencent AI Lab, China   ^2 Fudan University   ^3 PengCheng Laboratory
{yongdai, brannzhou, aifeng, duyutang}@tencent.com,
zenglin@gmail.com, {linyangli19, xpqiu}@fudan.edu.cn
* Work done during internship at Tencent AI Lab.
† Corresponding author.
Abstract
We present a Chinese BERT model dubbed MarkBERT that uses word information. Existing word-based BERT models regard words as basic units; however, due to BERT's vocabulary limit, they only cover high-frequency words and fall back to the character level when encountering out-of-vocabulary (OOV) words. Different from existing works, MarkBERT keeps the vocabulary at the Chinese character level and inserts boundary markers between contiguous words. This design enables the model to handle any word in the same way, whether or not it is an OOV word. Besides, our model has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which are complementary to traditional character- and sentence-level pre-training tasks; second, it can easily incorporate richer semantics such as the POS tags of words by replacing generic markers with POS tag-specific markers. MarkBERT pushes the state of the art of Chinese named entity recognition from 95.4% to 96.5% on the MSRA dataset and from 82.8% to 84.2% on the OntoNotes dataset, respectively. Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
1 Introduction
Chinese words can be composed of multiple Chinese characters. For instance, the word 地球 (earth) is made up of two characters, 地 (ground) and 球 (ball). However, there are no delimiters (i.e., spaces) between words in written Chinese sentences. Traditionally, word segmentation is an important first step for Chinese natural language processing tasks (Chang et al., 2008). However, with the rise of pre-trained models (Devlin et al., 2018), Chinese BERT models have come to be dominated by character-based ones (Cui et al., 2019a; Sun et al., 2019; Cui et al., 2020; Sun et al., 2021b,a), in which a sentence is represented as a sequence of characters. There have been several attempts at building Chinese BERT models that consider word information. Existing studies tokenize a word as a basic unit (Su, 2020), as multiple characters (Cui et al., 2019a), or as a combination of both (Zhang and Li, 2020; Lai et al., 2021; Guo et al., 2021). However, due to the limited vocabulary size of BERT, these models only learn representations for a limited number (e.g., 40K) of high-frequency words. Rare words below the frequency threshold are tokenized as separate characters, so their word-level information is neglected.
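The character fallback described above can be illustrated with a toy example. The vocabulary and segmentation below are hypothetical and do not correspond to the tokenizer of any released model; they only show how word information is lost for rare words:

```python
# Toy illustration of the OOV fallback in word-based vocabularies
# (hypothetical vocabulary; not the tokenizer of any cited model).
WORD_VOCAB = {"地球", "中国"}  # only high-frequency words are in the vocabulary

def tokenize(words):
    tokens = []
    for word in words:
        if word in WORD_VOCAB:
            tokens.append(word)        # frequent word: kept as one unit
        else:
            tokens.extend(list(word))  # rare/OOV word: falls back to characters
    return tokens

print(tokenize(["地球", "自转"]))  # ['地球', '自', '转'] -- word info of "自转" is lost
```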
In this work, we present a simple framework, MarkBERT, that considers Chinese word information. Instead of regarding words as basic units, we use character-level tokenization and inject word information by inserting special markers between contiguous words. The occurrence of a marker gives the model a hint that the preceding character is the end of one word and the following character is the beginning of another. This simple design has the following advantages. First, it avoids the OOV problem, since it treats common words and rare words (even words never seen in the pre-training data) in the same way. Second, the introduction of markers allows us to design word-level pre-training tasks (such as the replaced word detection task illustrated in Section 2), which are complementary to traditional character-level pre-training tasks like masked language modeling and sentence-level pre-training tasks like next sentence prediction. Third, the model can easily be extended to inject richer word-level semantics.
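As a concrete illustration, the following minimal sketch shows how boundary markers can be inserted once a sentence has been segmented into words. The "[S]" marker string is a placeholder for the special marker token, and the segmentation is assumed to come from an off-the-shelf segmenter; neither is meant to reproduce our exact implementation:

```python
# Minimal sketch of boundary-marker insertion between contiguous words.
# "[S]" is a placeholder marker symbol; the word segmentation is assumed given.

def insert_markers(segmented_words, marker="[S]"):
    """Convert segmented words into character tokens, placing a marker
    between every pair of contiguous words."""
    tokens = []
    for i, word in enumerate(segmented_words):
        tokens.extend(list(word))          # keep a character-level vocabulary
        if i < len(segmented_words) - 1:
            tokens.append(marker)          # marker: previous char ends a word
    return tokens

# "地球是圆的" segmented as ["地球", "是", "圆的"]
print(insert_markers(["地球", "是", "圆的"]))
# ['地', '球', '[S]', '是', '[S]', '圆', '的']
```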
In the pre-training stage, we train our model with two pre-training tasks. The first task is masked language modeling. We also mask markers such that word boundary knowledge can be learned. The second task is replaced word detection. We replace