MarkBERT: Marking Word Boundaries Improves Chinese BERT
Linyang Li^2*, Yong Dai^1, Duyu Tang^1†, Zhangyin Feng^1, Cong Zhou^1,
Xipeng Qiu^2, Zenglin Xu^3, Shuming Shi^1
^1 Tencent AI Lab, China   ^2 Fudan University   ^3 PengCheng Laboratory
{yongdai, brannzhou, aifeng, duyutang}@tencent.com,
zenglin@gmail.com, {linyangli19, xpqiu}@fudan.edu.cn
* Work done during internship at Tencent AI Lab.
† Corresponding author.
Abstract
We present a Chinese BERT model dubbed MarkBERT that uses word information. Existing word-based BERT models regard words as basic units; however, due to BERT's vocabulary limit, they only cover high-frequency words and fall back to the character level when encountering out-of-vocabulary (OOV) words. Different from existing works, MarkBERT keeps the vocabulary at the Chinese character level and inserts boundary markers between contiguous words. This design enables the model to handle any word in the same way, whether or not it is an OOV word. Besides, our model has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which are complementary to traditional character- and sentence-level pre-training tasks; second, it can easily incorporate richer semantics such as the POS tags of words by replacing generic markers with POS tag-specific markers. MarkBERT pushes the state of the art of Chinese named entity recognition from 95.4% to 96.5% on the MSRA dataset and from 82.8% to 84.2% on the OntoNotes dataset, respectively. Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
1 Introduction
Chinese words can be composed of multiple Chinese characters. For instance, the word 地球 (earth) is made up of two characters, 地 (ground) and 球 (ball). However, there are no delimiters (i.e., spaces) between words in written Chinese sentences. Traditionally, word segmentation is an important first step for Chinese natural language processing tasks (Chang et al., 2008). However, with the rise of pre-trained models (Devlin et al., 2018), Chinese BERT models have come to be dominated by character-based ones (Cui et al., 2019a; Sun et al., 2019; Cui et al., 2020; Sun et al., 2021b,a), in which a sentence is represented as a sequence of characters. There have been several attempts at building Chinese BERT models that consider word information. Existing studies tokenize a word as a basic unit (Su, 2020), as multiple characters (Cui et al., 2019a), or as a combination of both (Zhang and Li, 2020; Lai et al., 2021; Guo et al., 2021). However, due to the limited vocabulary size of BERT, these models only learn representations for a limited number (e.g., 40K) of high-frequency words. Rare words below the frequency threshold are tokenized as separate characters, so their word-level information is neglected.
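The character fallback described above can be illustrated with a toy example. The vocabulary and segmentation below are hypothetical and do not correspond to the tokenizer of any released model; they only show how word information is lost for rare words:

```python
# Toy illustration of the OOV fallback in word-based vocabularies
# (hypothetical vocabulary; not the tokenizer of any cited model).
WORD_VOCAB = {"地球", "中国"}  # only high-frequency words are in the vocabulary

def tokenize(words):
    tokens = []
    for word in words:
        if word in WORD_VOCAB:
            tokens.append(word)        # frequent word: kept as one unit
        else:
            tokens.extend(list(word))  # rare/OOV word: falls back to characters
    return tokens

print(tokenize(["地球", "自转"]))  # ['地球', '自', '转'] -- word info of "自转" is lost
```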
In this work, we present a simple framework, MarkBERT, that considers Chinese word information. Instead of regarding words as basic units, we use character-level tokenization and inject word information by inserting special markers between contiguous words. The occurrence of a marker gives the model a hint that the preceding character is the end of one word and the following character is the beginning of another. This simple design has the following advantages. First, it avoids the OOV problem, since it treats common words and rare words (even words never seen in the pre-training data) in the same way. Second, the introduction of markers allows us to design word-level pre-training tasks (such as the replaced word detection task illustrated in Section 2), which are complementary to traditional character-level pre-training tasks like masked language modeling and sentence-level pre-training tasks like next sentence prediction. Third, the model can easily be extended to inject richer word-level semantics.
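As a concrete illustration, the following minimal sketch shows how boundary markers can be inserted once a sentence has been segmented into words. The "[S]" marker string is a placeholder for the special marker token, and the segmentation is assumed to come from an off-the-shelf segmenter; neither is meant to reproduce our exact implementation:

```python
# Minimal sketch of boundary-marker insertion between contiguous words.
# "[S]" is a placeholder marker symbol; the word segmentation is assumed given.

def insert_markers(segmented_words, marker="[S]"):
    """Convert segmented words into character tokens, placing a marker
    between every pair of contiguous words."""
    tokens = []
    for i, word in enumerate(segmented_words):
        tokens.extend(list(word))          # keep a character-level vocabulary
        if i < len(segmented_words) - 1:
            tokens.append(marker)          # marker: previous char ends a word
    return tokens

# "地球是圆的" segmented as ["地球", "是", "圆的"]
print(insert_markers(["地球", "是", "圆的"]))
# ['地', '球', '[S]', '是', '[S]', '圆', '的']
```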
In the pre-training stage, we train our model with two pre-training tasks. The first task is masked language modeling. We also mask markers such that word boundary knowledge can be learned. The second task is replaced word detection. We replace