Lexicon-Based Graph Convolutional Network for Chinese Word Segmentation

Kaiyu Huang, Hao Yu, Junpeng Liu, Wei Liu, Jingxiang Cao, Degen Huang∗
School of Computer Science, Dalian University of Technology
{kaiyuhuang, yuhaodlut, liujunpeng_nlp, liuweidlut}@mail.dlut.edu.cn
{caojx, huangdg}@dlut.edu.cn
Abstract
Precise word boundary information can alleviate lexical ambiguity and thereby improve the performance of natural language processing (NLP) tasks, which makes Chinese word segmentation (CWS) a fundamental task in NLP. With the development of pre-trained language models (PLMs), pre-trained knowledge helps neural methods address the main problems of CWS to a significant extent, and existing methods already achieve high performance on several benchmarks (e.g., Bakeoff-2005). However, recent outstanding studies remain limited by small-scale annotated corpora. To further improve CWS methods based on fine-tuning PLMs, we propose a novel neural framework, LBGCN, which incorporates a lexicon-based graph convolutional network into the Transformer encoder. Experimental results on five benchmarks and four cross-domain datasets show that LBGCN successfully captures information about candidate words and improves performance on the benchmarks (Bakeoff-2005 and CTB6) and the cross-domain datasets (SIGHAN-2010). Further experiments and analyses demonstrate that our framework effectively models the lexicon to enhance basic neural frameworks and strengthens robustness in the cross-domain scenario.
1 Introduction
Neural methods often leverage word-level information to improve the performance of many downstream natural language processing (NLP) tasks, such as text classification and machine translation (Yang et al., 2018). Therefore, word segmentation, which determines word boundaries, is regarded as a prerequisite for most downstream NLP tasks.
∗ Corresponding author.
1 Source code of this paper is available at https://github.com/koukaiu/lbgcn
Unlike most written languages, written Chinese has no explicit delimiters to separate words in the text; for example, the sentence 自然语言处理 ("natural language processing") must be segmented as 自然/语言/处理 before word-level models can use it. Thus, Chinese word segmentation (CWS) is an essential pre-processing step for many Chinese NLP tasks.
With the development of deep learning techniques, recent neural CWS approaches that do not rely heavily on hand-crafted feature engineering have achieved high performance on several benchmark datasets (Cai and Zhao, 2016; Cai et al., 2017; Ma et al., 2018). In particular, recent outstanding studies have exploited the paradigm of applying pre-trained language models (PLMs) to many NLP tasks. Various methods that fine-tune PLMs have made progress on in-domain and cross-domain CWS without much manual effort (Meng et al., 2019; Huang et al., 2020; Tian et al., 2020; Ke et al., 2021).
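To make the fine-tuning paradigm concrete, the sketch below frames CWS as character-level sequence labeling with B/M/E/S tags on top of a Chinese PLM. It is a minimal illustration under stated assumptions, not the LBGCN framework proposed in this paper: it assumes the HuggingFace transformers library, and the model name (bert-base-chinese) and the bmes_tags helper are illustrative choices.

```python
# Minimal sketch (illustrative, not the paper's code): CWS as
# character-level B/M/E/S tagging with a fine-tuned Chinese PLM.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

TAGS = ["B", "M", "E", "S"]  # begin/middle/end of a word, or a single-character word

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(TAGS)
)

def bmes_tags(words):
    """Convert a gold segmentation (a list of words) to per-character tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# One training example: 自然语言处理 segmented as 自然 / 语言 / 处理.
chars = list("自然语言处理")
labels = [TAGS.index(t) for t in bmes_tags(["自然", "语言", "处理"])]

enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
# Align character labels with wordpieces; [CLS]/[SEP] get the ignore index -100.
aligned = [-100 if i is None else labels[i] for i in enc.word_ids()]

loss = model(**enc, labels=torch.tensor([aligned])).loss
loss.backward()  # one fine-tuning step (optimizer update omitted for brevity)
```

At inference time, the argmax tag per character is decoded back into word boundaries: each B...E span and each S tag yields one word.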
Prior research has shown that the main problems of CWS are segmentation ambiguity and out-of-vocabulary (OOV) words (Zhao et al., 2019); for instance, in the sequence 和尚未, the characters can be grouped as 和/尚未 ("and not yet") or as 和尚/未 ("monk not"). With the help of pre-trained knowledge (Devlin et al., 2018; Liu et al., 2019), fine-tuned CWS methods can effectively alleviate these two issues and outperform other neural network architectures, and fine-tuning PLMs has become the mainstream approach for CWS. However, the performance of fine-tuned CWS methods is limited by the scale and quality of annotated CWS corpora. The dependencies between neighboring Chinese characters are diverse, and the linguistic characteristics of Chinese make it hard to build a large-scale annotated corpus; the difficulty of manual annotation restricts the scale and quality of CWS datasets. Besides, direct fine-tuning does not utilize contextual n-grams or other contextual information, which is important in previous model architectures (e.g., BiLSTM and Transformer) (Huang et al., 2015; Ma et al., 2018; Qiu et al., 2020). Methods that fine-tune PLMs may therefore generate segmentation errors because of ambigu-