Lexicon-Based Graph Convolutional Network for Chinese Word Segmentation

Kaiyu Huang, Hao Yu, Junpeng Liu, Wei Liu, Jingxiang Cao, Degen Huang∗
School of Computer Science, Dalian University of Technology
{kaiyuhuang, yuhaodlut, liujunpeng_nlp, liuweidlut}@mail.dlut.edu.cn
{caojx, huangdg}@dlut.edu.cn
Abstract
Precise word boundary information can alleviate lexical ambiguity and thereby improve the performance of natural language processing (NLP) tasks, which makes Chinese word segmentation (CWS) a fundamental task in NLP. With the development of pre-trained language models (PLMs), pre-trained knowledge helps neural methods address the main problems of CWS to a significant extent, and existing methods already achieve high performance on several benchmarks (e.g., Bakeoff-2005). However, recent outstanding studies remain limited by small-scale annotated corpora. To further improve CWS methods based on fine-tuning PLMs, we propose a novel neural framework, LBGCN, which incorporates a lexicon-based graph convolutional network into the Transformer encoder. Experimental results on five benchmarks and four cross-domain datasets show that LBGCN successfully captures information about candidate words and improves performance on the benchmarks (Bakeoff-2005 and CTB6) and the cross-domain datasets (SIGHAN-2010). Further experiments and analyses demonstrate that our framework effectively models the lexicon to enhance basic neural frameworks and strengthens robustness in the cross-domain scenario.
1 Introduction
Neural methods often leverage word-level information to improve the performance of many downstream natural language processing (NLP) tasks, such as text classification and machine translation (Yang et al., 2018). Therefore, word segmentation, which determines word boundaries, is regarded as a prerequisite for most downstream NLP tasks.
∗ Corresponding author.
1 Source code of this paper is available at https://github.com/koukaiu/lbgcn
Unlike most written languages, written Chinese has no explicit delimiters to separate words in the text; for example, the sentence 自然语言处理 ("natural language processing") must be segmented as 自然/语言/处理 before word-level models can use it. Thus, Chinese word segmentation (CWS) is an essential pre-processing step for many Chinese NLP tasks.
With the development of deep learning techniques, recent neural CWS approaches that do not rely heavily on hand-crafted feature engineering have achieved high performance on several benchmark datasets (Cai and Zhao, 2016; Cai et al., 2017; Ma et al., 2018). In particular, recent outstanding studies have exploited the paradigm of applying pre-trained language models (PLMs) to many NLP tasks. Various methods that fine-tune PLMs have made progress on in-domain and cross-domain CWS without much manual effort (Meng et al., 2019; Huang et al., 2020; Tian et al., 2020; Ke et al., 2021).
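To make the fine-tuning paradigm concrete, the sketch below frames CWS as character-level sequence labeling with B/M/E/S tags on top of a Chinese PLM. It is a minimal illustration under stated assumptions, not the LBGCN framework proposed in this paper: it assumes the HuggingFace transformers library, and the model name (bert-base-chinese) and the bmes_tags helper are illustrative choices.

```python
# Minimal sketch (illustrative, not the paper's code): CWS as
# character-level B/M/E/S tagging with a fine-tuned Chinese PLM.
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

TAGS = ["B", "M", "E", "S"]  # begin/middle/end of a word, or a single-character word

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(TAGS)
)

def bmes_tags(words):
    """Convert a gold segmentation (a list of words) to per-character tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# One training example: 自然语言处理 segmented as 自然 / 语言 / 处理.
chars = list("自然语言处理")
labels = [TAGS.index(t) for t in bmes_tags(["自然", "语言", "处理"])]

enc = tokenizer(chars, is_split_into_words=True, return_tensors="pt")
# Align character labels with wordpieces; [CLS]/[SEP] get the ignore index -100.
aligned = [-100 if i is None else labels[i] for i in enc.word_ids()]

loss = model(**enc, labels=torch.tensor([aligned])).loss
loss.backward()  # one fine-tuning step (optimizer update omitted for brevity)
```

At inference time, the argmax tag per character is decoded back into word boundaries: each B...E span and each S tag yields one word.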
Prior research has shown that the main problems of CWS are segmentation ambiguity and out-of-vocabulary (OOV) words (Zhao et al., 2019); for instance, in the sequence 和尚未, the characters can be grouped as 和/尚未 ("and not yet") or as 和尚/未 ("monk not"). With the help of pre-trained knowledge (Devlin et al., 2018; Liu et al., 2019), fine-tuned CWS methods can effectively alleviate these two issues and outperform other neural network architectures, and fine-tuning PLMs has become the mainstream approach for CWS. However, the performance of fine-tuned CWS methods is limited by the scale and quality of annotated CWS corpora. The dependencies between neighboring Chinese characters are diverse, and the linguistic characteristics of Chinese make it hard to build a large-scale annotated corpus; the difficulty of manual annotation restricts the scale and quality of CWS datasets. Besides, direct fine-tuning does not utilize contextual n-grams or other contextual information, which is important in previous model architectures (e.g., BiLSTM and Transformer) (Huang et al., 2015; Ma et al., 2018; Qiu et al., 2020). Methods that fine-tune PLMs may therefore generate segmentation errors because of ambigu-