NLPIR-Driven Construction of a Chinese Chat Corpus and a Manual Correction Strategy


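The paper "The Construction of a kind of Chat Corpus in Chinese Word Segmentation" explores a practical method for building a chat corpus dedicated to Chinese word segmentation. The work combines automatic segmentation with manual correction, aiming to improve the accuracy and efficiency of Chinese text processing. It first explains why chat corpora matter: in applications such as natural language understanding and machine translation, a high-quality corpus is essential for training and tuning models. The authors note that in everyday conversation and social media text, colloquial and highly variable language, together with frequently appearing new words, challenges existing segmentation tools.

To build the corpus, the team used Natural Language Processing Information Retrieval (NLPIR) for automatic segmentation. NLPIR provides a tool for processing large volumes of text automatically, applying predefined rules and algorithms to split the input into an initial word sequence. However, NLPIR may not fully capture the variety of colloquial expression, particularly new words and dialect, so some segmentation errors remain.

To address this, the paper describes a manual correction stage. The researchers categorize NLPIR's errors and identify the parts that need additional annotation or correction. They propose a systematic correction workflow that includes, among other steps, re-analyzing wrongly segmented words, extending the vocabulary to cover colloquial expressions, and adding appropriate punctuation and sentence-break rules. This keeps the efficiency of large-scale processing while improving segmentation accuracy.

The paper also stresses that the study is preliminary: it is an exploration of how to build a chat corpus and a basis for further research on more capable Chinese word segmentation systems. Future work could build on it by combining deep learning and big-data analysis to improve automatic segmentation, and could extend to other areas such as sentiment analysis and text generation. The core contribution is a corpus construction strategy that combines automatic segmentation with manual correction, giving researchers in Chinese natural language processing a more accurate resource for chat text and a practical case to draw on.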

The Construction of a kind of Chat Corpus in Chinese Word Segmentation

Xia Yang
Laboratory of Intelligent Information Processing and Application
Leshan Normal University
Leshan, China
yangxia113@gmail.com

Peng Jin
Laboratory of Intelligent Information Processing and Application
Leshan Normal University
Leshan, China
jandp@pku.edu.cn

Xingyuan Chen
Laboratory of Intelligent Information Processing and Application
Leshan Normal University
Leshan, China
cxyforpaper@gmail.com

Abstract-In this paper, we present a chat corpus for Chinese word segmentation together with its construction process. The corpus is built by combining automatic segmentation with manual correction. The automatic segmentation is performed using Natural Language Processing Information Retrieval (NLPIR). For manual correction, the errors made by NLPIR are categorized and some annotation suggestions are put forward. Because it combines these two methods, our study, which is preliminary, can easily be extended to other chat texts. Moreover, the corpus produced in our work could provide a good standard for research on Chinese word segmentation, especially for dialogue.

Keywords-Chat Corpus; Chinese Word Segmentation; Manual Annotation

I. INTRODUCTION

Chinese word segmentation is one of the most fundamental and important technologies in Chinese information processing. It is both a key technology and a difficult point in automatic text categorization, information retrieval, information filtering, literature indexing, automatic text generation, and similar tasks. Furthermore, it is a key link in other Chinese applications, such as named entity recognition, syntactic analysis, and semantic analysis.

At present, there are three kinds of algorithms for Chinese word segmentation: string matching algorithms [1-3], algorithms based on understanding of the words [4-6], and statistical algorithms [7-9].
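
As an illustration of the string matching family, the sketch below implements forward maximum matching, a commonly used dictionary-based method. It is not taken from the paper or from NLPIR; the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are assumptions made for this example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, always taking the longest lexicon match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon hits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"我们", "喜欢", "聊天", "聊天记录", "记录"}

segmented = forward_max_match("我们喜欢聊天记录", LEXICON)
unknown = forward_max_match("给力", LEXICON)
print(segmented)
print(unknown)
```

A word missing from the lexicon falls back to single characters, which is one reason new network words in chat text are hard for purely dictionary-based methods.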

To test the effectiveness of these algorithms, specific evaluation criteria are needed, so building a manually proofread segmentation corpus becomes very important. However, most existing corpora have been built from news text and communication text [10-11]. Such corpora have serious disadvantages when applied to chat text. Unlike press releases, the language in chat records is mainly informal and colloquial. Abbreviations and misspelled words are used frequently on the Internet and, worse still, newly popular network words and emoticons are mixed in. Building a chat record corpus is therefore particularly important.
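
The paper does not state its evaluation criterion on this page. A common choice, assumed here, is word-level precision, recall, and F1 measured against the proofread corpus. The sketch below compares character spans of a predicted segmentation with those of a gold segmentation; the function names and the sample sentence are hypothetical.

```python
def to_spans(words):
    """Turn a word list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, predicted_words):
    """Return (precision, recall, F1) of predicted words against gold words."""
    gold = to_spans(gold_words)
    pred = to_spans(predicted_words)
    correct = len(gold & pred)  # words whose boundaries match exactly
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical gold annotation and segmenter output for one sentence.
gold = ["我们", "喜欢", "聊天"]
predicted = ["我们", "喜", "欢", "聊天"]

precision, recall, f1 = segmentation_prf(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Comparing spans rather than word strings keeps a word from being counted as correct when it appears at the wrong position in the sentence.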

Using instant messaging software, we collected chat records and analyzed their characteristics in aspects such as word formation, word shape, and morphemes. This analysis brought to light several problems that may affect word segmentation results. In building the corpus, we combine automatic tagging with manual correction. In this way, corpus construction becomes more efficient and the workload of subjective manual annotation is greatly reduced. At the end of our study, we obtained a high-quality chat corpus of 400K sentences (24,000 KB).
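
One way to picture the combination of automatic tagging and manual correction is a second pass that applies reviewer-approved fixes to the segmenter's output. The sketch below is only an assumption about how such a pass could look, not the authors' workflow: `apply_corrections`, the merge rules, and the sample tokens are all hypothetical, and the automatic output is hard-coded instead of coming from NLPIR.

```python
def apply_corrections(tokens, merge_rules):
    """Merge adjacent tokens that annotators marked as a single word."""
    max_len = max((len(rule) for rule in merge_rules), default=1)
    result = []
    i = 0
    while i < len(tokens):
        # Look for the longest run of tokens that matches a merge rule.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + size]) in merge_rules:
                result.append("".join(tokens[i:i + size]))
                i += size
                break
        else:
            # No rule applies here, so keep the token unchanged.
            result.append(tokens[i])
            i += 1
    return result


# Hypothetical automatic output: a new word and an emoticon were split.
auto_tokens = ["今天", "真", "给", "力", ":", ")"]

# Hypothetical rules recorded by annotators during manual correction.
MERGE_RULES = {("给", "力"), (":", ")")}

corrected = apply_corrections(auto_tokens, MERGE_RULES)
print(corrected)
```

Recording corrections as reusable rules means a fix made once can be replayed over the rest of the corpus, which is one plausible way the manual workload could shrink.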

The rest of this paper is organized in three parts: Sections 2, 3, and 4. Section 2 describes how the corpus was built, while Section 3, which

2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology

DOI 10.1109/WI-IAT.2015.196
