The BIT Chinese Spelling Correction System at the CLP 2014 Bake-off: Performance and Strategy


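The paper "Introduction to BIT Chinese Spelling Correction System at CLP 2014 Bake-off" describes the Chinese spelling correction system that a team from the School of Computer Science and Technology, Beijing Institute of Technology (BIT), submitted to task 2 of the CLP 2014 Bake-off. It appears in the Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 179–185. The conference was held in Wuhan, China, on 20-21 October 2014.

The system has two parts. First, an n-gram model retrieves non-words, that is, character strings that are not dictionary words and that word segmentation therefore separates incorrectly. An n-gram model is a statistical language model that scores a sequence of characters or words by how often it occurs in a corpus. Each non-word is then corrected according to word frequency, pronunciation similarity, shape similarity and part-of-speech (POS) tags.

Second, the system handles wrong words, which are valid dictionary words that are incorrect in the sentence. An abnormal POS tag indicates where such a word is located, and dependency relation matching, which considers the word's relations to the other words in the sentence, is used to correct it.

According to the abstract, the experimental results demonstrate the effectiveness of the system. The paper is a useful reference for researchers working on Chinese information processing, natural language processing, and spelling correction.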

Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 179–185,

Wuhan, China, 20-21 October 2014

Introduction to BIT Chinese Spelling Correction System at CLP 2014 Bake-off

Min Liu
School of Computer Science and Technology, Beijing Institute of Technology
luis328@foxmail.com

Ping Jian
School of Computer Science and Technology, Beijing Institute of Technology
pjian@bit.edu.cn

Heyan Huang
School of Computer Science and Technology, Beijing Institute of Technology
hhy63@bit.edu.cn

Abstract

This paper describes the Chinese spelling correction system submitted by BIT to task 2 of the CLP 2014 Bake-off. The system has two main parts: 1) An n-gram model is adopted to retrieve the non-words that are wrongly separated by word segmentation. The non-words are then corrected in terms of word frequency, pronunciation similarity, shape similarity and POS (part-of-speech) tag. 2) For wrong words, abnormal POS tags are used to indicate their location, and dependency relation matching is employed to correct them. Experimental results demonstrate the effectiveness of our system.
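As a rough illustration of the first part, the sketch below corrects a non-word by substituting similar-sounding or similar-looking characters and ranking the resulting dictionary words by frequency. It is a minimal sketch under stated assumptions, not the authors' implementation: the tables `WORD_FREQ`, `SIMILAR_SOUND` and `SIMILAR_SHAPE` and all counts in them are hypothetical toy data, the two misspellings are the examples given in the paper's introduction, and the POS-tag criterion is left out.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical toy resources (assumptions, not the paper's data).
WORD_FREQ: Dict[str, int] = {"身邊": 120, "根據": 300, "生活": 500, "根本": 200}
SIMILAR_SOUND: Dict[str, Set[str]] = {"生": {"身", "聲"}}
SIMILAR_SHAPE: Dict[str, Set[str]] = {"處": {"據"}}


def candidates(non_word: str) -> List[Tuple[str, str]]:
    """Substitute one character at a time and keep results found in the lexicon.

    Each candidate is returned with the kind of similarity that produced it.
    """
    found: List[Tuple[str, str]] = []
    for i, ch in enumerate(non_word):
        for kind, table in (("sound", SIMILAR_SOUND), ("shape", SIMILAR_SHAPE)):
            for sub in sorted(table.get(ch, ())):
                word = non_word[:i] + sub + non_word[i + 1:]
                if word in WORD_FREQ:
                    found.append((word, kind))
    return found


def best_correction(non_word: str) -> Optional[str]:
    """Return the most frequent candidate word, or None if there is none."""
    ranked = sorted(candidates(non_word), key=lambda c: WORD_FREQ[c[0]], reverse=True)
    return ranked[0][0] if ranked else None


if __name__ == "__main__":
    for token in ("生邊", "根處"):
        print(token, "->", best_correction(token))
```

Ranking by frequency alone is the simplest possible choice here; a fuller system would combine it with the similarity type and the POS tag, as the abstract describes.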

1. Introduction

Spelling check, an automatic mechanism for detecting and correcting human spelling errors, is a common task in every written language. The number of people learning Chinese as a Foreign Language (CFL) has boomed in recent decades, and this number is expected to grow even larger in the years to come. However, unlike the English learning environment, where many learning techniques have been developed, tools to support CFL learners are relatively rare, especially those that can automatically detect and correct Chinese spelling and grammatical errors. For example, Microsoft Word does not yet support these functions for Chinese, although it has supported them for English for years. In the CLP 2014 Bake-off, essays written by CFL learners were collected for developing automatic spelling checkers. The aims are that, through such evaluation campaigns, more innovative computer-assisted techniques will be developed, more effective Chinese learning resources will be built, and state-of-the-art NLP techniques will be advanced for educational applications.

By analyzing the training data released for task 2 of the CLP 2014 Bake-off and the test data used in the SIGHAN Bake-off 2013, we find that the errors fall mainly into two types. The first is wrong characters that result in “non-words”, which are similar to OOV (out-of-vocabulary) words. For example, the writer may misspell “身邊” as “生邊”, and “根據” as “根處” (the former arises because the characters have similar pronunciation, and the latter because they have similar shape). These are not even words and of course do not exist in the vocabulary. The second type is words that are correct in the dictionary but incorrect in the sentence. Some of them are misspellings, like “情愛” in the phrase “情愛的王宜家”, which is a misspelling of the word “親愛”. But “情愛” can be found in the dictionary, so it is not a non-word. Others are words that are not used correctly, which usually happens when the writer does not understand their meaning clearly. For example, writers often confuse “在” and “再”, as in “高雄是再台灣南部一個現代化城市”. Here, “在” rather than “再” is the correct character. To distinguish them from non-words, we call these words “wrong words”. According to the statistics obtained from the training data of the CLP 2014 Bake-off, there are nearly 3,400 wrong words, about twice as many as the non-words (1,800).
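The distinction between the two error types can be made concrete with a dictionary lookup over annotated (written, intended) pairs, which is one plausible way to obtain counts like those above. This is a minimal sketch, not the authors' procedure: `LEXICON` is a hypothetical toy dictionary, `classify` is an illustrative helper, and the pairs are the four examples from the paragraph above.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical toy lexicon (an assumption, not the paper's dictionary).
LEXICON = {"身邊", "根據", "情愛", "親愛", "在", "再"}


def classify(written: str, intended: str) -> str:
    """Label an annotated token as 'correct', 'non-word' or 'wrong word'."""
    if written == intended:
        return "correct"
    # A misspelling that is not in the dictionary is a non-word.
    if written not in LEXICON:
        return "non-word"
    # A dictionary word that differs from the intended word is a wrong word.
    return "wrong word"


def count_error_types(pairs: List[Tuple[str, str]]) -> Counter:
    """Count error types over (written, intended) pairs."""
    return Counter(classify(w, t) for w, t in pairs)


if __name__ == "__main__":
    examples = [("生邊", "身邊"), ("根處", "根據"), ("情愛", "親愛"), ("再", "在")]
    print(count_error_types(examples))
```

Note that a lookup alone can flag non-words in unannotated text but cannot flag wrong words, since those are in the dictionary; this is why the system needs context, such as POS tags and dependency relations, to locate them. With the reported figures, the ratio of wrong words to non-words is 3,400 / 1,800, or roughly 1.9.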

Spelling check and correction is a traditional task in natural language processing. Pollock and Zamora (1984) built a misspelling dictionary for spelling check. Chang (1995) adopted a bi-gram language model to substitute the confusing character. Zhang et al. (2000) proposed an approximate word matching method to detect and correct spelling errors. Liu et al. (2011)

http://www.cipsc.org.cn/clp2014/webpage/cn/four_bakeoffs/Bakeoff2014cfp_ChtSpellingCheck_cn.htm

http://tm.itc.ntnu.edu.tw/CNLP/?q=node/27

