Neologism Detection Methods and Applications in Entity Attribute Knowledge Acquisition

Research paper


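Entity attribute knowledge acquisition is a key step in building a knowledge system based on an entity attribute framework, and it requires extracting attribute information from large real-world corpora. Such corpora contain many new words that conventional word segmentation tools often fail to recognize. Detecting these new words matters for the completeness and accuracy of knowledge acquisition, because they reflect how the language evolves and which topics are currently prominent.

This paper studies neologism detection in entity attribute knowledge acquisition and proposes a detection method for Chinese corpora. The method analyzes large-scale text to find new words that the segmentation program missed and revises the initial segmentation results, which improves the accuracy of the input to subsequent processing. It is suited to real corpora from different domains, such as technology, literature, and social media, because it can capture domain-specific emerging vocabulary, including technical terms and trending topics.

The core of neologism detection lies in understanding the context of words and applying probabilistic models. It typically involves the following steps:

1. **Preprocessing**: normalize the text, for example by removing stop words and punctuation, so that detection focuses on meaningful word groups.
2. **Stemming**: use stemming techniques (such as Porter or Snowball stemming) to reduce words to their base forms and reduce ambiguity. These stemmers target inflected languages such as English, so in a Chinese corpus this step applies only to embedded non-Chinese tokens.
3. **Statistical methods**: compute word frequencies and part-of-speech tagging information to judge whether a string is a new word. A string that is rare in the common vocabulary and shows particular part-of-speech features may be a new word.
4. **Context analysis**: use the surrounding context to check whether a candidate forms a distinct phrase or expression, which helps rule out false positives.
5. **Deep learning models**: use neural network models (such as LSTM or BERT) trained to capture latent differences between new words and regular vocabulary, improving detection accuracy.

The experimental results show that the proposed Chinese neologism detection method performs well on real corpora from different domains. It raises the detection rate of new words and provides more accurate data for subsequent steps such as entity attribute extraction and knowledge representation. This is significant for building a language knowledge system that adapts dynamically, and it offers a valuable research direction for language processing and artificial intelligence.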
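The context-analysis step listed above can be illustrated with a small sketch. It uses neighbor (branching) entropy, a common statistic for this purpose that is swapped in here as an assumption; the summary does not say which measure the paper uses. The function names and the sample string, built around 显卡 (XianKa, "video card"), are hypothetical.

```python
import math
from collections import Counter


def _entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbor characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def neighbor_entropy(text, candidate):
    """Return (left_entropy, right_entropy) of the characters adjacent to
    every occurrence of `candidate` in `text`.

    A real word tends to appear in varied contexts, so both values are high.
    A fragment of a longer word has a nearly fixed neighbor on one side,
    so that side's entropy is close to zero.
    """
    left, right = Counter(), Counter()
    start = text.find(candidate)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(candidate)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(candidate, start + 1)
    return _entropy(left), _entropy(right)


# Hypothetical toy text: three occurrences of 显卡 in different contexts.
sample = "买显卡了新显卡好这显卡贵"
full_word = neighbor_entropy(sample, "显卡")
fragment = neighbor_entropy(sample, "显")
```

In this toy text the full candidate has varied neighbors on both sides, while the single character 显 is always followed by 卡, so its right-side entropy is zero. A threshold on the smaller of the two entropies is one possible way to reject such fragments.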


Research on Neologism Detection in Entity Attribute Knowledge Acquisition

Ke Wang 1,2, Honglin Wu 1,*

1 College of Computer Science and Engineering, Northeastern University, Shenyang, 110169, China

2 Research Center for Artificial Intelligence, Shenyang Linge Technology Co., Ltd., Shenyang, 110004, China

* Corresponding Author: wuhl@mail.neu.edu.cn

Keywords: Neologism Detection; Entity Attribute; Knowledge Acquisition

Abstract. To construct the knowledge system of the entity attribute framework, attributes must be acquired from a large-scale real corpus. A real corpus inevitably contains neologisms that the word segmentation program cannot identify. This paper proposes a Chinese neologism detection method that can discover new words in a real corpus and can be used to revise the initial word segmentation results. The experimental results show that the proposed method performs well on real corpora from different fields and may provide more accurate input for subsequent processing.

Introduction

To construct the knowledge system of the entity attribute framework, attributes must be acquired from a large-scale real corpus. A real corpus inevitably contains neologisms that the word segmentation program cannot identify, so these strings need to be re-analyzed. The main task in this re-analysis is the detection of neologisms. The result of neologism detection can be used to revise the initial word segmentation results and may provide more accurate input for subsequent processing.

A neologism is a new word or expression in a language, or a new meaning for an existing word or expression. A corpus for a particular domain contains many domain-specific words. These words may not be included in the existing segmentation lexicon or in the original training corpus of the word segmentation model, so the segmentation system may segment them incorrectly, which affects the identification of entities and attributes. For example, the Chinese word XianKa (video card) will be segmented into two Chinese characters, Xian/v Ka/n. We must detect the neologism XianKa from the wrong segmentation Xian/v Ka/n so that the recall of the entity attribute recognition process can be ensured.
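As a rough illustration of how detected neologisms could be used to revise an initial segmentation such as Xian/v Ka/n, the sketch below greedily merges adjacent tokens whose concatenation appears in a set of detected new words. The function name, the `max_len` limit, and the choice to tag merged words as `n` are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)


def revise_segmentation(
    tokens: List[Token],
    neologisms: Set[str],
    new_tag: str = "n",   # assumed tag for merged words
    max_len: int = 4,     # assumed maximum number of tokens to merge
) -> List[Token]:
    """Merge runs of adjacent tokens that spell a detected neologism."""
    revised: List[Token] = []
    i = 0
    while i < len(tokens):
        merged = False
        # Try the longest span first so that longer neologisms win.
        for span in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "".join(word for word, _ in tokens[i:i + span])
            if candidate in neologisms:
                revised.append((candidate, new_tag))
                i += span
                merged = True
                break
        if not merged:
            revised.append(tokens[i])
            i += 1
    return revised


# Hypothetical initial segmentation containing the wrong split Xian/v Ka/n.
initial = [("Mai", "v"), ("Xian", "v"), ("Ka", "n")]
revised = revise_segmentation(initial, {"XianKa"})
```

Here `revised` keeps `Mai/v` unchanged and replaces the two tokens `Xian/v Ka/n` with the single token `XianKa/n`, which is the kind of corrected input a later entity and attribute recognizer would receive.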

Neologism detection is a basic research topic in natural language processing. At present, the main methods of neologism detection fall into two types: rule-based and statistics-based. Several methods have been proposed. One analyzes real corpora from the Internet, builds a large string set, and detects neologisms with filtering rules. Another uses the word formation rules of neologisms to establish a regular word library and a special word formation rule base, and combines the relevant filter conditions to identify new words in the corpus. Another considers the probability that a new word consists of certain Chinese characters, from the viewpoint of in-word probability by position. Another carries out detection by combining probabilistic statistical techniques with rules. Yet another calculates the covariance frequency among the entries, selects candidate words with a frequency threshold, and then determines the final new words by rule filtering and manual judgment.
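The last approach described above, a frequency threshold followed by rule filtering, can be sketched as follows. This is a minimal illustration under stated assumptions: it counts adjacent token pairs as the frequency statistic, and it uses a single hypothetical rule (discard candidates containing a stop token). The cited methods are not specified in enough detail here to reproduce them, and the manual judgment step is omitted.

```python
from collections import Counter


def candidate_pairs(segmented_sentences, min_count=2):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in segmented_sentences:
        for left, right in zip(tokens, tokens[1:]):
            counts[(left, right)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}


def rule_filter(candidates, stop_tokens):
    """Drop candidates that contain a stop token (a hypothetical filtering rule)
    and join the surviving pairs into candidate words."""
    return {
        left + right: count
        for (left, right), count in candidates.items()
        if left not in stop_tokens and right not in stop_tokens
    }


# Hypothetical initial segmentations, written in pinyin like the XianKa example.
corpus = [
    ["Xian", "Ka", "De", "JiaGe"],
    ["Mai", "Xian", "Ka"],
    ["Ka", "De", "XingNeng"],
]
frequent = candidate_pairs(corpus, min_count=2)
new_words = rule_filter(frequent, stop_tokens={"De"})
```

On this toy corpus two pairs pass the frequency threshold, `Xian Ka` and `Ka De`. The rule filter removes `Ka De` because it contains the stop token `De`, leaving `XianKa` as the only candidate. The dependence on hand-written rules such as the stop-token list is what makes approaches of this kind hard to port between domains.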

Most of these methods rely on auxiliary rules and are hard to port to other domains. In this paper, we combine the characteristics of domain text and propose a Chinese neologism detection method based on a statistical language model, which can discover new words in a real corpus. The proposed method performs well on the real corpus of

5th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2017)

This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Advances in Engineering, volume 126


