Research on Neologism Detection in Entity Attribute Knowledge
Acquisition
Ke Wang 1,2, Honglin Wu 1,*
1 College of Computer Science and Engineering, Northeastern University, Shenyang, 110169, China
2 Research Center for Artificial Intelligence, Shenyang Linge Technology Co., Ltd., Shenyang,
110004, China
* Corresponding Author: wuhl@mail.neu.edu.cn
Keywords: Neologism Detection; Entity Attribute; Knowledge Acquisition
Abstract. Constructing the knowledge system of the entity attribute framework requires attributes
to be extracted from a large-scale real corpus. A real corpus inevitably contains neologisms that
the word segmentation program cannot identify. This paper proposes a method of Chinese neologism
detection that can discover new words in a real corpus and can be used to revise the initial
results of word segmentation. The experimental results show that the proposed method performs well
on real corpora from different fields and may provide more accurate input for subsequent
processing.
Introduction
Constructing the knowledge system of the entity attribute framework requires attributes to be
extracted from a large-scale real corpus. A real corpus inevitably contains neologisms that the
word segmentation program cannot identify, so the affected strings need to be re-analyzed. The
main task in this re-analysis is the detection of neologisms. The result of neologism detection
can be used to revise the initial results of word segmentation and may provide more accurate input
for subsequent processing.
A neologism is a new word or expression in a language, or a new meaning for an existing word
or expression. A corpus for a particular domain contains many domain-specific words. These words
may not be included in the existing segmentation lexicon or in the original training corpus of the
word segmentation model, so the segmentation system may segment them incorrectly. This affects the
identification of entities and attributes. For example, the Chinese word XianKa (video card) is
segmented into two single-character words, Xian/v Ka/n. We must detect the neologism XianKa from
the wrong segmentation Xian/v Ka/n so that the recall of the entity attribute recognition process
can be ensured.
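As an illustration of how a detected neologism could be used to revise an initial segmentation, the following is a minimal sketch and not the method of this paper. It assumes that the segmenter output is a list of (word, tag) pairs and that a set of detected neologisms is already available. The function name merge_neologisms, the parameters max_span and new_tag, and the filler token HenHao are hypothetical and are introduced only for this example.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)


def merge_neologisms(tokens: List[Token], neologisms: Set[str],
                     max_span: int = 4, new_tag: str = "n") -> List[Token]:
    """Greedily merge runs of adjacent tokens whose concatenation is a detected neologism.

    max_span and new_tag are assumptions made for this sketch: the longest run of
    tokens that is tried, and the tag given to a merged word.
    """
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest span first so that longer neologisms take priority.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(word for word, _ in tokens[i:i + span])
            if candidate in neologisms:
                merged.append((candidate, new_tag))
                i += span
                matched = True
                break
        if not matched:
            # No neologism starts here; keep the original token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Toy input in the Xian/v Ka/n style of the example; HenHao is a placeholder token.
initial = [("Xian", "v"), ("Ka", "n"), ("HenHao", "a")]
revised = merge_neologisms(initial, {"XianKa"})
print(revised)
```

In this sketch the two fragments Xian/v and Ka/n are replaced by the single token XianKa, while tokens that are not part of any detected neologism pass through unchanged.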
Neologism detection is a basic research topic in natural language processing. At present, the
main methods of neologism detection fall into two types: rule-based and statistics-based. Several
approaches have been proposed. One analyzes real corpora from the Internet, builds a large string
set, and detects neologisms with filtering rules. Another uses the word formation rules of
neologisms to establish a regular word library and a special word formation rule base, and
combines them with filter conditions to identify new words in the corpus. A third considers the
probability that a new word consists of certain Chinese characters, in terms of the in-word
probability of a character at each position, and carries out detection by combining probabilistic
statistical techniques with rules. A fourth calculates the co-occurrence frequency among entries,
obtains candidate words with a frequency threshold, and then determines the final new words by
rule filtering and manual judgment.
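The statistical step of the last approach, counting how often adjacent entries occur together and keeping those that reach a frequency threshold, can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of any cited system: only adjacent pairs are counted, the function name candidate_words and the parameter min_count are hypothetical, and the subsequent rule filtering and manual decision are not shown.

```python
from collections import Counter
from typing import Dict, Iterable, List


def candidate_words(sentences: Iterable[List[str]], min_count: int = 2) -> Dict[str, int]:
    """Return concatenations of adjacent segments that occur at least min_count times.

    min_count is an assumed frequency threshold chosen for this sketch.
    """
    pair_counts: Counter = Counter()
    for tokens in sentences:
        # Count every pair of neighbouring segments as one candidate string.
        for left, right in zip(tokens, tokens[1:]):
            pair_counts[left + right] += 1
    return {word: count for word, count in pair_counts.items() if count >= min_count}


# Toy segmented corpus; all tokens other than Xian and Ka are placeholders.
corpus = [
    ["Xian", "Ka", "HenHao"],
    ["Mai", "Xian", "Ka"],
    ["Xian", "Ka", "Gui"],
]
candidates = candidate_words(corpus, min_count=2)
print(candidates)
```

On this toy corpus only XianKa reaches the threshold. In practice such candidates would still need further filtering, since frequent adjacent pairs are not necessarily words.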
Most of these methods rely on auxiliary rules and are therefore difficult to port to other
domains. In this paper, we take the characteristics of domain text into account and propose a
method of Chinese neologism detection based on a statistical language model, which can discover
new words in a real corpus. The proposed method performs well on the real corpus of
5th International Conference on Machinery, Materials and Computing Technology (ICMMCT 2017)
Copyright © 2017, the Authors. Published by Atlantis Press.
This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Advances in Engineering, volume 126