Semantic Word Rank Algorithm Based on the Relation Degree of the Words
Huijian Han
Shandong University of Finance and Economics
Shandong Prov. Key Lab of Digital Media Technology
Jinan, China
e-mail: hanhuijian@sdufe.edu.cn
Kai Fu
Shandong University of Finance and Economics
Shandong Prov. Key Lab of Digital Media Technology
Jinan, China
e-mail: 1253411257@qq.com
Xiusheng Sun
Shandong University of Finance and Economics
Shandong Prov. Key Lab of Digital Media Technology
Jinan, China
e-mail: 1061874536@qq.com
Zhenxian Li
Shandong University of Finance and Economics
Shandong Prov. Key Lab of Digital Media Technology
Jinan, China
e-mail: 765974663@qq.com
Abstract—With the continuous development of the Internet,
the volume of data is soaring, and question answering systems
play an increasingly important role in our lives. Current
question answering knowledge bases are mainly constructed
manually, which costs considerable manpower and material
resources and hinders the expansion of question answering
systems from a single field to all fields. Therefore, building
on previous research results, this paper focuses on the
construction of the domain lexicon and knowledge base and
proposes the semantic word rank (SWR) algorithm, which is
based on the relation degree of the words. By extracting the
subject words and characteristic words of each paragraph, a
knowledge base marked by these subject words and
characteristic words is constructed automatically. The
experimental results show that the SWR algorithm effectively
improves the accuracy of subject word and characteristic word
extraction, and that the resulting knowledge base is more
scientific and reasonable.
Keywords-Knowledge base; SWR algorithm; Subject words;
Characteristic words
I. INTRODUCTION
As website content keeps growing, the scale of websites is
also growing, and question answering systems play an
increasingly important role in our lives. However, because
computer information processing and human thinking are very
different, there is a considerable gap between the intended
result and the answer retrieved from a question answering
knowledge base that was constructed automatically by
computer. In most cases, therefore, manual annotation is used
to construct the knowledge base of a question answering
system. But because web data are vast, constructing the
knowledge base manually is clearly unrealistic. How to
construct the knowledge base of a question answering system
automatically has therefore become a hot issue in the field
of Natural Language Processing.
Since the 1980s, many domestic and overseas scholars have
done a great deal of work on knowledge base construction.
Representative knowledge bases include WordNet [1], MindNet
[2], ILD [3], and FrameNet [4] in English, and HowNet, The
Machine Tractable Dictionary of Contemporary Chinese
Predicate Verbs [5], and CCD [6] in Chinese. These knowledge
bases are mostly oriented to the general domain and
constructed manually, which requires extensive participation
of linguists or domain experts and consumes a lot of time and
manpower. At present, in addition to manual construction,
knowledge bases can also be constructed in semi-automatic and
automatic ways; the semi-automatic construction method
depends more heavily on domain experts.
This paper proposes a Semantic Word Rank (SWR) algorithm
based on semantic computation, combining a custom semantic
dictionary with keyword extraction technology. By extracting
the subject words and characteristic words of each paragraph,
a knowledge base marked by these subject words and
characteristic words is constructed automatically. The
experimental results show that the SWR algorithm extracts the
subject words and characteristic words more accurately, and
that the answers retrieved from the SWR-based knowledge base
meet people's requirements for getting accurate information
from information display websites.
II. TEXTRANK ALGORITHM
The TextRank [7] algorithm is the application of the PageRank
[8] algorithm to Natural Language Processing, especially text
processing. The word is the most basic element of a text
document, and different words arranged in different orders
constitute a paragraph. It is the collocations of words that
express the main subjects of the paragraph. In the TextRank
model, the words in the text map to web pages, and the
connections between words map to hyperlinks between web
pages. A text document is thus converted into a network, and
the more important a word is in the text, the more likely it
is to be the main subject of the paragraph.
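The mapping from text to network can be illustrated with a minimal Python sketch. The text does not say how a "connection" between two words is defined, so the sketch assumes the common TextRank choice of linking words that co-occur within a small sliding window. The function name `build_word_graph`, the `window` parameter, and the sample tokens are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def build_word_graph(words, window=2):
    """Return an undirected graph as a dict: word -> set of linked words.

    Assumption: two words are linked when they appear within `window`
    positions of each other (window=2 links only adjacent words).
    """
    graph = defaultdict(set)
    for i, word in enumerate(words):
        # Look at the words that follow `word` inside the window.
        for other in words[i + 1 : i + window]:
            if other != word:
                # Undirected edge: record the link in both directions.
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


# Hypothetical, already segmented and filtered token sequence.
tokens = ["knowledge", "base", "question", "answering", "system",
          "knowledge", "base"]
graph = build_word_graph(tokens, window=2)

for word, neighbours in graph.items():
    print(word, sorted(neighbours))
```

Each word plays the role of a web page and each co-occurrence link plays the role of a hyperlink. A larger window produces a denser network.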
The general TextRank model is represented as an undirected
graph $G = (V, E)$, consisting of the vertex set $V$ and the
edge set $E$. Because $G$ is an undirected graph, for any
given vertex $V_i$ we no longer distinguish $In(V_i)$ and
$Out(V_i)$, which are collectively referred to as
$Link(V_i)$. $|Link(V_i)|$ is the number of nodes that are
linked to $V_i$. $d$ is a damping coefficient whose value is
usually 0.85. The value $S(V_i)$ for node $V_i$ is defined by
the following formula:
$$S(V_i) = (1 - d) + d \cdot \sum_{j \in Link(V_i)} \frac{1}{|Link(V_j)|} S(V_j) \tag{1}$$
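Equation (1) is usually evaluated by repeated substitution until the scores stop changing. The following Python sketch shows one way to do this. Each node's new score is (1 - d) plus d times the sum, over its linked nodes, of the neighbour's score divided by the neighbour's number of links. The initial score of 1.0, the stopping tolerance `tol`, the iteration cap `max_iter`, and the small example graph are assumptions made for illustration and are not taken from the paper.

```python
def textrank(graph, d=0.85, max_iter=100, tol=1e-6):
    """Iterate the undirected TextRank update until scores stabilise.

    `graph` maps each node to the set of nodes linked to it; every
    node is assumed to have at least one link.
    """
    scores = {node: 1.0 for node in graph}  # assumed initial value
    for _ in range(max_iter):
        new_scores = {}
        for node, neighbours in graph.items():
            # Sum over linked nodes j of S(V_j) / |Link(V_j)|.
            rank_sum = sum(scores[nb] / len(graph[nb]) for nb in neighbours)
            new_scores[node] = (1 - d) + d * rank_sum
        # Largest change of any score in this round.
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores


# Hypothetical star-shaped word graph: "base" is linked to every other word.
example_graph = {
    "base": {"knowledge", "question", "system"},
    "knowledge": {"base"},
    "question": {"base"},
    "system": {"base"},
}
scores = textrank(example_graph)

for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

In this example the most connected word, "base", receives the highest score, which matches the idea that more important words are more likely to be the main subject of the paragraph.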