International Journal of Network Security, Vol.17, No.5, PP.637-642, Sept. 2015 637
A Component Histogram Map Based Text
Similarity Detection Algorithm
Huajun Huang, Shuang Pang, Qiong Deng, and Jiaohua Qin
(Corresponding author: Huajun Huang)
College of Computer and Information Engineering, Central South University of Forestry and Technology
Changsha 410004, China
(Email: hhj0906@163.com)
(Received Apr. 10, 2015; revised and accepted May 16 & May 24, 2015)
Abstract
Conventional text similarity detection usually uses word frequency vectors to represent texts, but such vectors are high-dimensional and sparse. In this research, a new text similarity detection algorithm using a component histogram map (CHM-TSD) is proposed. This method is based on the mathematical expression of Chinese characters, with which Chinese characters can be split into components. The occurrence frequency of each component in a text is then counted to build the component histogram map (CHM), which serves as the text's characteristic vector. Four distance formulas are compared to determine which performs best in text similarity detection. The experimental results indicate that CHM-TSD achieves better precision, recall and F1 than the cosine theorem and the Jaccard coefficient.
Keywords: Component histogram map, distance calcula-
tion, text similarity detection
1 Introduction
As a branch of natural language processing, text similarity detection is increasingly important for information security. It has been used in many fields such as information retrieval (IR), duplicate detection, and data clustering and classification [3]. In general, there are two approaches to text similarity detection: one based on semantic similarity, and the other non-semantic. Semantic similarity detection is usually based on dictionary computation, using resources like HowNet [13] and WordNet [4]. Huang proposed a method that combined an external dictionary with TF-IDF to compute text similarity [5]. Some researchers also use a large-scale corpus for semantic similarity detection [7], but this is uncommon because of its disadvantages. Non-semantic similarity detection mostly uses two methods: word frequency statistics and string comparison. The most commonly used word frequency statistics method is the VSM [11, 12]; the text similarity can then be computed through the cosine theorem [14] or the Jaccard coefficient [10]. On the other hand, Shingling [15] and the maximum string matching algorithm [6] are often used for string comparison. All of the methods above perform well in certain situations, but they also have shortcomings. For example, the dictionary-based semantic method depends too heavily on the person and the knowledge base to express the sense of a word exactly, while word frequency statistics produce vectors that are very high-dimensional and sparse [8].
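The two word-frequency baselines mentioned above can be sketched briefly. This is a minimal illustration, not the paper's implementation: the texts, tokenization by whitespace, and the helper names are assumptions for the example.

```python
from collections import Counter
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two word-frequency vectors.
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_similarity(a, b):
    # Ratio of shared distinct words to all distinct words.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

t1 = Counter("the cat sat on the mat".split())
t2 = Counter("the cat lay on the rug".split())
print(cosine_similarity(t1, t2))   # 0.75
print(jaccard_similarity(t1, t2))  # 3/7
```

Note that both measures operate over the full vocabulary of the corpus, which is exactly where the high-dimensional, sparse representation arises.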
To address these problems, a new Chinese text similarity detection method is proposed. This method uses the CHM (component histogram map) to avoid the high-dimensional and sparse problem. Its basis is the mathematical expression of Chinese characters [9], which is used to split Chinese characters into components; the components are then taken as the research object. Because components combine with each other to compose Chinese characters, they are mutually correlative. The CHM is built from each component's occurrence frequency, and the distance between a text and a duplicate text is then calculated with the Bhattacharyya formula. The results show that CHM-TSD performs better than the cosine theorem and the Jaccard coefficient.
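The pipeline just described (split characters into components, build a normalized component histogram, compare histograms with the Bhattacharyya formula) can be sketched as follows. The component table below is a toy stand-in: the actual decomposition comes from the mathematical expression of Chinese characters [9], which is not reproduced here.

```python
import math

# Hypothetical component table; the real method derives components from the
# mathematical expression of Chinese characters, not from this toy mapping.
COMPONENTS = {
    "好": ["女", "子"],
    "妈": ["女", "马"],
    "骂": ["口", "口", "马"],
}

def component_histogram(text):
    # Count each component's occurrence frequency, normalized to sum to 1.
    counts = {}
    for ch in text:
        for comp in COMPONENTS.get(ch, [ch]):
            counts[comp] = counts.get(comp, 0) + 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def bhattacharyya_distance(p, q):
    # Bhattacharyya coefficient BC = sum_i sqrt(p_i * q_i);
    # distance = -ln(BC), 0 for identical distributions.
    bc = sum(math.sqrt(p[c] * q.get(c, 0.0)) for c in p)
    return -math.log(bc) if bc > 0 else float("inf")

h1 = component_histogram("好妈")
h2 = component_histogram("妈骂")
print(bhattacharyya_distance(h1, h2))  # small positive distance
```

Because the component inventory is far smaller than a word vocabulary, the resulting feature vectors are compact and dense rather than high-dimensional and sparse.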
2 Related Theories
In the process of text duplicate detection, text feature representation and similarity detection are two very important steps [12]. VSM is the most common method for text feature representation. Assuming $d_i$ is the $i$-th text and $W_{i,j}$ is the weight of the $j$-th word of $d_i$, the $i$-th text can be represented as $\vec{d}_i = (W_{i,1}, W_{i,2}, \cdots, W_{i,n})$, so all the texts in the experiment compose a vector space $D = (\vec{d}_1, \vec{d}_2, \cdots, \vec{d}_n)$. The similarity of each pair of texts can be computed as the distance between two vectors through cosine