BD-ACI-BCA:
$$w_u = \frac{1+\log(f_{u,i})}{(1-s)+s\,W_i/\bar{W}_i}\,\log\!\left(1+\frac{f^m_u}{f^d_u}\right) \tag{3}$$

AB-AFD-BAA (Okapi):
$$w_u = \frac{f_{u,i}}{f_{u,i}+\tau_i/\bar{\tau}_i}\,\log\!\left(1+\frac{n}{f^d_u}\right) \tag{4}$$

BI-ACI-BCA:
$$w_u = \frac{1+\log(f_{u,i})}{(1-s)+s\,W_i/\bar{W}_i}\left(1-\frac{n_u}{\log_2(n)}\right) \tag{5}$$

Lnu.ltu (SMART):
$$w_u = \frac{\left(1+\log(f_{u,i})\right)/\left(1+\log(\bar{f}_{u,i})\right)}{(1-s)+s\,\tau_i/\bar{\tau}_i}\,\log\!\left(\frac{n}{f^d_u}\right) \tag{6}$$
where $f_{u,i}$ is the term frequency of the $u$th word associated with the $i$th document, $f^d_u$ is the document frequency of term $u$, $f^m_u$ is the largest $f^d_u$ over all $u$, $W_i$ is the document vector $\ell_2$ norm, i.e., $W_i = \|x_i\|_2$, $\bar{W}_i$ is the average $W_i$ in the entire dataset, $\tau_i$ and $\bar{\tau}_i$ are the number of unique terms in document $i$ and the average number of unique terms, respectively, $s$ is a slope parameter (set to 0.7 [19], [28]), and $n_u$ is a noise measure of term $u$ [27], [28]. The NORM weighting was recently used in [20], [21], and [23]; the other four schemes, which are well-known weighting methods, were used in [19] and [28].
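To make these schemes concrete, the following minimal sketch (ours, not the authors' code; it assumes the frequency statistics have already been collected, and all function and argument names are illustrative) computes the BD-ACI-BCA weight of (3) and the Okapi weight of (4) for a single term:

```python
import math

def bd_aci_bca_weight(f_ui, f_d_u, f_m_u, W_i, W_bar, s=0.7):
    """BD-ACI-BCA weight of (3) for term u in document i.

    f_ui  : term frequency of term u in document i (assumed >= 1)
    f_d_u : document frequency of term u
    f_m_u : largest document frequency over all terms
    W_i   : l2 norm of the document vector x_i
    W_bar : average l2 norm over the entire dataset
    s     : slope parameter (0.7, as in the paper)
    """
    tf_part = 1.0 + math.log(f_ui)
    norm_part = (1.0 - s) + s * W_i / W_bar
    idf_part = math.log(1.0 + f_m_u / f_d_u)
    return tf_part / norm_part * idf_part

def okapi_weight(f_ui, f_d_u, tau_i, tau_bar, n):
    """AB-AFD-BAA (Okapi) weight of (4) for term u in document i.

    tau_i   : number of unique terms in document i
    tau_bar : average number of unique terms per document
    n       : number of documents in the collection
    """
    tf_part = f_ui / (f_ui + tau_i / tau_bar)
    idf_part = math.log(1.0 + n / f_d_u)
    return tf_part * idf_part
```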
C. Dimensionality Reduction
A document set can be represented by $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, a rectangular matrix of terms and documents. The goal of latent semantic analysis is to produce a set $Y$ that represents $X$ accurately but resides in a lower dimensional space. $Y$ is of dimension $d$, with $d \ll m$, and it is produced by the form
$$Y = V_g^T X \tag{7}$$
where $V_g$ is an $m \times d$ linear transformation matrix. Thus, it is straightforward to replace each document $x_i$ by its projection $y_i = V_g^T x_i$, which makes between- and within-document comparisons easy in the lower dimensional latent semantic space.
There are a number of ways to accomplish this projection. The transformation matrix $V_g$ can be obtained by traditional techniques such as PCA, LSI, or other dimensionality reduction approaches [3]. In this study, we use classical PCA, a well-known dimensionality reduction technique, to determine $V_g$. In PCA, $V_g$ is determined by maximizing the variance of the projected vectors
$$\max_{V_g} \sum_{i=1}^{n} \left\| y_i - \frac{1}{n}\sum_{j=1}^{n} y_j \right\|_2^2. \tag{8}$$
It has been shown that the matrix $V_g$ is the set of eigenvectors of the sample covariance matrix associated with the $d$ largest eigenvalues. Keep this in mind, as we will use this set of global representations $\{y_1, y_2, \ldots, y_n\}$ to formulate a hybrid similarity of two documents (see Section VI).
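As a concrete illustration of (7) and (8), the following minimal NumPy sketch (ours, not the authors' implementation; names are illustrative) computes $V_g$ as the top-$d$ eigenvectors of the sample covariance matrix and projects every document:

```python
import numpy as np

def pca_projection(X, d):
    """Compute V_g (m x d) and the projected documents Y = V_g^T X of (7).

    X : m x n term-document matrix (columns are documents x_i)
    d : target dimensionality, with d << m
    """
    mean = X.mean(axis=1, keepdims=True)        # average document vector
    Xc = X - mean                               # center the documents
    cov = Xc @ Xc.T / (X.shape[1] - 1)          # m x m sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues, ascending order
    V_g = eigvecs[:, ::-1][:, :d]               # eigenvectors of d largest
    Y = V_g.T @ X                               # projection of (7)
    return V_g, Y
```

For a large vocabulary, forming the $m \times m$ covariance matrix explicitly is costly; a truncated SVD of the centered matrix yields the same leading eigenvectors more cheaply.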
IV. WORD AFFINITY GRAPH
This section introduces a scheme to produce an in-depth document representation. First, we segment each document into paragraphs. Second, we build a word affinity graph, which describes the local information of each document.
A. Document Segmentation
As mentioned before, the major drawback of traditional modeling methods such as PCA and LSI is that they lack a description of term associations and spatial distribution information over the reduced space. In this study, we propose a new document representation that contains this description. First, each document is segmented into paragraphs. Since we consider only HTML documents in this paper, a Java platform was developed to implement the segmentation. For documents in HTML format, we can use the HTML tags to identify paragraphs easily. Before document segmentation, we first filter out the formatted text that appears within the HTML tags; this text is not counted toward word counts or document features. The overall document partitioning process can be summarized as follows [20], [23].
1) Partition a document into blocks using the HTML tags: "<p>," "<br/>," "<li>," "</td>," etc.
2) Merge subsequent blocks to form a new paragraph until the total number of words of the merged blocks exceeds a paragraph threshold (set at 50).
3) Merge a new block with the previous paragraph if its total number of words does not exceed the minimum threshold (set at 30).
For HTML documents, it is noted that there is no rule on the minimum/maximum number of words for paragraphs [20]. Setting thresholds on word counts, however, still enables us to control the number of paragraphs in each document flexibly and to remove blocks that contain only a few words (e.g., titles) by attaching them to the real paragraph blocks, as in the sketch below. It is worth pointing out that we could further partition each paragraph into sentences by marking periods (the tag "\.") to form a finer structure so that more semantics can be included.
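A minimal sketch of steps 1)-3) follows (the tags and thresholds come from the list above; the regex-based splitting and helper names are our simplification, not the Java platform used in the study, and a production implementation would use a real HTML parser):

```python
import re

PARAGRAPH_THRESHOLD = 50   # step 2): words needed to close a paragraph
MINIMUM_THRESHOLD = 30     # step 3): smallest acceptable standalone block

def segment_html(html):
    """Split an HTML document into paragraphs following steps 1)-3)."""
    # Step 1): split into blocks on paragraph-like tags.
    blocks = re.split(r'(?i)<p[^>]*>|<br\s*/?>|<li[^>]*>|</td>', html)
    # Strip any remaining markup so tags are not counted as words.
    blocks = [re.sub(r'(?s)<[^>]+>', ' ', b).split() for b in blocks]
    blocks = [b for b in blocks if b]

    paragraphs, current = [], []
    for block in blocks:
        # Step 2): keep merging blocks until the paragraph threshold is met.
        current.extend(block)
        if len(current) >= PARAGRAPH_THRESHOLD:
            paragraphs.append(current)
            current = []
    # Step 3): attach a trailing undersized block to the previous paragraph.
    if current:
        if paragraphs and len(current) < MINIMUM_THRESHOLD:
            paragraphs[-1].extend(current)
        else:
            paragraphs.append(current)
    return [' '.join(p) for p in paragraphs]
```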
B. Word Affinity Graph
Building a word affinity graph for each document represents the frequency of term cooccurrence in a paragraph. Consider a graph denoted by a matrix $G_i \in \mathbb{R}^{m \times m}$, in which each element $g_{i,u,v}$ ($u, v = 1, 2, \ldots, m$) is defined by
$$g_{i,u,v} = \begin{cases} F_{u,v} \cdot \log_2(n/DF_{u,v}) / \|G_i\|_2, & u \neq v \\ f^t_u \cdot \log_2(n/f^d_u) / \|G_i\|_2, & u = v \end{cases} \tag{9}$$
where $\|\cdot\|_2$ is the Frobenius norm, $F_{u,v}$ is the frequency of cooccurrence in a paragraph associated with the terms $u$ and $v$ in the $i$th document, $DF_{u,v}$ is the document frequency with which the terms $u$ and $v$ coappear in a document, and the notations $f^t_u$ and $f^d_u$ are as described in (1).
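As a concrete illustration of (9), here is a minimal sketch (ours; it assumes the paragraph cooccurrence counts and document frequencies have been gathered beforehand, and all array names are illustrative):

```python
import numpy as np

def word_affinity_graph(F, tf, DF, df, n):
    """Build the normalized word affinity matrix G_i of (9) for one document.

    F  : m x m array, F[u, v] = cooccurrence count of terms u and v in a
         paragraph of document i
    tf : length-m array of term frequencies f^t_u in document i
    DF : m x m array, DF[u, v] = number of documents where u and v coappear
    df : length-m array of document frequencies f^d_u
    n  : number of documents in the collection
    """
    G = np.zeros_like(F, dtype=float)
    mask = DF > 0
    # Off-diagonal entries (u != v): cooccurrence weighted by a pairwise IDF.
    G[mask] = F[mask] * np.log2(n / DF[mask])
    # Diagonal entries (u == v): term frequency weighted by the ordinary IDF.
    diag = np.zeros(len(tf))
    dmask = df > 0
    diag[dmask] = tf[dmask] * np.log2(n / df[dmask])
    np.fill_diagonal(G, diag)
    # Normalize by the Frobenius norm, as in (9).
    norm = np.linalg.norm(G, 'fro')
    return G / norm if norm > 0 else G
```

Note that if we do not consider term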