partition a document into paragraphs and further partition each
paragraph into sentences for HTML documents (see Section 3.1), building two vocabularies of different sizes (see Section 3.2), and constructing the multilevel representation (see Section 3.3).
3.1. Document segmentation
We propose a hierarchical multilevel representation of documents that contain text content only. To extract the multilevel structure, a document is segmented into paragraphs, which are further segmented into sentences. We consider only HTML documents in this paper and developed a Java platform to implement this segmentation. In an HTML document, the HTML tags can be used to easily identify paragraphs, which are then partitioned into sentences at period marks. Before document segmentation, we first filter out the formatted text that appears within the HTML tags; this text is not counted in the word counts or document features.
The overall document partitioning process can be summarized
as follows:
1. Partition the document into blocks using the HTML tags "<p>", "<br/>", "<li>", "</td>", etc.
2. Merge subsequent blocks to form a new paragraph until the total number of words in the merged blocks exceeds a paragraph threshold (set at 50 in this paper). A new block is merged with the previous paragraph if the total number of words in the paragraph exceeds the minimum threshold (set at 30).
3. Partition each generated paragraph into sentences using the period mark ".".
For HTML documents, it is noted that there is no rule for the minimum or maximum number of words in a paragraph. However, the use of word-count thresholds still enables us to flexibly control the number of paragraphs in each document, and ensures that blocks containing only a few words (e.g., titles) are attached to the real paragraph blocks. In this way, we build a hierarchical multilevel structure (or tree structure) that describes the semantic information from a global data view down to a local data view.
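A minimal sketch of this segmentation procedure is given below (an illustrative Python implementation using the tag set and word-count thresholds stated above; the function and variable names are hypothetical, and the authors' Java platform is not reproduced):

import re

# Illustrative sketch of the block segmentation and threshold-based merging
# described above; tag set and thresholds follow the text, all names are hypothetical.
BLOCK_TAGS = re.compile(r"</?(?:p|br|li|td)\b[^>]*>", re.IGNORECASE)
OTHER_TAGS = re.compile(r"<[^>]+>")

PARAGRAPH_THRESHOLD = 50  # close a paragraph once it exceeds this many words
MIN_PARAGRAPH = 30        # shorter leftover blocks (e.g. titles) join the previous paragraph

def word_count(text):
    return len(text.split())

def segment(html):
    # 1. Split into blocks at paragraph-breaking tags and strip the remaining markup.
    blocks = [OTHER_TAGS.sub(" ", b).strip() for b in BLOCK_TAGS.split(html)]
    blocks = [b for b in blocks if b]

    # 2. Merge consecutive blocks until the running word count exceeds the threshold.
    paragraphs, current = [], []
    for block in blocks:
        current.append(block)
        if word_count(" ".join(current)) > PARAGRAPH_THRESHOLD:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        leftover = " ".join(current)
        if paragraphs and word_count(leftover) < MIN_PARAGRAPH:
            paragraphs[-1] += " " + leftover   # attach a short tail to the previous paragraph
        else:
            paragraphs.append(leftover)

    # 3. Split each paragraph into sentences at periods.
    return [[s.strip() for s in p.split(".") if s.strip()] for p in paragraphs]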
Thus the document contents are structured in a 'document-paragraphs-sentences' hierarchy. This is a simple way to generate a hierarchical structure. It could be further improved by a finer segmentation, such as 'document-sections-pages-paragraphs-sentences', but this would require a more complex segmentation algorithm.
3.2. Vocabulary construction
The main text contents are first separated from the HTML tags. We then extract words from all the documents in a dataset and apply stemming to each word. Stems are used as basic features instead of the original words; thus 'program', 'programs', and 'programming' are all treated as the same word. We remove the stop words (a set of common words such as 'a', 'the', 'are', etc.) and store the stemmed words together with their term frequency $f_t$ (the frequency of a word over all documents) and document frequency $f_d$ (the number of documents in which a word appears). In order to form a histogram vector for each document, we need to construct the word vocabulary to which each histogram vector refers. Based on the stored term frequency $f_t$ and document frequency $f_d$, we use a simple term-weighting measure, similar to tf-idf, to calculate the weight of each word:

$W_t = \sqrt{f_t} \cdot \mathrm{idf}$,    (1)

where the inverse document frequency is $\mathrm{idf} = \log_2(N/f_d)$ and $N$ is the total number of documents in the dataset. It is noted that this term-weighting measure can be replaced by other feature selection criteria [20].
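For instance (an illustrative calculation, not taken from the paper's experiments), a word occurring $f_t = 100$ times over a collection of $N = 1000$ documents and appearing in $f_d = 50$ of them receives the weight $W_t = \sqrt{100} \cdot \log_2(1000/50) = 10 \times \log_2 20 \approx 43.2$.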
The words are then sorted in descending order according to their weights. Here, we construct two vocabularies, denoted $V_1$ and $V_2$, respectively. $V_1$ is used to form the histogram vectors of documents and paragraphs, whereas $V_2$ is used to form the signatures of sentences (see Section 3.3). The first $N_1$ words are selected to construct the vocabulary $V_1$ and the first $N_2$ words are selected to construct the vocabulary $V_2$. The vocabulary $V_1$ is mainly used for DR (see Section 4), and the vocabulary $V_2$ is used for further sentence sorting (see Section 5). The vocabulary size $N_2$ is supposed to be much larger than $N_1$. According to our empirical study [3,4], using all the words in the dataset to construct the vocabulary $V_1$ does not necessarily improve the DR accuracy, because some words may be noisy features for some topics. Most document modeling approaches [5-10], however, use all the words to form the basic histogram vectors for DR. Efficient feature selection for DR is still an open problem, which we leave to other researchers. We also conducted detailed experiments to evaluate the performance for different choices of the vocabulary sizes (see Sections 6.4.3 and 6.4.4).
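As a concrete illustration, the vocabulary construction can be sketched as follows (an illustrative Python implementation, not the authors' code; the stemmer and stop-word list are assumed to be supplied, and the sizes $N_1$ and $N_2$ are free parameters):

import math
from collections import Counter

def build_vocabularies(documents, stem, stop_words, n1, n2):
    # documents: list of raw text strings; stem: a stemming function (assumed given).
    # Returns the vocabularies V1 (size n1) and V2 (size n2), with n2 >> n1.
    N = len(documents)
    term_freq = Counter()   # f_t: frequency of a stem over all documents
    doc_freq = Counter()    # f_d: number of documents containing the stem

    for text in documents:
        stems = [stem(w) for w in text.lower().split() if w not in stop_words]
        term_freq.update(stems)
        doc_freq.update(set(stems))

    # Weight each stem as in Eq. (1): W_t = sqrt(f_t) * log2(N / f_d).
    weights = {t: math.sqrt(term_freq[t]) * math.log2(N / doc_freq[t])
               for t in term_freq}

    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n1], ranked[:n2]   # V1 for histograms, V2 for sentence signatures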
3.3. Multilevel representation
After the vocabulary construction, we use the document segmentation procedure (see Section 3.1) to partition each document in the dataset and generate the multilevel representation in the form of the 'document-paragraphs-sentences' structure. The top level contains the histogram of the whole document, and the second level contains the histograms of its paragraphs. Each element of a histogram indicates the number of times the corresponding word of the vocabulary $V_1$ appears in the document or paragraph. The third level, used for sentences, differs from the upper two levels: instead of histograms, it stores the index numbers of the words of the vocabulary $V_2$ that appear in each sentence, indicating the presence or absence of these words. This architecture has two advantages: it saves storage space (computational efficiency) and improves the detection accuracy (accuracy efficiency) because it examines the document more locally.
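The resulting three-level structure can be pictured as a small data structure (an illustrative sketch with hypothetical field names): histograms over $V_1$ at the document and paragraph levels, and lists of $V_2$ word indices at the sentence level.

from dataclasses import dataclass
from typing import List

@dataclass
class MultilevelDocument:
    doc_histogram: List[int]                 # level 1: word counts over vocabulary V1
    paragraph_histograms: List[List[int]]    # level 2: one V1 histogram per paragraph
    sentence_indices: List[List[List[int]]]  # level 3: per paragraph, per sentence, indices into V2

def histogram(words, vocab_index):
    h = [0] * len(vocab_index)
    for w in words:
        if w in vocab_index:
            h[vocab_index[w]] += 1
    return h

def build_representation(paragraphs, V1, V2):
    # paragraphs: list of paragraphs, each a list of sentences, each a list of stems.
    idx1 = {w: i for i, w in enumerate(V1)}
    idx2 = {w: i for i, w in enumerate(V2)}
    doc_words = [w for p in paragraphs for s in p for w in s]
    return MultilevelDocument(
        doc_histogram=histogram(doc_words, idx1),
        paragraph_histograms=[histogram([w for s in p for w in s], idx1) for p in paragraphs],
        sentence_indices=[[sorted({idx2[w] for w in s if w in idx2}) for s in p]
                          for p in paragraphs],
    )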
Since the document level and paragraph level are mainly used for DR (see Section 4), we apply PCA, a well-known dimensionality reduction tool, to the word histogram vectors of the whole document and the segmented paragraphs. Here, PCA is employed to project higher dimensional data into a lower dimensional latent semantic space without losing much statistical information.
We first normalize the histogram vector $H^d_i = [h^d_t]$ ($t = 1, 2, \ldots, N_1$) of the $i$th document:

$h^d_t = \dfrac{n_t}{\sum_{t=1}^{N_1} n_t} \log_2 \dfrac{N}{f^D_t}$,    (2)
where $n_t$ is the frequency of the $t$th word in the vocabulary, and $f^D_t$ is the document frequency of the $t$th word.
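A minimal sketch of this normalization is given below (illustrative only; numpy is used for brevity):

import numpy as np

def normalize_histogram(counts, doc_freq, num_docs):
    # counts: raw word counts n_t over vocabulary V1 for one document;
    # doc_freq: document frequencies f^D_t of the same words; num_docs: N.
    counts = np.asarray(counts, dtype=float)
    doc_freq = np.asarray(doc_freq, dtype=float)
    # Eq. (2): h_t = (n_t / sum_t n_t) * log2(N / f^D_t)
    return (counts / counts.sum()) * np.log2(num_docs / doc_freq)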
We then use the normalized histograms to construct the PCA projection matrix $B$. To reduce the computational burden, we apply PCA only at the document level. We have used the MATLAB tool [42] to compute the projection matrix. The compressed histogram vector $F^d_i = [f^d_u]$ ($u = 1, 2, \ldots, N_F$) of the $i$th document is calculated as