and [51] to improve the effectiveness by effectively using
closed patterns in text mining. In addition, a two-stage
model that used both term-based methods and pattern-
based methods was introduced in [26] to significantly
improve the performance of information filtering.
Natural language processing (NLP) is a modern compu-
tational technology that can help people to understand the
meaning of text documents. For a long time, NLP was
struggling for dealing with uncertainties in human lan-
guages. Recently, a new concept-based model [45], [46] was
presented to bridge the gap between NLP and text mining,
which analyzed terms on the sentence and document levels.
This model included three components. The first compo-
nent analyzed the semantic structure of sentences; the
second component constructed a conceptual ontological
graph (COG) to describe the sematic structures; and the last
component extracted top concepts based on the first two
components to build feature vectors using the standard
vector space model. The advantage of the concept-based
model is that i t can effectively discriminate between
nonimportant terms and meaningful terms which describe
a sentence meaning. Compared with the above methods,
the concept-based model usually relies upon its employed
NLP techniques.
3PATTERN TAXONOMY MODEL
In this paper, we assume that all documents are split into
paragraphs. So a given document d yields a set of paragraphs
PSðdÞ. Let D be a training set of documents, which consists
of a set of positive documents, D
þ
; and a set of negative
documents, D
. Let T ¼ft
1
;t
2
; ...;t
m
g be a set of terms (or
keywords) which can be extracted from the set of positive
documents, D
þ
.
3.1 Frequent and Closed Patterns
Given a termset X in document d,
-
-
X
-
-
is used to denote the
covering set of X for d, which includes all paragraphs dp 2
PSðdÞ such that X dp, i.e.,
-
-
X
-
-
¼fdpjdp 2 PSðdÞ;X dpg.
Its absolute support is the number of occurrences of X in
PSðdÞ, that is sup
a
ðX Þ¼j
-
-
X
-
-
j. Its relative support is the
fraction of the paragraphs that contain the pattern, that is,
sup
r
ðXÞ¼
j
-
-
X
-
-
j
jPSðdÞj
.
A termset X is called frequent pattern if its sup
r
(or sup
a
)
min
sup, a minimum support.
Table 1 lists a set of paragraphs for a given document d,
where PSðdÞ¼fdp
1
;dp
2
; ...;dp
6
g, and duplicate terms were
removed. Let min
sup ¼ 50%, we can obtain ten frequent
patterns in Table 1 using the above definitions. Table 2
illustrates the ten frequent patterns and their covering sets.
Not all frequent patterns in Table 2 are useful. For
example, pattern ft
3
;t
4
g always occurs with term t
6
in
paragraphs, i.e., the shorter pattern, ft
3
;t
4
g, is always a part
of the larger pattern, ft
3
;t
4
;t
6
g, in all of the paragraphs.
Hence, we believe that the shorter one, ft
3
;t
4
g, is a noise
pattern and expect to keep the larger pattern, ft
3
;t
4
;t
6
g, only.
Given a termset X, its covering set
-
-
X
-
-
is a subset of
paragraphs. Similarly, given a set of paragraphs Y PSðdÞ,
we can define its termset, which satisfies
termsetðY Þ¼ftj8dp 2 Y ¼>t2 dpg:
The closure of X is defined as follows:
ClsðXÞ¼termsetð
-
-
X
-
-
Þ:
A pattern X (also a termset) is called closed if and only if
X ¼ ClsðXÞ.
Let X be a closed pattern. We can prove that
sup
a
ðX
1
Þ < sup
a
ðXÞ; ð1Þ
for all patterns X
1
X; otherwise, if sup
a
ðX
1
Þ¼sup
a
ðXÞ,
we have
-
-
X
1
-
-
¼
-
-
X
-
-
;
where sup
a
ðX
1
Þ and sup
a
ðXÞ are the absolute support of
pattern X
1
and X, respectively.
We also have
ClsðXÞ¼termsetð
-
-
X
-
-
Þ¼termsetð
-
-
X
1
-
-
ÞX
1
X;
that is, ClsðXÞ 6¼ X.
3.2 Pattern Taxonomy
Patterns can be structured into a taxonomy by using the
is-a (or subset) relation. For the example of Table 1, where
we have illustrated a set of paragraphs of a document,
and the discovered 10 frequent patterns in Table 2 if
assuming min
sup ¼ 50%. There are, however, only three
32 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 1, JANUARY 2012
TABLE 1
A Set of Paragraphs
TABLE 2
Frequent Patterns and Covering Sets