• The categorization must work reliably in spite of textual errors.
• The categorization must be efficient, consuming as little storage and processing time as possible, because of the sheer volume of documents to be handled.
• The categorization must be able to recognize when a given document does not match any category, or when it falls between two categories. This is because category boundaries are almost never clear-cut.
In this paper we will cover the following topics:
• Section 2.0 introduces N-grams and N-gram-based similarity measures.
• Section 3.0 discusses text categorization using N-gram frequency statistics.
• Section 4.0 discusses testing N-gram-based text categorization on a language classification task.
• Section 5.0 discusses testing N-gram-based text categorization on a computer newsgroup classification task.
• Section 6.0 discusses some advantages of N-gram-based text categorization over other possible approaches.
• Section 7.0 gives some conclusions, and indicates directions for further work.
2.0 N-Grams
An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
bi-grams: _T, TE, EX, XT, T_
tri-grams: _TE, TEX, EXT, XT_, T__
quad-grams: _TEX, TEXT, EXT_, XT__, T___
In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
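As an illustration, the following minimal sketch (our own, not the authors' implementation) generates padded N-grams in a way consistent with the “TEXT” example above, using one leading blank and N-1 trailing blanks:

```python
def ngrams(word, n):
    """Return the contiguous N-grams of `word`, padded with blanks.

    Blanks are written as underscores, matching the paper's notation:
    one leading blank and (n - 1) trailing blanks, so a word of
    length k yields k + 1 N-grams for every n.
    """
    padded = "_" + word + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Example: the word "TEXT"
print(ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(ngrams("TEXT", 3))  # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']
print(ngrams("TEXT", 4))  # ['_TEX', 'TEXT', 'EXT_', 'XT__', 'T___']
```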
N-gram-based matching has had some success in dealing with noisy ASCII input in other problem domains, such as in interpreting postal addresses ([1] and [2]), in text retrieval ([3] and [4]), and in a wide variety of other natural language processing applications [5]. The key benefit that N-gram-based matching provides derives from its very nature: since every string is decomposed into small parts, any errors that are present tend to affect only a limited number of those parts, leaving the remainder intact. If we count the N-grams that are common to two strings, we get a measure of their similarity that is resistant to a wide variety of textual errors.
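To make the error-resistance concrete, here is a small self-contained sketch (again our own illustration, not the profile-based measure developed in Section 3.0) that counts the N-grams shared by two strings:

```python
from collections import Counter

def padded_ngrams(text, n):
    """Contiguous N-grams of `text`, padded with blanks (underscores)."""
    padded = "_" + text + "_" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_ngram_count(a, b, sizes=(2, 3, 4)):
    """Number of N-grams common to both strings, counted as a multiset."""
    profile_a, profile_b = Counter(), Counter()
    for n in sizes:
        profile_a.update(padded_ngrams(a, n))
        profile_b.update(padded_ngrams(b, n))
    return sum((profile_a & profile_b).values())

# A single transposition disturbs only the N-grams that overlap it,
# so the misspelled form still shares most N-grams with the original.
print(shared_ngram_count("CATEGORIZATION", "CATEGORIZATION"))  # perfect match
print(shared_ngram_count("CATEGORIZATION", "CATEGORIZATOIN"))  # still high
```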
3.0 Text Categorization Using N-Gram Frequency Statistics
Human languages invariably have some words which occur more frequently than others. One of the most common ways of expressing this idea has become known as Zipf’s Law [6], which we can re-state as follows:

The nth most common word in a human language text occurs with a frequency inversely proportional to n.
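Expressed as a formula (our own paraphrase of the restatement above, with f(1) denoting the frequency of the most common word):

f(n) \;\propto\; \frac{1}{n}, \qquad \text{i.e.} \qquad f(n) \approx \frac{f(1)}{n}

so the second most common word occurs roughly half as often as the first, the third roughly a third as often, and so on.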
The implication of this law is that there is always a set of words which dominates most of the other words of the language in terms of frequency of use. This is true both of words in general, and of words that are specific to a particular subject. Furthermore, there is a smooth continuum of dominance from most frequent to least. The smooth nature of the frequency curves helps us in some ways, because it implies that we do not have to worry too much about specific frequency thresholds. This same law holds, at least