4 1 Overview of Text Mining
Many variants of this document and word representation could be explored, but
this is the fundamental concept, where words are attributes and documents are exam-
ples, and together these form a sample of data that can feed our well-known learning
methods. Many machine-learning methods perform accurately with this transforma-
tion, working with far larger amounts of data than humans could hope to process.
These programs have little knowledge of meaning or grammar. They are statistical
methods that lack prior knowledge. They counterbalance that deficiency with mas-
sive processing of data, finding patterns in word combinations that are recurring and
predictive.
The spreadsheet model of data returns us to the familiar territory of classical
data-mining methods. Nevertheless, we would be foolish to rush to apply learning
methods in their original form without taking advantage of the special characteristics
of text. The spreadsheet remains the conceptual model, but using it naively would be
impractical, inefficient, or even ineffective until we understand some of text's
important differences from classical numerical data.
Consider a collection of documents. The set of attributes will be the total set of
unique words in the collection. We call this set of words a dictionary. The examples
are the individual documents. We compose a spreadsheet and fill in the cells with
a one for the presence of a word and a zero for its absence. An application might
have many thousands or even millions of documents. The dictionary will converge
to a number of words smaller than the number of documents but can readily reach
several hundred thousand. Specialized documents, such as repair manuals with part
numbers that are alphanumeric, may lead to very large dictionaries. It appears that
the spreadsheet model is too unwieldy to be practical.
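The construction just described can be sketched in a few lines. This is a minimal illustration, not an implementation from the text; the helper names and the three sample documents are invented for the example.

```python
def build_dictionary(documents):
    """Collect the set of unique words across the collection (the 'dictionary')."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_spreadsheet(documents, dictionary):
    """One row per document: 1 if the word occurs in it, 0 otherwise."""
    return [[1 if word in set(doc.lower().split()) else 0 for word in dictionary]
            for doc in documents]

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
vocab = build_dictionary(docs)
matrix = to_spreadsheet(docs, vocab)
print(vocab)   # ['and', 'barked', 'cat', 'dog', 'sat', 'the']
print(matrix)  # [[0, 0, 1, 0, 1, 1], [0, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1]]
```

Even in this toy collection, most cells are zero; with realistic documents and a dictionary of hundreds of thousands of words, the full matrix would indeed be unwieldy.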
Viewing the spreadsheet more closely, we see almost all zeros. Unless individ-
ual documents are surprisingly lengthy, almost book length, the matrix is sparse: any
individual document will use only a tiny subset of the potential set of words in a dic-
tionary. Because of that special characteristic, the spreadsheet remains a reasonable
conceptual model of data. Methods that process text will expect sparse spreadsheets
and will leverage that property in their implementations to store only positive cell
values.
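One common way to store only the positive cells, sketched here under the assumption that each document is kept as the set of dictionary indices for the words it contains (the names and sample data are illustrative):

```python
def to_sparse(documents, dictionary):
    """Represent each document by the set of indices of the words it contains."""
    index = {word: i for i, word in enumerate(dictionary)}
    return [{index[w] for w in doc.lower().split() if w in index}
            for doc in documents]

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
vocab = ['and', 'barked', 'cat', 'dog', 'sat', 'the']
sparse = to_sparse(docs, vocab)
print(sparse)  # [{2, 4, 5}, {1, 3, 5}, {0, 2, 3, 5}]
```

Storage now grows with the number of words actually used, not with the product of the number of documents and the dictionary size.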
Sparseness is not the only representational difference. All the values in a text-
mining spreadsheet are positive. Classical data-mining methods will consider all
values of an attribute, both positive and negative. The decision criteria could readily
say “if word x has value zero, then conclude class y.” In contrast, text-mining meth-
ods mostly concentrate on positive matches, not worrying whether other words are
absent from a document. This view also leads to great simplifications in processing,
often allowing text-mining programs to operate in what would be considered huge
dimensions for regular data-mining applications.
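The simplification that comes from matching only on positive occurrences can be seen in a small sketch: a decision rule fires when all of its words are present in a document, and the vast majority of dictionary words, being absent from both the rule and the document, never need to be examined. The rule contents and sample document below are invented for illustration.

```python
def rule_matches(rule_words, doc_words):
    """True when every rule word occurs in the document (set containment)."""
    return rule_words <= doc_words

doc = set("replace the pump gasket before restarting the pump".split())
print(rule_matches({"pump", "gasket"}, doc))  # True
print(rule_matches({"pump", "valve"}, doc))   # False
```

With the sparse set representation, the cost of a match depends on the size of the rule, not the size of the dictionary, which is what lets text-mining programs operate comfortably in very high dimensions.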
If we focus on positive occurrences of words, we also have a solution to one of
the bêtes noires of applying data-mining methods: missing values. The spreadsheet
model for data has a cell for each measurable value in an example. Most methods
expect the cell to have a value. In practical applications, such as when we extract
information from a real-world database, a great deal of information is missing, and
the cell remains empty. An empty cell is not the same as saying that the answer is a