0521836573c01 CB1028/Feldman 0 521 83657 3 September 25, 2006 20:59
I.1 Defining Text Mining
computational linguistics research that transform raw, unstructured, original-format
content (such as that which can be downloaded from PubMed) into a carefully
structured, intermediate data format. Knowledge discovery operations, in turn, are
run against this specially structured intermediate representation of the original
document collection.
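As a rough illustration of this two-stage idea, the sketch below first converts raw text into a simple structured record and then runs a discovery-style operation over the records rather than over the raw text. The bag-of-words record and the names `preprocess` and `document_frequency` are assumptions chosen for brevity; the text does not prescribe a particular intermediate format, and the sample strings are made up.

```python
import re
from collections import Counter


def preprocess(raw_text):
    """Turn raw text into a structured record: a count of lowercase words."""
    tokens = re.findall(r"[a-z]+", raw_text.lower())
    return Counter(tokens)


def document_frequency(records):
    """For each term, count how many records (documents) contain it."""
    df = Counter()
    for record in records:
        df.update(record.keys())  # each document contributes at most 1 per term
    return df


# Hypothetical raw documents, standing in for downloaded abstracts.
raw_docs = [
    "Gene expression in yeast.",
    "Yeast gene regulation and expression.",
    "Protein folding in yeast.",
]

# Stage 1: preprocessing into the intermediate representation.
records = [preprocess(doc) for doc in raw_docs]

# Stage 2: an operation that only ever sees the intermediate representation.
df = document_frequency(records)
```

The point of the design is the separation: `document_frequency` never touches the original strings, so the preprocessing step could be replaced by something far richer without changing the downstream operation.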
The Document
Another basic element in text mining is the document. For practical purposes, a
document can be very informally defined as a unit of discrete textual data within a
collection that usually, but not necessarily, correlates with some real-world document
such as a business report, legal memorandum, e-mail, research paper, manuscript,
article, press release, or news story. Although it is not typical, a document can be
defined a little less arbitrarily within the context of a particular document collection
by describing a prototypical document based on its representation of a similar class
of entities within that collection.
One should not, however, infer from this that a given document necessarily exists
only within the context of one particular collection. A document can (and generally
does) exist in any number or type of collection – from the very formally organized
to the very ad hoc. It can also belong to several document collections, or to several
subsets of the same document collection, at the same time. For example, a document
relating to Microsoft’s antitrust litigation could exist in completely different
document collections oriented toward current affairs, legal affairs, antitrust-related
legal affairs, and software company news.
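The many-to-many relationship between documents and collections can be sketched as a pair of lookup tables. This is only an illustrative data structure, not one taken from the text; the class name `Corpus`, its methods, and the document identifiers are hypothetical.

```python
from collections import defaultdict


class Corpus:
    """Tracks which documents belong to which named collections."""

    def __init__(self):
        self._members = defaultdict(set)      # collection name -> document ids
        self._collections = defaultdict(set)  # document id -> collection names

    def add(self, doc_id, *collection_names):
        """Register one document as a member of one or more collections."""
        for name in collection_names:
            self._members[name].add(doc_id)
            self._collections[doc_id].add(name)

    def collections_of(self, doc_id):
        """Return the sorted names of every collection holding doc_id."""
        return sorted(self._collections.get(doc_id, ()))

    def documents_in(self, name):
        """Return the sorted ids of every document in the named collection."""
        return sorted(self._members.get(name, ()))


corpus = Corpus()
# Hypothetical document id for the antitrust example in the text.
corpus.add(
    "ms-antitrust-ruling",
    "current affairs",
    "legal affairs",
    "antitrust-related legal affairs",
    "software company news",
)
corpus.add("press-release-42", "software company news")
```

Here the same document is stored once but is reachable through four collections simultaneously, and `documents_in("software company news")` returns both identifiers.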
“Weakly Structured” and “Semistructured” Documents
Despite the somewhat misleading label that it bears as unstructured data, a text
document may be seen, from many perspectives, as a structured object. From a
linguistic perspective, even a rather innocuous document exhibits a great deal of
semantic and syntactic structure, although this structure is implicit and to some
degree hidden in its textual content. In addition, typographical elements such as
punctuation marks, capitalization, numerals, and special characters – particularly
when coupled with layout artifacts such as white space, carriage returns, underlining,
asterisks, tables, columns, and so on – can often serve as a kind of “soft markup”
language, providing clues that help identify important document subcomponents such
as paragraphs, titles, publication dates, author names, table records, headers, and
footnotes. Word sequence may also be a structurally meaningful dimension of a
document. At the other end of the “unstructured” spectrum, some text documents,
like those generated by a WYSIWYG HTML editor, possess from their inception more
overt types of embedded metadata in the form of formalized markup tags.
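To make the “soft markup” idea concrete, the following sketch uses blank lines, capitalization, and a date pattern as cues for labeling blocks of plain text. The heuristics, thresholds, and function names are assumptions made for illustration only; real documents would need far more robust rules, and the sample text is invented.

```python
import re

# Matches dates written like "September 25, 2006" (an assumed format).
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def split_paragraphs(text):
    """Use blank lines (a layout cue) to split text into blocks."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def looks_like_title(block):
    """Guess that a short, single-line, capitalized block is a title."""
    lines = block.splitlines()
    if len(lines) != 1:
        return False
    line = lines[0].strip()
    if line.endswith((".", "?", "!", ":")):
        return False
    words = [w for w in line.split() if w[0].isalpha()]
    long_words = [w for w in words if len(w) > 3]
    return bool(words) and len(words) <= 12 and all(w[0].isupper() for w in long_words)


def tag_blocks(text):
    """Label each block as 'date', 'title', or 'paragraph'."""
    tagged = []
    for block in split_paragraphs(text):
        # Check dates first, since a bare date line would also pass the title test.
        if DATE_RE.search(block) and len(block.split()) <= 6:
            label = "date"
        elif looks_like_title(block):
            label = "title"
        else:
            label = "paragraph"
        tagged.append((label, block))
    return tagged


sample = """Defining Text Mining

September 25, 2006

Text mining seeks to extract useful information from document collections.
It does so by identifying patterns in unstructured text."""

labels = [label for label, _ in tag_blocks(sample)]
```

None of these cues is explicit markup, yet together they are enough to separate the title, the date, and the body paragraph in the sample.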
Documents that have relatively little in the way of strong typographical, layout, or
markup indicators to denote structure – like most scientific research papers, business
reports, legal memoranda, and news stories – are sometimes referred to as free-format
or weakly structured documents. On the other hand, documents with extensive
and consistent format elements in which field-type metadata can be more easily
inferred – such as some e-mail, HTML Web pages, PDF files, and word-processing