Information Crisis Solver: Text Mining and Link Detection Techniques

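PDF, 5.97 MB, updated 2024-07-31

"Text Mining (English edition) is a classic, in-depth work on text mining. It addresses the problem of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. The book also covers link detection, a rapidly developing text analysis method that builds on text mining by constructing networks of relationships among objects in order to discover patterns and trends."

Text mining is a new and engaging research area in computer science that responds to the information overload brought on by the information explosion. It integrates techniques from several fields: data mining (extracting valuable information from large volumes of data), machine learning (methods by which computers improve their performance through learning and experience), natural language processing, or NLP (understanding, parsing, and generating human language), information retrieval (searching for and obtaining needed information), and knowledge management (strategies for organizing, storing, and disseminating knowledge). Together these give text mining effective tools for handling massive amounts of text.

Link detection is a branch of text mining that plays an important role in text analysis. It focuses on extracting sparse evidence from large numbers of data sources and connecting the pieces of evidence into a network of relationships, which can reveal hidden patterns and trends. Its main tasks are extracting entities and relations from data, discovering connections between entities, integrating those connections, and assessing the importance of the associated evidence. Link detection also includes learning patterns that guide subsequent entity extraction, discovery, and linking.

The Text Mining Handbook describes the latest advances in text mining and link detection in detail. It analyzes the core concepts and techniques of text mining and discusses both the theory and the practice of link detection. Readers learn how to apply these techniques to mine the latent value in text data, make better use of ever-growing text resources, and uncover hidden knowledge and insights.

The book ranges from basic text preprocessing, such as stemming and building stop-word lists, to advanced topic modeling and sentiment analysis. It also discusses network analysis for link detection, such as community detection and centrality measures, which help users understand and expose complex structure in text data. In addition, it introduces practical algorithms and tools along with case studies of their use, to help readers turn theory into practice.

The Text Mining Handbook is an important resource for understanding and mastering text mining and link detection, and a valuable reference for readers who want to pursue research in the field or apply these techniques at work. Studying it can improve a reader's ability to obtain, understand, and use information in an overloaded environment, with applications in business decision making, market analysis, social research, and other areas.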

P1: JZZ

0521836573c01 CB1028/Feldman 0 521 83657 3 September 25, 2006 20:59

I.1 Defining Text Mining

computational linguistics research that transform raw, unstructured, original-format content (like that which can be downloaded from PubMed) into a carefully structured, intermediate data format. Knowledge discovery operations, in turn, are operated against this specially structured intermediate representation of the original document collection.

The Document

Another basic element in text mining is the document. For practical purposes, a document can be very informally defined as a unit of discrete textual data within a collection that usually, but not necessarily, correlates with some real-world document such as a business report, legal memorandum, e-mail, research paper, manuscript, article, press release, or news story. Although it is not typical, a document can be defined a little less arbitrarily within the context of a particular document collection by describing a prototypical document based on its representation of a similar class of entities within that collection.

One should not, however, infer from this that a given document necessarily exists only within the context of one particular collection. It is important to recognize that a document can (and generally does) exist in any number or type of collections – from the very formally organized to the very ad hoc. A document can also be a member of different document collections, or different subsets of the same document collection, and can exist in these different collections at the same time. For example, a document relating to Microsoft's antitrust litigation could exist in completely different document collections oriented toward current affairs, legal affairs, antitrust-related legal affairs, and software company news.

“Weakly Structured” and “Semistructured” Documents

Despite the somewhat misleading label that it bears as unstructured data, a text document may be seen, from many perspectives, as a structured object. From a linguistic perspective, even a rather innocuous document demonstrates a rich amount of semantic and syntactical structure, although this structure is implicit and to some degree hidden in its textual content. In addition, typographical elements such as punctuation marks, capitalization, numerics, and special characters – particularly when coupled with layout artifacts such as white spacing, carriage returns, underlining, asterisks, tables, columns, and so on – can often serve as a kind of "soft markup" language, providing clues to help identify important document subcomponents such as paragraphs, titles, publication dates, author names, table records, headers, and footnotes. Word sequence may also be a structurally meaningful dimension to a document. At the other end of the "unstructured" spectrum, some text documents, like those generated from a WYSIWYG HTML editor, actually possess from their inception more overt types of embedded metadata in the form of formalized markup tags.

Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure – like most scientific research papers, business reports, legal memoranda, and news stories – are sometimes referred to as free-format or weakly structured documents. On the other hand, documents with extensive and consistent format elements in which field-type metadata can be more easily inferred – such as some e-mail, HTML Web pages, PDF files, and word-processing files with heavy document templating or style-sheet constraints – are occasionally described as semistructured documents.

I.1.2 Document Features

The preprocessing operations that support text mining attempt to leverage many different elements contained in a natural language document in order to transform it from an irregular and implicitly structured representation into an explicitly structured representation. However, given the potentially large number of words, phrases, sentences, typographical elements, and layout artifacts that even a short document may have – not to mention the potentially vast number of different senses that each of these elements may have in various contexts and combinations – an essential task for most text mining systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole. We refer to such a set of features as the representational model of a document and say that individual documents are represented by the set of features that their representational models contain.

Even with attempts to develop efficient representational models, each document in a collection is usually made up of a large number – sometimes an exceedingly large number – of features. The large number of features required to represent documents in a collection affects almost every aspect of a text mining system's approach, design, and performance.

Problems relating to high feature dimensionality (i.e., the size and scale of possible combinations of feature values for data) are typically of much greater magnitude in text mining systems than in classic data mining systems. Structured representations of natural language documents have much larger numbers of potentially representative features – and thus higher numbers of possible combinations of feature values – than one generally finds with records in relational or hierarchical databases.

For even the most modest document collections, the number of word-level features required to represent the documents in these collections can be exceedingly large. For example, in an extremely small collection of 15,000 documents culled from Reuters news feeds, more than 25,000 nontrivial word stems could be identified. Even when one works with more optimized feature types, tens of thousands of concept-level features may still be relevant for a single application domain. The number of attributes in a relational database that are analyzed in a data mining task is usually significantly smaller.

The high dimensionality of potentially representative features in document collections is a driving factor in the development of text mining preprocessing operations aimed at creating more streamlined representational models. This high dimensionality also indirectly contributes to other conditions that separate text mining systems from data mining systems, such as greater levels of pattern overabundance and more acute requirements for postquery refinement techniques.

Another characteristic of natural language documents is what might be described as feature sparsity. Only a small percentage of all possible features for a document collection as a whole appears in any single document, and thus when a document is represented as a binary vector of features, nearly all values of the vector are zero. The tuple dimension is also sparse. That is, some features often appear in only a few documents, which means that the support of many patterns is quite low.
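The two kinds of sparsity described above can be made concrete with a small sketch. The four-document collection below is invented purely for illustration, and the names `binary_vector`, `zero_fraction`, and `support` are hypothetical helpers, not terminology or code from the book. The sketch builds one binary feature vector per document and then measures how many entries are zero and how many features occur in only one document.

```python
# Toy collection, invented for illustration only.
docs = [
    "stocks fell as oil prices rose",
    "oil exporters cut output",
    "the court heard the antitrust case",
    "software company news and earnings",
]

# Feature space: every distinct word that occurs anywhere in the collection.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def binary_vector(doc, vocabulary):
    """Return a 0/1 vector marking which vocabulary words occur in doc."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocabulary]


vectors = [binary_vector(doc, vocabulary) for doc in docs]

# Feature sparsity: fraction of zero entries in each document vector.
zero_fraction = [vector.count(0) / len(vector) for vector in vectors]

# Sparsity along the tuple dimension: the support (document frequency)
# of each feature, and the features that occur in exactly one document.
support = {
    word: sum(vector[i] for vector in vectors)
    for i, word in enumerate(vocabulary)
}
rare = [word for word, count in support.items() if count == 1]

print(len(vocabulary), [round(z, 2) for z in zero_fraction], len(rare))
```

Even in this tiny collection most vector entries are zero and all but one feature has a support of one document. Real collections show the same effect at a much larger scale, which is why sparse storage formats are a common design choice for document vectors.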

Commonly Used Document Features: Characters, Words, Terms, and Concepts

Because text mining algorithms operate on the feature-based representations of documents and not the underlying documents themselves, there is often a trade-off between two important goals. The first goal is to achieve the correct calibration of the volume and semantic level of features to portray the meaning of a document accurately, which tends to incline text mining preprocessing operations toward selecting or extracting relatively more features to represent documents. The second goal is to identify features in a way that is most computationally efficient and practical for pattern discovery, which is a process that emphasizes the streamlining of representative feature sets; such streamlining is sometimes supported by the validation, normalization, or cross-referencing of features against controlled vocabularies or external knowledge sources such as dictionaries, thesauri, ontologies, or knowledge bases to assist in generating smaller representative sets of more semantically rich features.

Although many potential features can be employed to represent documents, the following four types are most commonly used:

- Characters. The individual component-level letters, numerals, special characters and spaces are the building blocks of higher-level semantic features such as words, terms, and concepts. A character-level representation can include the full set of all characters for a document or some filtered subset. Character-based representations without positional information (i.e., bag-of-characters approaches) are often of very limited utility in text mining applications. Character-based representations that include some level of positional information (e.g., bigrams or trigrams) are somewhat more useful and common. In general, however, character-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized. On the other hand, this feature space can in many ways be viewed as the most complete of any representation of a real-world text document.

- Words. Specific words selected directly from a "native" document are at what might be described as the basic level of semantic richness. For this reason, word-level features are sometimes referred to as existing in the native feature space of a document. In general, a single word-level feature should equate with, or have the value of, no more than one linguistic token. Phrases, multiword expressions, or even multiword hyphenates would not constitute single word-level features. It is possible for a word-level representation of a document to include a feature for each word within that document – that is the "full text," where a document is represented by a complete and unabridged set of its word-level features. This can lead to some word-level representations of document collections having tens or even hundreds of thousands of unique words in their feature space. However, most word-level document representations exhibit at least some minimal optimization and therefore consist of subsets of representative features filtered for items such as stop words, symbolic characters, and meaningless numerics.

- Terms. Terms are single words and multiword phrases selected directly from the corpus of a native document by means of term-extraction methodologies. Term-level features, in the sense of this definition, can only be made up of specific words and expressions found within the native document for which they are meant to be generally representative. Hence, a term-based representation of a document is necessarily composed of a subset of the terms in that document. For example, if a document contained the sentence

      President Abraham Lincoln experienced a career that took him from log cabin to White House,

  a list of terms to represent the document could include single word forms such as "Lincoln," "took," "career," and "cabin" as well as multiword forms like "President Abraham Lincoln," "log cabin," and "White House."

  Several term-extraction methodologies can convert the raw text of a native document into a series of normalized terms – that is, sequences of one or more tokenized and lemmatized word forms associated with part-of-speech tags. Sometimes an external lexicon is also used to provide a controlled vocabulary for term normalization. Term-extraction methodologies employ various approaches for generating and filtering an abbreviated list of most meaningful candidate terms from among a set of normalized terms for the representation of a document. This culling process results in a smaller but relatively more semantically rich document representation than that found in word-level document representations.

- Concepts. Concepts are features generated for a document by means of manual, statistical, rule-based, or hybrid categorization methodologies. Concept-level features can be manually generated for documents but are now more commonly extracted from documents using complex preprocessing routines that identify single words, multiword expressions, whole clauses, or even larger syntactical units that are then related to specific concept identifiers. For instance, a document collection that includes reviews of sports cars may not actually include the specific word "automotive" or the specific phrase "test drives," but the concepts "automotive" and "test drives" might nevertheless be found among the set of concepts used to identify and represent the collection.

  Many categorization methodologies involve a degree of cross-referencing against an external knowledge source; for some statistical methods, this source might simply be an annotated collection of training documents. For manual and rule-based categorization methods, the cross-referencing and validation of prospective concept-level features typically involve interaction with a "gold standard" such as a preexisting domain ontology, lexicon, or formal concept hierarchy – or even just the mind of a human domain expert. Unlike word- and term-level features, concept-level features can consist of words not specifically found in the native document.

Footnote: Beyond the three feature types discussed and defined here – namely, words, terms, and concepts – other features that have been used for representing documents include linguistic phrases, nonconsecutive phrases, keyphrases, character bigrams, character trigrams, frames, and parse trees.

Footnote: Although some computer scientists make distinctions between keywords and concepts (e.g., Blake and Pratt 2001), this book recognizes the two as relatively interchangeable labels for the same feature type and will generally refer to either under the label concept.
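As a rough illustration of how the first three feature types differ, the sketch below derives character trigrams, filtered word-level features, and a few multiword terms from the Lincoln sentence used above. Everything here is an assumption made for the example: the stop-word list is invented, and `capitalized_terms` is a deliberately naive stand-in that only finds runs of capitalized words. It is not one of the term-extraction methodologies the text refers to, which rely on tokenization, lemmatization, and part-of-speech tags, and it would miss a term such as "log cabin".

```python
import re

sentence = ("President Abraham Lincoln experienced a career that took him "
            "from log cabin to White House")


def char_ngrams(text, n):
    """Character-level features that keep local positional information."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Illustrative stop-word list; real systems use much longer lists.
STOP_WORDS = {"a", "that", "him", "from", "to"}


def word_features(text):
    """Word-level features: one token per feature, stop words filtered out."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def capitalized_terms(text):
    """Naive multiword terms: runs of two or more capitalized words."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", text)


print(char_ngrams("log cabin", 3))
# ['log', 'og ', 'g c', ' ca', 'cab', 'abi', 'bin']
print(word_features(sentence))
print(capitalized_terms(sentence))
# ['President Abraham Lincoln', 'White House']
```

The character trigrams outnumber the words they come from, the word list is shorter but still splits "White House" in two, and the term list is the shortest and keeps multiword units intact, which mirrors the progression from unoptimized to more condensed feature spaces described in the list.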

Of the four types of features described here, terms and concepts reflect the features with the most condensed and expressive levels of semantic value, and there are many advantages to their use in representing documents for text mining purposes. With regard to the overall size of their feature sets, term- and concept-based representations exhibit roughly the same efficiency but are generally much more efficient than character- or word-based document models. Term-level representations can sometimes be more easily and automatically generated from the original source text (through various term-extraction techniques) than concept-level representations, which as a practical matter have often entailed some level of human interaction.

Concept-level representations, however, are much better than any other feature-set representation at handling synonymy and polysemy and are clearly best at relating a given feature to its various hyponyms and hypernyms. Concept-based representations can be processed to support very sophisticated concept hierarchies, and arguably provide the best representations for leveraging the domain knowledge afforded by ontologies and knowledge bases.
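A minimal sketch can show why concept-level features cope with synonymy where word-level features do not. The lexicon below is hypothetical and hand-written for the example, standing in for the ontology or knowledge base mentioned above. The concept labels reuse the "automotive" and "test drives" example from the feature list, and `concept_features` is an invented helper name. The sketch does not address polysemy, which would need context to choose among several candidate concepts.

```python
# Hypothetical lexicon mapping surface words to concept identifiers.
CONCEPT_LEXICON = {
    "car": "automotive",
    "cars": "automotive",
    "roadster": "automotive",
    "coupe": "automotive",
    "handling": "test drives",
    "acceleration": "test drives",
}


def concept_features(text, lexicon=CONCEPT_LEXICON):
    """Return the set of concept identifiers triggered by words in text."""
    words = text.lower().split()
    return {lexicon[word] for word in words if word in lexicon}


review_a = "the roadster showed crisp handling"
review_b = "this coupe has strong acceleration"

# The two reviews share no words at all ...
shared_words = set(review_a.split()) & set(review_b.split())

# ... yet both map to the same pair of concepts.
print(shared_words)
print(concept_features(review_a))
print(concept_features(review_b))
```

Neither review contains the word "automotive" or the phrase "test drives", yet both receive those concept features, so a pattern defined over concepts would match both documents while a word-level comparison would find nothing in common.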

Still, concept-level representations do have a few potential drawbacks. Possible disadvantages of using concept-level features to represent documents include (a) the relative complexity of applying the heuristics, during preprocessing operations, required to extract and validate concept-type features and (b) the domain-dependence of many concepts.

Concept-level document representations generated by categorization are often stored in vector formats. For instance, both CDM-based methodologies and Los Alamos II–type concept extraction approaches result in individual documents being stored as vectors.

Hybrid approaches to the generation of feature-based document representations can exist. By way of example, a particular text mining system's preprocessing operations could first extract terms using term extraction techniques and then match or normalize these terms, or do both, by winnowing them against a list of meaningful entities and topics (i.e., concepts) extracted through categorization. Such hybrid approaches, however, need careful planning, testing, and optimization to avoid having dramatic – and extremely resource-intensive – growth in the feature dimensionality of individual document representations without proportionately increased levels of system effectiveness.
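The two-stage hybrid flow described above might be sketched as follows. This is only an illustration under stated assumptions: `extract_candidate_terms` is a simplistic capitalization-based stand-in for a real term-extraction step, and `KNOWN_CONCEPTS` is a hypothetical hand-made list playing the role of the entities and topics that a categorization step would supply.

```python
import re

# Hypothetical concept list: lowercased term variant -> normalized concept.
KNOWN_CONCEPTS = {
    "abraham lincoln": "Abraham Lincoln",
    "president abraham lincoln": "Abraham Lincoln",
    "white house": "White House",
}


def extract_candidate_terms(text):
    """Stage 1 (naive stand-in): runs of one or more capitalized words."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def winnow(terms, concepts=KNOWN_CONCEPTS):
    """Stage 2: keep only terms that match a known concept, normalized
    to the concept label and de-duplicated in order of first appearance."""
    kept = []
    for term in terms:
        concept = concepts.get(term.lower())
        if concept is not None and concept not in kept:
            kept.append(concept)
    return kept


text = ("President Abraham Lincoln moved into the White House. "
        "Abraham Lincoln kept a diary.")

candidates = extract_candidate_terms(text)
features = winnow(candidates)
print(candidates)
# ['President Abraham Lincoln', 'White House', 'Abraham Lincoln']
print(features)
# ['Abraham Lincoln', 'White House']
```

Because the winnowing step both filters and normalizes, the final feature set is smaller than the candidate list here. The dimensionality growth the text warns about arises when a system instead keeps the terms and adds the concepts on top of them.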

For the most part, this book concentrates on text mining solutions that rely on documents represented by concept-level features, referring to other feature types where necessary to highlight idiosyncratic characteristics or techniques. Nevertheless, many of the approaches described in this chapter for identifying and browsing patterns within document collections based on concept-level representations can also

Footnote: It should at least be mentioned that there are some more distinct disadvantages to using manually generated concept-level representations. For instance, manually generated concepts are fixed, labor-intensive to assign, and so on. See Blake and Pratt (2001).
