4 1 Overview of Text Mining
Many variants of this document and word representation could be explored, but
this is the fundamental concept, where words are attributes and documents are exam-
ples, and together these form a sample of data that can feed our well-known learning
methods. Many machine-learning methods perform accurately with this transforma-
tion, working with far larger amounts of data than humans could hope to process.
These programs have little knowledge of meaning or grammar. They are statistical
methods that lack prior knowledge. They counterbalance that deficiency with mas-
sive processing of data, finding patterns in word combinations that are recurring and
predictive.
The spreadsheet model of data returns us to the familiar territory of classical
data-mining methods. Nevertheless, we would be foolish to rush to apply learning
methods in their original form without taking advantage of the special characteristics
of text. The spreadsheet remains the conceptual model, but using it naively would be
impractical, inefficient, or even ineffective until we understand some of text's
important differences from classical numerical data.
Consider a collection of documents. The set of attributes will be the total set of
unique words in the collection. We call this set of words a dictionary. The examples
are the individual documents. We compose a spreadsheet and fill in the cells with
a one for the presence of a word and a zero for its absence. An application might
have many thousands or even millions of documents. The dictionary will converge
to a number of words smaller than the number of documents but can readily reach
several hundred thousand. Specialized documents, such as repair manuals with part
numbers that are alphanumeric, may lead to very large dictionaries. It appears that
the spreadsheet model is too unwieldy to be practical.
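The construction just described can be sketched in a few lines. This is a minimal illustration, not an implementation from the text; the helper names and the three sample documents are invented for the example.

```python
def build_dictionary(documents):
    """Collect the set of unique words across the collection (the 'dictionary')."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_spreadsheet(documents, dictionary):
    """One row per document: 1 if the word occurs in it, 0 otherwise."""
    return [[1 if word in set(doc.lower().split()) else 0 for word in dictionary]
            for doc in documents]

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
vocab = build_dictionary(docs)
matrix = to_spreadsheet(docs, vocab)
print(vocab)   # ['and', 'barked', 'cat', 'dog', 'sat', 'the']
print(matrix)  # [[0, 0, 1, 0, 1, 1], [0, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 1]]
```

Even in this toy collection, most cells are zero; with realistic documents and a dictionary of hundreds of thousands of words, the full matrix would indeed be unwieldy.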
Viewing the spreadsheet more closely, we see almost all zeros. Unless individ-
ual documents are surprisingly lengthy, almost book length, the matrix is sparse: any
individual document will use only a tiny subset of the potential set of words in a dic-
tionary. Because of that special characteristic, the spreadsheet remains a reasonable
conceptual model of data. Methods that process text will expect sparse spreadsheets
and will leverage that property in their implementations to store only positive cell
values.
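One common way to store only the positive cells, sketched here under the assumption that each document is kept as the set of dictionary indices for the words it contains (the names and sample data are illustrative):

```python
def to_sparse(documents, dictionary):
    """Represent each document by the set of indices of the words it contains."""
    index = {word: i for i, word in enumerate(dictionary)}
    return [{index[w] for w in doc.lower().split() if w in index}
            for doc in documents]

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
vocab = ['and', 'barked', 'cat', 'dog', 'sat', 'the']
sparse = to_sparse(docs, vocab)
print(sparse)  # [{2, 4, 5}, {1, 3, 5}, {0, 2, 3, 5}]
```

Storage now grows with the number of words actually used, not with the product of the number of documents and the dictionary size.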
Sparseness is not the only representational difference. All the values in a text-
mining spreadsheet are positive. Classical data-mining methods will consider all
values of an attribute, both positive and negative. The decision criteria could readily
say “if word x has value zero, then conclude class y.” In contrast, text-mining meth-
ods mostly concentrate on positive matches, not worrying whether other words are
absent from a document. This view also leads to great simplifications in processing,
often allowing text-mining programs to operate in what would be considered huge
dimensions for regular data-mining applications.
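The simplification that comes from matching only on positive occurrences can be seen in a small sketch: a decision rule fires when all of its words are present in a document, and the vast majority of dictionary words, being absent from both the rule and the document, never need to be examined. The rule contents and sample document below are invented for illustration.

```python
def rule_matches(rule_words, doc_words):
    """True when every rule word occurs in the document (set containment)."""
    return rule_words <= doc_words

doc = set("replace the pump gasket before restarting the pump".split())
print(rule_matches({"pump", "gasket"}, doc))  # True
print(rule_matches({"pump", "valve"}, doc))   # False
```

With the sparse set representation, the cost of a match depends on the size of the rule, not the size of the dictionary, which is what lets text-mining programs operate comfortably in very high dimensions.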
If we focus on positive occurrences of words, we also have a solution to one of
the bêtes noires of applying data-mining methods: missing values. The spreadsheet
model for data has a cell for each measurable value in an example. Most methods
expect the cell to have a value. In practical applications, such as when we extract
information from a real-world database, a great deal of information is missing, and
the cell remains empty. An empty cell is not the same as saying that the answer is a