An Introduction to Text Mining 5
commonly used summarization methods.
Unsupervised Learning Methods from Text Data: Unsupervised
learning methods do not require any training data, and can therefore
be applied to any text data without manual labeling effort. The two
main unsupervised learning methods commonly used in the context of
text data are clustering and topic modeling. The problem of clustering is that
of segmenting a corpus of documents into partitions, each correspond-
ing to a topical cluster. The problems of clustering and topic modeling
are closely related. In topic modeling, a probabilistic model is used
to determine a soft clustering, in which each document has a
membership probability with respect to each cluster, as opposed to
a hard segmentation of the documents. Topic models can be considered as the process
of clustering with a generative probabilistic model. Each topic can be
considered a probability distribution over words, with the representative
words having the highest probability. Each document can be expressed
as a probabilistic combination of these different topics. Thus, a topic
can be considered to be analogous to a cluster, and the membership
of a document to a cluster is probabilistic in nature. This also leads
to a more elegant cluster membership representation in cases in which
the document is known to contain distinct topics. In the case of hard
clustering, it is sometimes challenging to assign a document to a sin-
gle cluster in such cases. Furthermore, topic modeling relates elegantly
to the dimension reduction problem, where each topic provides a con-
ceptual dimension, and the documents may be represented as a linear
probabilistic combination of these different topics. Thus, topic modeling
provides an extremely general framework, which relates to both the
clustering and dimension reduction problems. In Chapter 4, we study the
problem of clustering, while topic modeling is covered in two chapters
(Chapters 5 and 8). In Chapter 5, we discuss topic modeling from the
perspective of dimension reduction since the discovered topics can serve
as a low-dimensional space representation of text data, where semantically
related words can “match” each other, which is hard to achieve
with a bag-of-words representation. In Chapter 8, topic modeling is
discussed as a general probabilistic model for text mining.
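The contrast between hard clustering and the soft, probabilistic memberships produced by a topic model can be sketched concretely. The following illustrative example (the toy corpus, cluster counts, and parameter settings are assumptions, not drawn from the text) uses scikit-learn: k-means assigns each document to exactly one cluster, while LDA yields for each document a probability distribution over topics.

```python
# Illustrative sketch: hard clustering vs. soft (probabilistic) clustering.
# The corpus and parameters are assumed for demonstration purposes only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "machine learning models classify text documents",
    "neural networks learn representations of documents",
    "stocks and bonds moved on interest rate news",
    "markets rallied as interest rates fell",
]

# Bag-of-words counts serve as the input representation.
counts = CountVectorizer().fit_transform(corpus)

# Hard clustering: each document receives exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)

# Topic modeling: each topic is a distribution over words, and each
# document is expressed as a probability distribution over topics,
# i.e., a soft cluster membership rather than a single label.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (4 documents, 2 topics)

for doc, label, probs in zip(corpus, hard_labels, doc_topic):
    print(f"cluster={label}  topic probs={probs.round(2)}  {doc[:35]}")
```

Note that each row of `doc_topic` sums to one: a document that mixes two themes keeps fractional membership in both topics, which is exactly the situation where a hard assignment to a single cluster becomes awkward.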
LSI and Dimensionality Reduction for Text Mining: The prob-
lem of dimensionality reduction is widely studied in the database liter-
ature as a method for representing the underlying data in compressed
format for indexing and retrieval [10]. A variation of dimensionality
reduction that is commonly used for text data is known as latent
semantic indexing [6]. One of the interesting characteristics of latent
semantic indexing is that it brings out the key semantic aspects of the
text data, which makes it more suitable for a variety of mining applications. For ex-