An Introduction to Text Mining 5
commonly used summarization methods.
Unsupervised Learning Methods from Text Data: Unsupervised
learning methods do not require any training data, and can therefore
be applied to any text data without manual labeling effort. The two
main unsupervised learning methods commonly used in the context of
text data are clustering and topic modeling. The problem of clustering is that
of segmenting a corpus of documents into partitions, each correspond-
ing to a topical cluster. The problems of clustering and topic modeling
are closely related. In topic modeling, a probabilistic model is used
to determine a soft clustering, in which each document has a
membership probability with respect to each cluster, as opposed to
a hard segmentation of the documents. Topic models can be considered as the process
of clustering with a generative probabilistic model. Each topic can be
considered a probability distribution over words, with the representative
words having the highest probability. Each document can be expressed
as a probabilistic combination of these different topics. Thus, a topic
can be considered to be analogous to a cluster, and the membership
of a document to a cluster is probabilistic in nature. This also leads
to a more elegant cluster membership representation in cases in which
the document is known to contain distinct topics. In the case of hard
clustering, it is sometimes challenging to assign a document to a sin-
gle cluster in such cases. Furthermore, topic modeling relates elegantly
to the dimension reduction problem, where each topic provides a con-
ceptual dimension, and the documents may be represented as a linear
probabilistic combination of these different topics. Thus, topic modeling
provides an extremely general framework, which relates to both the
clustering and dimension reduction problems. In Chapter 4, we study the
problem of clustering, while topic modeling is covered in two chapters
(Chapters 5 and 8). In Chapter 5, we discuss topic modeling from the
perspective of dimension reduction since the discovered topics can serve
as a low-dimensional space representation of text data, where semantically
related words can “match” each other, which is hard to achieve
with a bag-of-words representation. In Chapter 8, topic modeling is
discussed as a general probabilistic model for text mining.
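The contrast between hard clustering and the soft, probabilistic memberships produced by a topic model can be sketched concretely. The following illustrative example (the toy corpus, cluster counts, and parameter settings are assumptions, not drawn from the text) uses scikit-learn: k-means assigns each document to exactly one cluster, while LDA yields for each document a probability distribution over topics.

```python
# Illustrative sketch: hard clustering vs. soft (probabilistic) clustering.
# The corpus and parameters are assumed for demonstration purposes only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "machine learning models classify text documents",
    "neural networks learn representations of documents",
    "stocks and bonds moved on interest rate news",
    "markets rallied as interest rates fell",
]

# Bag-of-words counts serve as the input representation.
counts = CountVectorizer().fit_transform(corpus)

# Hard clustering: each document receives exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)

# Topic modeling: each topic is a distribution over words, and each
# document is expressed as a probability distribution over topics,
# i.e., a soft cluster membership rather than a single label.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # shape: (4 documents, 2 topics)

for doc, label, probs in zip(corpus, hard_labels, doc_topic):
    print(f"cluster={label}  topic probs={probs.round(2)}  {doc[:35]}")
```

Note that each row of `doc_topic` sums to one: a document that mixes two themes keeps fractional membership in both topics, which is exactly the situation where a hard assignment to a single cluster becomes awkward.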
LSI and Dimensionality Reduction for Text Mining: The prob-
lem of dimensionality reduction is widely studied in the database liter-
ature as a method for representing the underlying data in compressed
format for indexing and retrieval [10]. A variation of dimensionality
reduction that is commonly used for text data is known as latent
semantic indexing [6]. One of the interesting characteristics of latent
semantic indexing is that it brings out the key semantic aspects of the
text data, which makes it more suitable for a variety of mining applications. For ex-