Proceedings of NIDC2016
A FAST AND SCALABLE SUPERVISED TOPIC MODEL
USING STOCHASTIC VARIATIONAL INFERENCE AND
MAPREDUCE
Wenzhuo Song 1,2, Bo Yang 1,2, Xuehua Zhao 2,3, Fei Li 4
1 Jilin University, Changchun 130012, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
3 School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China
4 George Mason University, Fairfax, VA 22030, USA
songwzup@foxmail.com, ybo@jlu.edu.cn, lcrlc@sina.com, fli4@gmu.edu
Abstract: Text analysis is an important and widespread task in cloud computing, and topic models are a popular and effective technology for it. Among topic models, sLDA is acknowledged as a popular supervised model: it associates a response variable or category label with each document, so that the model uncovers the latent structure of a text dataset while retaining predictive power for supervised tasks. However, sLDA must process all the documents at every training iteration, so when the dataset grows beyond the volume a single node can handle, sLDA is no longer practical. In this paper we propose a novel model named Mr.sLDA, which extends sLDA with stochastic variational inference (SVI) and MapReduce. SVI reduces the computational burden of sLDA, and MapReduce parallelizes the algorithm. Mr.sLDA makes training more efficient, and the training method can be easily deployed on a large computer cluster or in a cloud environment. Empirical results show that our approach trains efficiently and achieves accuracy comparable to sLDA.
Keywords: Text classification; MapReduce;
Variational inference; Topic modeling
1 Introduction
Within the large body of research in artificial intelligence and machine learning, text analysis is an important topic. With the increasing prevalence of large datasets, analyzing large-scale text data with traditional methods has become a major challenge.
Topic models are widely used methods for analyzing such data and have been applied in many fields, such as recommender systems and image classification [3]. Latent Dirichlet Allocation (LDA) [1] is acknowledged as the most popular topic model, and sLDA is a popular extension of LDA that adds a real-valued response [2] or a category label [3] to each document to solve supervised problems. Compared with other supervised methods, sLDA is easily extensible and can be integrated with other probabilistic models [10]. Moreover, sLDA discovers the underlying semantic patterns of words, so the topics themselves can be used to analyze the dataset.
The most popular methods for training topic models are Gibbs sampling and variational inference. However, both are inefficient and time-consuming. Gibbs sampling requires a large number of cheap iterations, and its convergence is difficult to diagnose. Variational inference (VI) often needs several hundred iterations, and each iteration must process all the documents. Training classification sLDA with either method is especially hard and slow, because the response variable depends nonlinearly on the topic assignments through the softmax distribution's parameters, and the normalization factor strongly couples the topic assignments within each document [9]. Moreover, both training methods are single-process, so they can neither exploit multi-core machines nor be extended to a computer cluster.
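To illustrate why a stochastic update avoids touching every document per iteration, the following toy sketch contrasts a batch update of a global variational parameter with an SVI-style noisy update. This is a minimal illustration under hypothetical names (`batch_update`, `stochastic_update`, a scalar parameter `lam`), not the algorithm proposed in this paper:

```python
import random

random.seed(0)

PRIOR = 0.1     # illustrative prior contribution to the global parameter
N_DOCS = 1000
# Toy per-document sufficient statistics (stand-ins for expected counts).
docs = [random.gauss(3.0, 1.0) for _ in range(N_DOCS)]

def batch_update():
    # Batch VI: every document is visited before a single global update.
    return PRIOR + sum(docs)

def stochastic_update(lam, t, batch_size=10):
    # SVI: a small minibatch gives an unbiased but noisy estimate of the
    # full-corpus statistics, blended into the old value with step size rho.
    batch = random.sample(range(N_DOCS), batch_size)
    noisy = (N_DOCS / batch_size) * sum(docs[i] for i in batch)
    lam_hat = PRIOR + noisy
    rho = (t + 1) ** -0.7           # a decreasing step-size schedule
    return (1 - rho) * lam + rho * lam_hat

lam = 1.0
for t in range(500):
    lam = stochastic_update(lam, t)
# lam now approximates batch_update() after many cheap minibatch steps.
```

The key point is that each stochastic step touches only `batch_size` documents, whereas the batch update must touch all `N_DOCS` before the global parameter moves at all.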
In this paper, we combine two methods to address these problems. The contributions of this paper are as follows:
(1) We develop online sLDA, which uses stochastic variational inference (SVI) to make the training of sLDA more efficient.
(2) We adapt online sLDA to the MapReduce computing framework so that it scales to cloud computing and big data settings.
(3) We apply our method to analyze a large-scale corpus.
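Contribution (2) builds on the standard MapReduce pattern, in which mappers emit partial statistics from disjoint document shards and a reducer merges them before a global update. The following is a minimal single-machine sketch of that pattern; the shard data and function names are hypothetical and do not describe this paper's implementation:

```python
from collections import Counter
from functools import reduce

# Two toy document shards, as they might be split across cluster nodes.
corpus_shards = [
    ["cloud computing text", "topic model text"],
    ["supervised topic model", "cloud data"],
]

def map_shard(shard):
    # Map: each worker turns its shard into partial statistics (plain word
    # counts here, standing in for per-topic sufficient statistics).
    stats = Counter()
    for doc in shard:
        stats.update(doc.split())
    return stats

def merge_stats(a, b):
    # Reduce: partial statistics from different workers are merged by key.
    return a + b

partials = [map_shard(s) for s in corpus_shards]  # parallel on a real cluster
global_stats = reduce(merge_stats, partials, Counter())
print(global_stats["topic"])  # -> 2
```

On Hadoop-style systems the same roles are played by mapper and reducer tasks, with the framework handling data distribution, shuffling, and fault tolerance.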
This paper is organized as follows. Section 2 briefly reviews classification sLDA, stochastic variational inference, and MapReduce. Section 3 proposes a novel sLDA with an efficient training method. Section 4 studies the empirical performance of our method. Section 5 concludes.
2 Backgrounds
In this section, we introduce classification supervised