Efficient Methods for Topic Model Inference on Streaming
Document Collections
Limin Yao, David Mimno, and Andrew McCallum
Department of Computer Science
University of Massachusetts, Amherst
{lmyao, mimno, mccallum}@cs.umass.edu
ABSTRACT
Topic models provide a powerful tool for analyzing large
text collections by representing high dimensional data in a
low dimensional subspace. Fitting a topic model given a set
of training documents requires approximate inference tech-
niques that are computationally expensive. With today’s
large-scale, constantly expanding document collections, it is
useful to be able to infer topic distributions for new doc-
uments without retraining the model. In this paper, we
empirically evaluate the performance of several methods for
topic inference in previously unseen documents, including
methods based on Gibbs sampling, variational inference, and
a new method inspired by text classification. The classification-
based inference method produces results similar to iterative
inference methods, but requires only a single matrix multi-
plication. In addition to these inference methods, we present
SparseLDA, an algorithm and data structure for evaluat-
ing Gibbs sampling distributions. Empirical results indicate
that SparseLDA can be approximately 20 times faster than
traditional LDA and provide twice the speedup of previously
published fast sampling methods, while also using substan-
tially less memory.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous
General Terms
Experimentation, Performance, Design
Keywords
Topic modeling, inference
1. INTRODUCTION
Statistical topic modeling has emerged as a popular method
for analyzing large sets of categorical data in applications
from text mining to image analysis to bioinformatics.

KDD’09, June 28–July 1, 2009, Paris, France.
Copyright 2009 ACM 978-1-60558-495-9/09/06.

Topic models such as latent Dirichlet allocation (LDA) [3] have the
ability to identify interpretable low dimensional components
in very high dimensional data. Representing documents as
topic distributions rather than bags of words reduces the ef-
fect of lexical variability while retaining the overall semantic
structure of the corpus.
Although there have recently been advances in fast inference
for topic models, it remains computationally expensive.
Full topic model inference is infeasible in two common
situations. First, data streams such as blog posts and news
articles are continually updated, and often require real-time
responses in computationally limited settings such as mobile
devices. In this case, although it may periodically be possi-
ble to retrain a model on a snapshot of the entire collection
using an expensive “offline” computation, it is necessary to
be able to project new documents into a latent topic space
rapidly. Second, large scale collections such as information
retrieval corpora and digital libraries may be too big to pro-
cess efficiently. In this case, it would be useful to train a
model on a random sample of documents, and then project
the remaining documents into the latent topic space independently
using a MapReduce-style process. In both cases
there is a need for accurate, efficient methods to infer topic
distributions for documents outside the training corpus. We
refer to this task as “inference”, as distinct from “fitting”
topic model parameters from training data.
This paper has two main contributions. First, we present
a new method for topic model inference in unseen documents
that is inspired by techniques from discriminative text clas-
sification. We evaluate the performance of this method and
several other methods for topic model inference in terms of
speed and accuracy relative to fully retraining a model. We
carried out experiments on two datasets, NIPS and Pubmed.
In contrast to Banerjee and Basu [1], who evaluate different
statistical models on streaming text data, we focus on a sin-
gle model (LDA) and compare different inference methods
based on this model. Second, since many of the methods we
discuss rely on Gibbs sampling to infer topic distributions,
we also present a simple method, SparseLDA, for efficient
Gibbs sampling in topic models along with a data structure
that results in very fast sampling performance with a small
memory footprint. SparseLDA is approximately 20 times
faster than highly optimized traditional LDA and provides
twice the speedup of previously published fast sampling methods [7].
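To make the cost claim concrete, a classification-style inference step of the kind described above can be sketched as a single matrix multiplication over a fixed topic–word weight matrix. This is a minimal illustration, not the paper's exact formulation: the matrix shapes, the random stand-in weights, and the `infer_topics` helper are all assumptions introduced here for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

V, T = 1000, 20                      # vocabulary size, number of topics
# Stand-in for topic-word weights learned during offline training;
# each row is normalized so it sums to one.
topic_word = rng.random((T, V))
topic_word /= topic_word.sum(axis=1, keepdims=True)

def infer_topics(word_counts, weights):
    """Project a bag-of-words count vector into topic space with one
    matrix multiplication, then normalize into a distribution."""
    scores = weights @ word_counts   # (T, V) @ (V,) -> (T,)
    return scores / scores.sum()

# Toy "unseen document" as a vector of word counts.
doc = rng.integers(0, 5, size=V).astype(float)
theta = infer_topics(doc, topic_word)
```

Regardless of the exact weights used, the per-document cost is one `T x V` matrix–vector product plus a normalization, which is what makes this style of inference attractive for streaming settings.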
2. BACKGROUND
A statistical topic model represents the words in docu-
ments in a collection W as mixtures of T “topics,” which