Link-PLSA-LDA: A New Unsupervised Model for
Topics and Influence of Blogs
Ramesh Nallapati and William Cohen
{nmramesh@cs.cmu.edu} {wcohen@cs.cmu.edu}
Machine Learning Department,
Carnegie Mellon University,
5000 Forbes Ave., Pittsburgh, PA 15213, USA
Abstract
In this work, we address the twin problems of unsupervised
topic discovery and estimation of topic specific influence of
blogs. We propose a new model that can be used to provide a
user with highly influential blog postings on the topic of the
user’s interest.
We adopt the framework of an unsupervised model called Latent
Dirichlet Allocation (Blei, Ng, & Jordan 2003), known
for its effectiveness in topic discovery. An extension of this
model, which we call Link-LDA (Erosheva, Fienberg, & Laf-
ferty 2004), defines a generative model for hyperlinks and
thereby models topic specific influence of documents, the
problem of our interest. However, this model does not ex-
ploit the topical relationship between the documents on ei-
ther side of a hyperlink, i.e., the notion that documents tend
to link to other documents on the same topic. We propose
a new model, called Link-PLSA-LDA, that combines PLSA
(Hofmann 1999) and LDA (Blei, Ng, & Jordan 2003) into a
single framework, and explicitly models the topical relation-
ship between the linking and the linked document.
The output of the new model on blog data reveals very inter-
esting visualizations of topics and influential blogs on each
topic. We also perform quantitative evaluation of the model
using log-likelihood of unseen data and on the task of link
prediction. Both experiments show that the new model
performs better, suggesting its superiority over Link-LDA in
modeling topics and topic specific influence of blogs.
Introduction
The proliferation of blogs in the recent past has posed several
new, interesting challenges to researchers in the information
retrieval and data mining communities. In particular, there
is an increasing need for automatic techniques that help
users quickly access blogs that are not only informative and
popular, but also relevant to the user’s topics of interest.
Significant progress has been made toward this objective in
the recent past. For example, Java et al. (Java et al.
2006) studied the performance of various algorithms, such
as PageRank, HITS and in-degree, on modeling the influence
of blogs. Kale et al. (Kale et al. 2006) exploited the polarity
(agreement/disagreement) of the hyperlinks and applied
a trust propagation algorithm to model the propagation of
influence between blogs.
Copyright © 2008, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
The above-mentioned papers address modeling influence
in general, but it is also important to model the influence of
blogs with respect to the topic of the user’s interest. This
problem has been addressed by the work of Haveliwala
(Haveliwala 2002) in the context of keyword search. In that
work, PageRanks of documents are pre-computed for a certain
number of topics. At query time, for each document
matching the query, its PageRanks for the various topics are
combined, based on the similarity of the query to each topic,
to obtain a topic-sensitive PageRank. The author shows that
the new PageRank performs better than the
traditional PageRank on keyword search. The topics used in
the algorithm are, however, obtained from an external repository.
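The query-time combination step described above can be illustrated with a minimal sketch. It assumes that the per-topic PageRank vectors have already been computed and that the combination is a linear mixture weighted by normalized query-topic similarities; the function name, the argument names and the toy numbers are hypothetical and are not taken from (Haveliwala 2002).

```python
def topic_sensitive_scores(topic_pageranks, query_topic_sim):
    """Mix pre-computed per-topic PageRank vectors for one query.

    topic_pageranks: list of lists; topic_pageranks[t][d] is the PageRank
        of document d computed for topic t (assumed pre-computed).
    query_topic_sim: list of non-negative similarities of the query to
        each topic (how they are obtained is left open here).
    Returns one combined score per document.
    """
    total = sum(query_topic_sim)
    if total <= 0:
        raise ValueError("query must have positive similarity to some topic")
    # Normalize similarities so that the mixture weights sum to one.
    weights = [s / total for s in query_topic_sim]
    num_docs = len(topic_pageranks[0])
    # Weighted sum over topics, computed separately for each document.
    return [
        sum(w * ranks[d] for w, ranks in zip(weights, topic_pageranks))
        for d in range(num_docs)
    ]


# Toy example: two topics, three documents (illustrative numbers only).
toy_pageranks = [
    [0.6, 0.3, 0.1],  # PageRanks under topic 0
    [0.1, 0.2, 0.7],  # PageRanks under topic 1
]
toy_similarity = [3.0, 1.0]  # the query is closer to topic 0
scores = topic_sensitive_scores(toy_pageranks, toy_similarity)
ranking = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
```

In this toy setting the query is three times closer to topic 0 than to topic 1, so document 0, which is prominent under topic 0, is ranked first even though document 2 has the highest PageRank under topic 1.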
Ideally, it would be very useful to mine these topics automatically
as well. The problem of automatic topic mining from blogs has
been addressed by Glance et al. (Natalie S. Glance & Tomokiyo
2006), where the authors used a
combination of NLP techniques, clustering and heuristics to
mine topics and trends from blogs. However, this work does
not address modeling the influence of blog postings with re-
spect to the topics discovered.
In our work, we aim at addressing both these problems si-
multaneously, i.e., topic discovery as well as modeling topic
specific influence of blogs, in a completely unsupervised
fashion. Towards this objective, we employ the probabilistic
framework of latent topic models such as the Latent Dirich-
let Allocation (Blei, Ng, & Jordan 2003), and propose a new
model in this framework.
The rest of the paper is organized as follows. We first
discuss some of the past work done on joint models of
topics and influence in the framework of latent topic models.
We then describe our new model and report
the results of our experiments on blog data. We conclude the
discussion with a few remarks on directions for
future work.
Note that in the rest of the paper, we use the terms ‘citation’
and ‘hyperlink’ interchangeably. Likewise, the term ‘citing’
is synonymous with ‘linking’, and ‘cited’ with ‘linked’.
The reader may also refer to Table 1 for notation used
frequently in this paper.