Link-PLSA-LDA: A New Unsupervised Model of Topics and Blog Influence


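"This document discusses how to improve the LDA (Latent Dirichlet Allocation) model by combining it with PLSA (Probabilistic Latent Semantic Analysis) to strengthen data mining and topic detection on blog data. It proposes a new unsupervised model, Link-PLSA-LDA, for discovering topics and estimating topic-specific influence. The model aims to provide users with highly influential blog postings in the areas they are interested in."

In information retrieval and natural language processing, LDA is a widely used topic model, proposed by David Blei, Andrew Ng, and Michael Jordan in 2003. LDA assumes that each document is a mixture of topics and that each topic is a distribution over words; Bayesian inference is used to recover this hidden topic structure. By modeling the probability of the words that occur in a document, LDA can identify the document's topical content. LDA itself, however, does not consider relationships between documents, in particular the semantic association expressed by hyperlinks.

To address this gap, the model of Erosheva, Fienberg, and Lafferty, which the paper calls Link-LDA, is a generative model of both document content and hyperlinks, and it can be used to estimate the topic-specific influence of documents. Link-LDA, however, does not fully exploit the topical relatedness of the documents on either side of a hyperlink. The Link-PLSA-LDA model was proposed in response and combines characteristics of PLSA and LDA. PLSA is likewise an unsupervised method for uncovering the latent semantic structure behind documents; on its own it models word co-occurrence within documents, not the links between them. In Link-PLSA-LDA, a hyperlink not only represents a relationship between documents but also reflects the extent to which they share topics, which better captures how information flows through the network.

The novelty of Link-PLSA-LDA is that it uses LDA's strengths in topic discovery while borrowing PLSA's modeling of topic distributions within documents. This lets the model identify influential blog postings more precisely and recommend content that is highly relevant to the topics a user cares about.

Link-PLSA-LDA extends the standard LDA model: by incorporating features of PLSA, it improves topic discovery and influence analysis for hyperlinked documents, which makes it useful for information retrieval, social media analysis, and personalized recommendation.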

Link-PLSA-LDA: A New Unsupervised Model for Topics and Influence of Blogs

Ramesh Nallapati and William Cohen

{nmramesh@cs.cmu.edu} {wcohen@cs.cmu.edu}

Machine Learning Department,

Carnegie Mellon University,

5000 Forbes Ave., Pittsburgh, PA 15213, USA

Abstract

In this work, we address the twin problems of unsupervised topic discovery and estimation of topic specific influence of blogs. We propose a new model that can be used to provide a user with highly influential blog postings on the topic of the user's interest.

We adopt the framework of an unsupervised model called Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003), known for its effectiveness in topic discovery. An extension of this model, which we call Link-LDA (Erosheva, Fienberg, & Lafferty 2004), defines a generative model for hyperlinks and thereby models topic specific influence of documents, the problem of our interest. However, this model does not exploit the topical relationship between the documents on either side of a hyperlink, i.e., the notion that documents tend to link to other documents on the same topic. We propose a new model, called Link-PLSA-LDA, that combines PLSA (Hofmann 1999) and LDA (Blei, Ng, & Jordan 2003) into a single framework, and explicitly models the topical relationship between the linking and the linked document.

The output of the new model on blog data reveals very interesting visualizations of topics and influential blogs on each topic. We also perform quantitative evaluation of the model using log-likelihood of unseen data and on the task of link prediction. Both experiments show that the new model performs better, suggesting its superiority over Link-LDA in modeling topics and topic specific influence of blogs.

Introduction

The proliferation of blogs in recent years has posed several new and interesting challenges to researchers in the information retrieval and data mining community. In particular, there is an increasing need for automatic techniques that help users quickly access blogs that are not only informative and popular, but also relevant to the user's topics of interest.

Significant progress has recently been made towards this objective. For example, Java et al. (Java et al. 2006) studied the performance of various algorithms, such as PageRank, HITS and in-degree, on modeling the influence of blogs. Kale et al. (Kale et al. 2006) exploited the polarity (agreement/disagreement) of the hyperlinks and applied a trust propagation algorithm to model the propagation of influence between blogs.

 2

008, Association for the Advancement of Artiﬁcial

The above-mentioned papers address modeling influence in general, but it is also important to model the influence of blogs with respect to the topic of the user's interest. This problem has been addressed by the work of Haveliwala (Haveliwala 2002) in the context of keyword search. In that work, PageRanks of documents are pre-computed for a certain number of topics. At query time, for each document matching the query, its PageRanks for the various topics are combined based on the similarity of the query to each topic, to obtain a topic-sensitive PageRank. The author shows that the new PageRank outperforms the traditional PageRank on keyword search. The topics used in the algorithm are, however, obtained from an external repository.
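The query-time combination described above can be illustrated with a minimal sketch. It assumes the combination is a weighted sum in which the query-to-topic similarities are normalized into weights; the exact weighting used by Haveliwala may differ. The function name `combine_topic_pageranks`, the blog identifiers, and all numeric values are hypothetical, and retrieval of the documents matching the query is assumed to have happened already.

```python
from typing import Dict, List


def combine_topic_pageranks(
    topic_pageranks: Dict[str, List[float]],
    query_topic_similarity: List[float],
) -> Dict[str, float]:
    """Blend per-topic PageRank scores, weighting each topic by query similarity."""
    total = sum(query_topic_similarity)
    if total <= 0:
        raise ValueError("query must have positive similarity to at least one topic")
    # Normalize similarities so that the topic weights sum to one (an assumption).
    weights = [s / total for s in query_topic_similarity]
    return {
        doc: sum(w * score for w, score in zip(weights, scores))
        for doc, scores in topic_pageranks.items()
    }


# Hypothetical pre-computed PageRanks for two topics: [politics, sports].
topic_pageranks = {
    "blog_a": [0.6, 0.1],
    "blog_b": [0.3, 0.3],
    "blog_c": [0.1, 0.6],
}
# Hypothetical similarity of the query to each topic (a mostly sports query).
query_topic_similarity = [0.2, 0.8]

scores = combine_topic_pageranks(topic_pageranks, query_topic_similarity)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these toy numbers the sports-heavy blog ranks first, even though `blog_a` has the highest PageRank on the other topic, which is the behavior a topic-sensitive score is meant to produce.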

Ideally, it would be very useful to mine these topics automatically as well. The problem of automatic topic mining from blogs has been addressed by Glance et al. (Natalie S. Glance & Tomokiyo 2006), where the authors used a combination of NLP techniques, clustering and heuristics to mine topics and trends from blogs. However, this work does not address modeling the influence of blog postings with respect to the topics discovered.

In our work, we aim at addressing both these problems simultaneously, i.e., topic discovery as well as modeling topic specific influence of blogs, in a completely unsupervised fashion. Towards this objective, we employ the probabilistic framework of latent topic models such as the Latent Dirichlet Allocation (Blei, Ng, & Jordan 2003), and propose a new model in this framework.
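As background for the latent topic model framework, the generative story of standard LDA can be sketched as follows. This is plain LDA, not the Link-PLSA-LDA model proposed in this paper, and it only generates data; it does not perform inference. The helper names, the toy vocabulary of four word ids, and the hyperparameter values are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_dirichlet(rng: random.Random, alpha: Sequence[float]) -> List[float]:
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(
    rng: random.Random,
    alpha: Sequence[float],
    topic_word: Sequence[Sequence[float]],
    doc_length: int,
) -> Tuple[List[float], List[int], List[int]]:
    """Generate one document: topic proportions, per-word topics, and word ids."""
    theta = sample_dirichlet(rng, alpha)  # per-document mixture over topics
    topics, words = [], []
    for _ in range(doc_length):
        # Pick a topic for this word position from the document's mixture.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # Pick a word id from the chosen topic's distribution over the vocabulary.
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Hypothetical setup: 2 topics over a 4-word vocabulary.
# Topic 0 only emits word ids 0 and 1; topic 1 only emits word ids 2 and 3.
topic_word = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
rng = random.Random(0)
theta, topics, words = generate_document(rng, [0.5, 0.5], topic_word, doc_length=20)
print(theta, topics, words)
```

Because the two toy topics use disjoint words, each generated word id reveals which topic produced it, which makes the mixture structure easy to inspect by eye.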

The rest of the paper is organized as follows. We first discuss past work on joint models of topics and influence in the framework of latent topic models. We then describe our new model and report the results of our experiments on blog data. We conclude the discussion with a few remarks on directions for future work.

Note that in the rest of the paper, we use the terms 'citation' and 'hyperlink' interchangeably. Likewise, the term 'citing' is synonymous with 'linking', and 'cited' with 'linked'. The reader may also refer to Table 1 for notation used frequently in this paper.
