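Self-Organizing Weighted Incremental Probabilistic Latent Semantic Analysis: An Efficient Method for Big-Data Document Classification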


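This paper examines a method called Self-Organizing Weighted Incremental Probabilistic Latent Semantic Analysis (WIPLSA). As information technology advances, digital content such as news, blogs, web pages, research articles, and books keeps accumulating, which makes information retrieval and understanding increasingly difficult. To meet this challenge, the researchers propose WIPLSA, a text-mining tool designed for large-scale data sets.

WIPLSA combines Probabilistic Latent Semantic Analysis (PLSA) with the ideas of self-organization and incremental learning. PLSA is a widely used topic model that uncovers latent topic structure by analyzing the co-occurrence of words in documents. Traditional PLSA, however, can become inefficient on large-scale data, especially when the data set keeps growing.

Self-organization refers to a system's ability to organize and optimize itself. In an unsupervised or semi-supervised setting, it clusters and arranges the data into a structure that does not have to be specified in advance. WIPLSA uses this property to handle documents that cover several topics, so that the model can discover the topics in a document and the relations among them automatically.

Incremental learning means that the model is updated as new data arrives, without retraining the whole model. This is essential for streaming data and for data sets that keep growing. WIPLSA processes new documents incrementally and adjusts only the weights of the affected parts, which reduces both computation and storage costs.

The paper states that the strengths of WIPLSA are its applicability to large data sets and its good performance on document classification tasks. The keywords are probabilistic latent semantic analysis, incremental learning, similarity, and big data. The paper was received on 5 February 2016 and accepted on 10 April 2017; the copyright belongs to Springer-Verlag Berlin Heidelberg.

In summary, the study offers a strategy for processing large-scale text data: by combining self-organization, incremental learning, and PLSA, WIPLSA gives text mining and information retrieval a more efficient and flexible tool, one with practical value for information management on large collections.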

Int. J. Mach. Learn. & Cyber.


probability, it is supposed to be considered as a topic about “sports”. We use $P(d_i)$ to denote the probability that a particular document $d_i$ will be observed, $P(w_j \mid z_k)$ denotes the class-conditional probability of a specific word conditioned on the latent class variable $z_k$, and $P(z_k \mid d_i)$ signifies a document-specific probability distribution over the latent variable space. Each word $w_j$ in document $d_i$ can be generated as follows. First, select a document $d_i$ with probability $P(d_i)$. Second, pick a latent class $z_k$ with probability $P(z_k \mid d_i)$, and finally generate a word $w_j$ with probability $P(w_j \mid z_k)$. Figure 1 is the graphic model.
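The three-step generative process above can be illustrated with a short sampling routine. This is only a sketch under stated assumptions: the function name `sample_corpus`, its arguments, and the toy probability tables are hypothetical and do not come from the paper; the distributions are passed in as plain nested lists.

```python
import random


def sample_corpus(p_d, p_z_given_d, p_w_given_z, n_tokens, seed=0):
    """Draw (document, word) index pairs by the three-step PLSA process.

    p_d[i]            ~ P(d_i)
    p_z_given_d[i][k] ~ P(z_k | d_i)
    p_w_given_z[k][j] ~ P(w_j | z_k)
    All names and the list-of-lists layout are illustrative assumptions.
    """
    rng = random.Random(seed)
    docs = list(range(len(p_d)))
    topics = list(range(len(p_w_given_z)))
    words = list(range(len(p_w_given_z[0])))
    pairs = []
    for _ in range(n_tokens):
        d = rng.choices(docs, weights=p_d)[0]                # step 1: pick d_i
        z = rng.choices(topics, weights=p_z_given_d[d])[0]   # step 2: pick z_k
        w = rng.choices(words, weights=p_w_given_z[z])[0]    # step 3: emit w_j
        pairs.append((d, w))
    return pairs


# Toy tables (hypothetical): two documents, two topics, four words.
toy_p_d = [0.5, 0.5]
toy_p_z_given_d = [[1.0, 0.0], [0.0, 1.0]]
toy_p_w_given_z = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
toy_pairs = sample_corpus(toy_p_d, toy_p_z_given_d, toy_p_w_given_z, n_tokens=200)
```

In the toy tables each document uses exactly one topic, so document 0 can only emit words 0 and 1 and document 1 can only emit words 2 and 3. The latent topic `z` is discarded after sampling, which mirrors the fact that only document and word pairs are observed.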

The standard procedure for maximum likelihood estimation in PLSA is the Expectation Maximization (EM) algorithm [21]. According to the EM algorithm and the PLSA model, in the E-step, P(z|d,w) is updated by Eq. (1). It is the probability that a word w in a particular document d is explained by the topic corresponding to z. In the M-step, we update P(w|z) and P(z|d) by Eqs. (2) and (3) respectively.

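$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\,P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\,P(z_l \mid d_i)} \tag{1}$$

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\,P(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m)\,P(z_k \mid d_i, w_m)} \tag{2}$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\,P(z_k \mid d_i, w_j)}{\sum_{l=1}^{K} \sum_{j=1}^{M} n(d_i, w_j)\,P(z_l \mid d_i, w_j)} \tag{3}$$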
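The E-step and M-step of Eqs. (1) to (3) can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `plsa_em`, the random initialisation, the fixed iteration count, and the toy count matrix are all assumptions. Every document is assumed to contain at least one word. The log-likelihood recorded here omits the constant $P(d_i)$ term.

```python
import math
import random


def plsa_em(counts, n_topics, n_iter=30, seed=0):
    """EM for PLSA on a count matrix counts[i][j] = n(d_i, w_j).

    Returns (p_w_given_z, p_z_given_d, log_likelihoods).
    Initialisation and stopping rule are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Strictly positive random start so every posterior is well defined.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]

    history = []
    for _ in range(n_iter):
        num_w_z = [[0.0] * n_words for _ in range(n_topics)]  # numerators of Eq. (2)
        num_z_d = [[0.0] * n_topics for _ in range(n_docs)]   # numerators of Eq. (3)
        log_lik = 0.0
        for i in range(n_docs):
            for j in range(n_words):
                n_ij = counts[i][j]
                if n_ij == 0:
                    continue
                # E-step, Eq. (1): posterior P(z_k | d_i, w_j) for each k.
                joint = [p_w_z[k][j] * p_z_d[i][k] for k in range(n_topics)]
                norm = sum(joint)
                log_lik += n_ij * math.log(norm)
                for k in range(n_topics):
                    weighted = n_ij * joint[k] / norm
                    num_w_z[k][j] += weighted
                    num_z_d[i][k] += weighted
        history.append(log_lik)
        # M-step: dividing by the row sum is the denominator of Eqs. (2) and (3).
        p_w_z = [normalize(row) for row in num_w_z]
        p_z_d = [normalize(row) for row in num_z_d]
    return p_w_z, p_z_d, history


# Toy corpus (hypothetical): four documents over a four-word vocabulary.
toy_counts = [[4, 4, 0, 0], [5, 3, 0, 0], [0, 0, 4, 4], [0, 0, 3, 5]]
toy_p_w_z, toy_p_z_d, toy_history = plsa_em(toy_counts, n_topics=2)
```

Accumulating the numerators inside the E-step loop avoids storing the full posterior table $P(z_k \mid d_i, w_j)$, which would need one entry per document, word, and topic. The recorded log-likelihood should not decrease from one iteration to the next, which gives a simple sanity check on the updates.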

3 Related work

In this section, we introduce the main incremental techniques for PLSA and highlight MAP PLSA and QB PLSA. Because growing data sets vary over time, it is necessary to discover dynamic topics and to process large data sets incrementally. Moreover, PLSA models suffer from the problem of inference on new documents. To overcome these problems, we incorporate new words and documents into an existing system to update a PLSA model, and different updating methods are utilized for model learning [27]. There are several noteworthy related works. Here, we give a brief introduction.

Hofmann proposed the “fold-in” update scheme in [29]. The incremental strategy was to update P(z|d) in the model while keeping P(w|z) fixed. A “fold-in” approach similar to this one was also used in [30]. The authors proposed Incrementally Built Aspect Models (BAMs) to dynamically discover new topics from document streams. BAMs were probabilistic models designed to accommodate new topics with the spectral algorithm. This approach retained all the conditional probabilities of the old words, given the old latent variables, and the spectral step was used to estimate the probabilities of

Table 1 The notation convention for parameter estimation

| Notations | Explanations |
| --- | --- |
| $D = \{d_1, d_2, \ldots, d_N\}$ | Training/adaptation data with N documents |
| $W = \{w_1, w_2, \ldots, w_M\}$ | Vocabulary with M words |
| $K$ | The number of topics |
| $\theta = \{P(w_j \mid z_k), P(z_k \mid d_i)\}$ | PLSA parameter set with latent variable $Z = \{z_1, \ldots, z_K\}$ |
| $\varphi = \{\alpha_{j,k}, \beta_{k,i}\}$ | Hyperparameters of PLSA parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ |
| $P(z_k \mid d_i, w_j)$ | Posterior probability of latent variable $z_k$ generating document $d_i$ and word $w_j$ |
| $n(d_i, w_j)$ | Occurrences of word $w_j$ in document $d_i$ |
| $n(d_i)$ | Total occurrences of $\{w_1, \ldots, w_M\}$ in document $d_i$ |
| […]$(\hat{\theta} \mid \theta)$ | Log posterior probability with current estimate $\theta$ and new estimate $\hat{\theta}$ |
| $\gamma$ | Balance factor between new documents and old documents |
| […] $= \{D, \ldots, D\}$ | Sequence of adaptation documents |
| $\{$[…]$, D, \ldots, D\}$ | Sequence of input documents, including training ones and adaptation ones |
| $\varphi^{(n)} = \{\alpha^{(n)}_{j,k}, \beta^{(n)}_{k,i}\}$ | At the nth epoch, hyperparameters of PLSA parameters $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ |
| $V$ | The vector representing the document set which has N documents |
| […] | The ith component of $V$ |
| P(𝐰 […]) | The probability of generating the words in document […] |

[Figure: chain d → z → w, labelled with P(d), P(z|d), and P(w|z)]

Fig. 1 The graphic model of PLSA
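The “fold-in” scheme described in the Related work section, which updates P(z|d) for a new document while keeping P(w|z) fixed, can be sketched as a restricted EM loop. This is a hedged illustration only: the function name `fold_in`, the uniform starting point, the iteration count, and the toy topic table are assumptions, and the handling of words that no topic can generate (they are skipped) is a choice made for this sketch, not something the text specifies.

```python
def fold_in(new_counts, p_w_given_z, n_iter=30):
    """Estimate P(z_k | d_new) for one unseen document by fold-in.

    new_counts[j]     = occurrences of word w_j in the new document
    p_w_given_z[k][j] = P(w_j | z_k) from a trained model, kept fixed
    Names, defaults, and the skip rule below are illustrative assumptions.
    """
    n_topics = len(p_w_given_z)
    p_z = [1.0 / n_topics] * n_topics  # uniform start (assumption)
    for _ in range(n_iter):
        acc = [0.0] * n_topics
        for j, n_j in enumerate(new_counts):
            if n_j == 0:
                continue
            # E-step restricted to the new document, as in Eq. (1).
            joint = [p_w_given_z[k][j] * p_z[k] for k in range(n_topics)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # word has zero probability under every topic
            for k in range(n_topics):
                acc[k] += n_j * joint[k] / norm
        total = sum(acc)
        if total == 0.0:
            break  # nothing in the document is explained by the model
        # M-step for P(z | d_new) only, as in Eq. (3); P(w | z) is untouched.
        p_z = [a / total for a in acc]
    return p_z


# Toy topic table (hypothetical): topic 0 covers words 0-1, topic 1 covers words 2-3.
toy_topics = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
mixed_doc = fold_in([2, 0, 2, 0], toy_topics)
```

Because the word distributions stay fixed, the cost of folding in one document depends only on its own word counts and the number of topics, not on the size of the existing corpus. The trade-off is that new vocabulary and new topics cannot be absorbed this way.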
