Proceedings of NIDC2016
A FAST AND SCALABLE SUPERVISED TOPIC MODEL
USING STOCHASTIC VARIATIONAL INFERENCE AND
MAPREDUCE
Wenzhuo Song 1,2, Bo Yang 1,2, Xuehua Zhao 2,3, Fei Li 4
1 Jilin University, Changchun 130012, China
2 Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China
3 School of Digital Media, Shenzhen Institute of Information Technology, Shenzhen 518172, China
4 George Mason University, Fairfax, VA 22030, USA
songwzup@foxmail.com, ybo@jlu.edu.cn, lcrlc@sina.com, fli4@gmu.edu
Abstract: Text analysis is an important and widespread task in cloud computing, and topic models are a popular and effective technology for it. Among topic models, sLDA is acknowledged as a popular supervised model: it associates a response variable or category label with each document, so that the model uncovers the latent structure of a text dataset while retaining predictive power for supervised tasks. However, sLDA must process all the documents at every training iteration, so when the dataset grows beyond the volume a single node can handle, sLDA is no longer practical. In this paper we propose a novel model named Mr.sLDA, which extends sLDA with stochastic variational inference (SVI) and MapReduce. SVI reduces the computational burden of sLDA, and MapReduce parallelizes the algorithm. Mr.sLDA makes training more efficient, and the training method can be easily deployed on a large computer cluster or in a cloud environment. Empirical results show that our approach trains efficiently and achieves accuracy comparable to sLDA.
Keywords: Text classification; MapReduce;
Variational inference; Topic modeling
1 Introduction
Within the large body of research in artificial intelligence and machine learning, text analysis is an important topic. With the increasing prevalence of large datasets, analyzing large-scale text data with traditional methods has become a major challenge.
Topic models are widely used methods for analyzing such data and have been applied in many fields, such as recommender systems and image classification [3]. Latent Dirichlet Allocation (LDA) [1] is acknowledged as the most popular topic model, and sLDA is a popular extension of LDA that adds a real-valued response [2] or a category label [3] to each document to solve supervised problems. Compared with other supervised methods, sLDA is easily extensible and can be integrated with other probabilistic models [10]. Moreover, sLDA discovers the underlying semantic patterns of words, so the topics themselves can be used to analyze the dataset.
The most popular methods for training topic models are Gibbs sampling and variational inference. However, both are inefficient and time-consuming. Gibbs sampling requires a large number of cheap iterations, and its convergence is difficult to diagnose. Variational inference (VI) often needs several hundred iterations, and each iteration must process all the documents. Training classification sLDA with either method is especially hard and slow, because the response variable depends nonlinearly on the topic assignments through the softmax distribution's parameters, and the normalization factor strongly couples the topic assignments within each document [9]. Moreover, both training methods are single-process, so they can neither exploit multi-core machines nor be extended to a computer cluster.
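To illustrate why a stochastic update avoids touching every document per iteration, the following toy sketch contrasts a batch update of a global variational parameter with an SVI-style noisy update. This is a minimal illustration under hypothetical names (`batch_update`, `stochastic_update`, a scalar parameter `lam`), not the algorithm proposed in this paper:

```python
import random

random.seed(0)

PRIOR = 0.1     # illustrative prior contribution to the global parameter
N_DOCS = 1000
# Toy per-document sufficient statistics (stand-ins for expected counts).
docs = [random.gauss(3.0, 1.0) for _ in range(N_DOCS)]

def batch_update():
    # Batch VI: every document is visited before a single global update.
    return PRIOR + sum(docs)

def stochastic_update(lam, t, batch_size=10):
    # SVI: a small minibatch gives an unbiased but noisy estimate of the
    # full-corpus statistics, blended into the old value with step size rho.
    batch = random.sample(range(N_DOCS), batch_size)
    noisy = (N_DOCS / batch_size) * sum(docs[i] for i in batch)
    lam_hat = PRIOR + noisy
    rho = (t + 1) ** -0.7           # a decreasing step-size schedule
    return (1 - rho) * lam + rho * lam_hat

lam = 1.0
for t in range(500):
    lam = stochastic_update(lam, t)
# lam now approximates batch_update() after many cheap minibatch steps.
```

The key point is that each stochastic step touches only `batch_size` documents, whereas the batch update must touch all `N_DOCS` before the global parameter moves at all.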
In this paper, we combine two methods to address these problems. The contributions of this paper are as follows:
(1) We develop online sLDA, which uses stochastic variational inference (SVI) to make the training of sLDA more efficient.
(2) We adapt online sLDA to the MapReduce computing framework so that it scales to cloud computing and big data settings.
(3) We apply our method to analyze a large-scale corpus.
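Contribution (2) builds on the standard MapReduce pattern, in which mappers emit partial statistics from disjoint document shards and a reducer merges them before a global update. The following is a minimal single-machine sketch of that pattern; the shard data and function names are hypothetical and do not describe this paper's implementation:

```python
from collections import Counter
from functools import reduce

# Two toy document shards, as they might be split across cluster nodes.
corpus_shards = [
    ["cloud computing text", "topic model text"],
    ["supervised topic model", "cloud data"],
]

def map_shard(shard):
    # Map: each worker turns its shard into partial statistics (plain word
    # counts here, standing in for per-topic sufficient statistics).
    stats = Counter()
    for doc in shard:
        stats.update(doc.split())
    return stats

def merge_stats(a, b):
    # Reduce: partial statistics from different workers are merged by key.
    return a + b

partials = [map_shard(s) for s in corpus_shards]  # parallel on a real cluster
global_stats = reduce(merge_stats, partials, Counter())
print(global_stats["topic"])  # -> 2
```

On Hadoop-style systems the same roles are played by mapper and reducer tasks, with the framework handling data distribution, shuffling, and fault tolerance.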
This paper is organized as follows. Section 2 briefly reviews classification sLDA, stochastic variational inference, and MapReduce. Section 3 proposes a novel sLDA with an efficient training method. Section 4 studies the empirical performance of our method. Section 5 concludes.
2 Backgrounds
In this section, we introduce classification supervised