An Improved Latent Dirichlet Allocation Method for Service Topic Detection
GUO Lantian1, LI Zhe1, YANG Tao1,2*, ZHANG Huixiang1, MU Dejun1, LI Yang1
1. School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
E-mail: yangtao107@nwpu.edu.cn
2. State Key Lab for Manufacturing Systems Engineering, Xi’an Jiaotong University, Xi’an 710049, China
Abstract: Service topic detection is one of the most important techniques in service information extraction, clustering, and
recommendation. Compared with short-text corpora from social networks, service description corpora have higher
dimensionality and greater diversity, which makes it difficult to detect topics from a large number of service descriptions.
To address these challenges, we propose a new topic detection method based on the LDA (Latent Dirichlet Allocation) model,
referred to as CV-LDA (Context-sensitive word Vector based LDA). It uses a word-embedding-based method that generates
context-sensitive vectors to cluster words and thereby reduce dimensionality. Topic perplexity analysis on a real-world
dataset shows that topics detected by our method have lower perplexity than those obtained with word-frequency-weighting-based vectors.
Key Words: Word Embedding, LDA Model, Service Topic, Perplexity
1 Introduction
With the explosive development of the Internet of Things
(IoT) and Industry 4.0, humans, Smart Factories, and Smart
Devices can connect and communicate with each other via
the Internet of Things and the Internet of Services. Service
Oriented Architecture (SOA) is a middleware model that links
different functional units (services) through well-defined
interfaces [1]. Both individual and enterprise customers
apply web or cloud services to their commercial information
systems or personal applications.
The rapid growth of web services and cloud services
poses many challenges for service clustering, selection, and
recommendation, because users and developers face a very
large number of services. In service clustering and
recommendation systems, most research concentrates on
extracting QoS ratings and QoS information of services.
However, since many service developers have published only
limited QoS information for their services, the QoS data
obtained are usually very limited [2]. One approach to the
service discovery and recommendation problem is to exploit
the semantic information of services.
Probabilistic topic models are a machine learning
technique whose aim is to discover the hidden topic
structure in large-scale document collections [3]. LDA (Latent
Dirichlet Allocation) is a probabilistic graphical model for
topic discovery. More specifically, LDA learns the topic
representation of each document and the words associated with
each topic. The LDA model has been applied successfully to
news articles and abstracts of academic articles. However,
unlike news and academic articles, service descriptions
are extremely short, and the resulting corpus has such high
dimensionality that processing efficiency suffers. At the
same time, topics are unevenly distributed, which makes the
detected topics unclear.
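To make the two quantities that LDA learns concrete, the following is a minimal sketch of a collapsed Gibbs sampler, one common inference procedure for LDA. This section does not state which inference procedure the paper uses, so the sampler, the hyperparameters `alpha` and `beta`, the function names, and the toy corpus are all illustrative assumptions. The sketch returns `theta` (per-document topic proportions) and `phi` (per-topic word distributions) and evaluates them with the standard per-token perplexity.

```python
import math
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is a list of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n_kw
    topic_total = [0] * num_topics                              # n_k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + alpha) * (n_tw + beta) / (n_t + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed point estimates of the document-topic and topic-word distributions.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi

def perplexity(docs, theta, phi):
    """exp of the negative average per-token log-likelihood (lower is better)."""
    log_lik, n_tokens = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

# Hypothetical toy corpus: word ids index into `vocab`.
vocab = ["weather", "forecast", "temperature", "payment", "invoice", "currency"]
docs = [[0, 1, 2, 0], [1, 2, 2], [3, 4, 5, 3], [4, 5, 5], [0, 2, 1]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
print("perplexity:", perplexity(docs, theta, phi))
```

Both count tables grow with the vocabulary size, which is one way to see why a very large, sparse vocabulary of short service descriptions makes the model expensive to fit.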
* This work is supported by the National Natural Science Foundation (NNSF)
of China under Grants 61402373, 61303224, and 61403311.
Word embedding is a set of language modeling and
learning techniques in natural language processing in which
words are mapped to vectors in a low-dimensional space
[4]. Neural-network-based word embedding methods have
been shown to capture high-quality syntactic and semantic
relationships between words in the vector space [5].
Because the vector representations they learn capture
semantic context features, these vectors can be used to
cluster similar features, which represents semantics
efficiently and reduces the dimensionality of the LDA model.
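The dimensionality reduction described above can be illustrated with a small sketch: words whose vectors are close under cosine similarity are merged into one cluster, and the cluster id then stands in for the word. The vectors below are hand-written toy values rather than trained embeddings, and the greedy single-pass clustering, the `threshold` value, and the function names are assumptions made for illustration; the clustering algorithm actually used by CV-LDA is not specified in this section.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_words(vectors, threshold=0.8):
    """Greedy single-pass clustering: a word joins the first cluster whose
    seed vector is at least `threshold` similar, else it starts a new cluster."""
    seeds = []            # list of (cluster_id, seed_vector)
    word_to_cluster = {}
    for word, vec in vectors.items():
        for cid, seed in seeds:
            if cosine(vec, seed) >= threshold:
                word_to_cluster[word] = cid
                break
        else:
            cid = len(seeds)
            seeds.append((cid, vec))
            word_to_cluster[word] = cid
    return word_to_cluster

# Hypothetical 3-dimensional "embeddings" for five words.
toy_vectors = {
    "weather":  [0.9, 0.1, 0.0],
    "forecast": [0.8, 0.2, 0.1],
    "payment":  [0.1, 0.9, 0.1],
    "invoice":  [0.0, 0.8, 0.2],
    "map":      [0.1, 0.1, 0.9],
}
word_to_cluster = cluster_words(toy_vectors)
print(word_to_cluster)
# {'weather': 0, 'forecast': 0, 'payment': 1, 'invoice': 1, 'map': 2}
```

In this toy case the five-word vocabulary collapses to three clusters, so a topic model trained on cluster ids would have three word dimensions instead of five.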
In this paper, we propose an improved LDA method for
service description topic detection. Its main contributions
are as follows:
• We describe the limitations of the LDA model when
processing service description corpora, and the
advantages of the word embedding vectorization method.
• To address the problems of existing methods, we propose
a service description topic detection method that
combines context-sensitive vectorization with the LDA
model, and that also appends service snippets from a
search engine as an auxiliary corpus for the service descriptions.
• We crawl a large number of recent real-world service
descriptions and report experimental results on this
real-world dataset.
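One possible reading of how the second contribution prepares the input for LDA is sketched below: a short service description is extended with search-engine snippets, and each known word is replaced by the id of its word cluster. The function name `build_lda_input`, the whitespace tokenization, and the example strings are hypothetical, and the `word_to_cluster` mapping is assumed to come from a separate clustering step.

```python
def build_lda_input(description, snippets, word_to_cluster):
    """Join a description with its auxiliary snippets, tokenize on whitespace,
    and map each known word to its cluster id (unknown words are dropped)."""
    text = " ".join([description, *snippets]).lower()
    tokens = [t.strip(".,;:!?()") for t in text.split()]
    return [word_to_cluster[t] for t in tokens if t in word_to_cluster]

# Hypothetical cluster assignment and service texts.
word_to_cluster = {"weather": 0, "forecast": 0, "payment": 1, "invoice": 1}
description = "Weather API."
snippets = ["Hourly forecast and weather alerts."]

doc = build_lda_input(description, snippets, word_to_cluster)
print(doc)  # [0, 0, 0]
```

The snippets lengthen the otherwise very short document, while the cluster ids keep the vocabulary seen by the topic model small.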
2 Related Work
Existing service topic detection approaches rely on
information in WSDL (Web Services Description Language)
documents, a kind of structured document provided by the
service publisher. Liu et al. [6] use text mining to
extract structured features such as host name, service
name, and service content from WSDL documents.
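As a rough illustration of the kind of structured features mentioned above, the sketch below reads the service name, the host name, and the documentation text from a small WSDL 1.1 document using only the Python standard library. It is not the procedure of Liu et al. [6]; the toy WSDL document, the `example.com` endpoint, and the function name `extract_features` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"
SOAP_NS = "http://schemas.xmlsoap.org/wsdl/soap/"

# Hypothetical minimal WSDL 1.1 document.
TOY_WSDL = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/" name="WeatherService">
  <service name="WeatherService">
    <documentation>Returns the weather forecast for a city.</documentation>
    <port name="WeatherPort" binding="tns:WeatherBinding">
      <soap:address location="http://example.com/weather"/>
    </port>
  </service>
</definitions>"""

def extract_features(wsdl_text):
    """Return service name, endpoint host name, and documentation text."""
    root = ET.fromstring(wsdl_text)
    service = root.find(f"{{{WSDL_NS}}}service")
    doc = service.find(f"{{{WSDL_NS}}}documentation")
    address = service.find(f".//{{{SOAP_NS}}}address")
    location = address.get("location") if address is not None else ""
    return {
        "service_name": service.get("name"),
        "host_name": urlparse(location).hostname,
        "content": (doc.text or "").strip() if doc is not None else "",
    }

features = extract_features(TOY_WSDL)
print(features)
# {'service_name': 'WeatherService', 'host_name': 'example.com',
#  'content': 'Returns the weather forecast for a city.'}
```

Only the free-text `content` field carries unstructured semantic information; the other fields are fixed by the document structure.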
A significant limitation of existing service topic detection
approaches is that they focus solely on structured
documents, so the service topic detection system cannot make
use of heterogeneous and unstructured semantic information.
To deal with such semantic information, Chen et al. [7]
use the LDA model to extract feature words from service
Proceedings of the 35th Chinese Control Conference
July 27-29, 2016, Chengdu, China