The remainder of this paper is organized as follows. Related work is discussed in Section 2. Section 3 introduces the proposed semi-supervised methods. In Section 4, we describe the experiments comparing the performance of the proposed models with that of baseline methods. Finally, we conclude the paper and highlight directions for future work in Section 5.
2. Related work
2.1. Aspect-based opinion mining
As the volume of product reviews grows rapidly, aspect-based opinion mining has become an active research topic [1,2,19,10,7,11,12,20]. Aspect-based opinion mining aims to produce a summary of customer opinion on each product aspect from reviews. The task is technically challenging because it is both context-aware and domain-dependent [21,22]. In reviews, users may describe the same product aspect using different expressions. For instance, in reviews of televisions, the expressions "screen" and "LED" refer to the same aspect, the television display. Additionally, the same opinion expression may convey opposite sentiment polarities in different domains. For example, the word "small" in the expression "the small MP3 is portable" carries a positive sentiment, while it carries a negative sentiment in the expression "the bed in hotel is small".
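The domain dependence described above can be made concrete with a small sketch (not from the paper): per-domain sentiment lexicons assign the same opinion word opposite polarities in different domains. The domain names and lexicon entries are illustrative.

```python
# Toy illustration of domain-dependent sentiment: the same opinion word
# can flip polarity across domains, so one global lexicon is not enough.
DOMAIN_LEXICONS = {
    "mp3_player": {"small": "positive"},  # "the small MP3 is portable"
    "hotel":      {"small": "negative"},  # "the bed in hotel is small"
}

def polarity(word, domain):
    """Look up the sentiment polarity of an opinion word within a domain."""
    return DOMAIN_LEXICONS.get(domain, {}).get(word, "neutral")

print(polarity("small", "mp3_player"))  # positive
print(polarity("small", "hotel"))       # negative
```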
To perform aspect-based opinion mining, the pioneering works of [1,2] proposed a framework that is now widely used. In this framework, the opinion-mining task is broken down into two major subtasks: aspect extraction and sentiment classification. First, aspect extraction identifies expressions that describe aspects of products (which we call aspect expressions in this paper) and groups semantically related expressions together. The second subtask, sentiment classification, recognizes the opinions associated with each aspect and then analyzes the sentiment polarity of each aspect [1,19,10,7]. Since the aspects extracted in the first subtask are the basis of analysis in the second, the quality of opinion mining is significantly influenced by the performance of aspect extraction.
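The two-stage framework can be sketched as follows. This is a minimal toy pipeline, not the method of [1,2]: the aspect vocabulary, opinion-word lists, and rule-based classifier are all hypothetical stand-ins for the real subtask components.

```python
# Minimal sketch of the two-subtask framework:
# (1) aspect extraction, (2) sentiment classification per aspect.
import re

POSITIVE = {"great", "portable", "sharp"}      # assumed opinion words
NEGATIVE = {"small", "noisy", "dim"}
ASPECT_TERMS = {"screen", "battery", "bed"}    # assumed aspect vocabulary

def extract_aspects(sentence):
    """Subtask 1: find aspect expressions mentioned in a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t in ASPECT_TERMS]

def classify_sentiment(sentence):
    """Subtask 2: crude lexicon-based sentence polarity."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    score = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def mine_opinions(review_sentences):
    """Attach sentence-level polarity to each extracted aspect."""
    summary = {}
    for s in review_sentences:
        label = classify_sentiment(s)
        for a in extract_aspects(s):
            summary.setdefault(a, []).append(label)
    return summary

print(mine_opinions(["The screen is great", "The battery is noisy"]))
# {'screen': ['positive'], 'battery': ['negative']}
```

Because subtask 2 consumes the output of subtask 1, any aspect missed or mis-grouped in extraction is lost to the sentiment summary, which is the dependence the paragraph above points out.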
2.2. Traditional methods of aspect extraction
The key issues in aspect extraction are the identification of aspect expressions and the categorization of semantically related expressions. To address these issues, traditional frequent-term-based approaches extract frequent nouns and noun phrases as product aspects [1,9–12]. A well-known limitation of these methods is that they do not categorize related expressions according to their semantic content: different attributes of the same product aspect and domain-specific synonymous expressions are treated as distinct aspects. The aspects extracted by these methods are therefore too fine-grained; they lack organization and are thus of limited help in providing useful information to users of such systems.
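A frequent-term extractor of the kind cited above can be sketched in a few lines. This is an illustrative simplification: without a POS tagger, a hypothetical candidate-noun list stands in for noun detection, and the minimum-support threshold is arbitrary.

```python
# Sketch of frequent-term aspect extraction: count candidate nouns across
# reviews and keep those above a minimum-support threshold.
from collections import Counter
import re

CANDIDATE_NOUNS = {"screen", "led", "battery", "price"}  # assumed noun list

def frequent_aspects(reviews, min_support=2):
    counts = Counter(
        t for r in reviews for t in re.findall(r"[a-z]+", r.lower())
        if t in CANDIDATE_NOUNS
    )
    return {t for t, c in counts.items() if c >= min_support}

reviews = ["Great screen", "The screen is sharp", "Battery is ok"]
print(frequent_aspects(reviews))  # {'screen'}
```

Note that such a counter would keep "screen" and "led" as separate aspects even when they refer to the same display, which is exactly the fine-grainedness limitation discussed above.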
Some methods categorize fine-grained aspects based on lexical resources [19,23]. However, they have limited power to resolve the problem of domain dependence, since the coverage of expressions in lexical resources is often limited. In addition, some lexical synonyms may not describe the same aspect in certain domains. For example, the words "view" and "opinion" may be listed as synonyms in a dictionary, yet in reviews of cameras the word "view", meaning the extent or range of vision, is entirely unrelated to the word "opinion".
Other methods employ association rules and contextual information to cluster semantically related aspects [24,25]. However, the groups generated by these non-hierarchical clustering algorithms may not be uniform, because high-frequency terms produce larger clusters than low-frequency terms. In [22,26], Zhai et al. grouped pre-extracted aspect expressions using both lexical correlation and contextual similarity. However, these studies assumed that the pre-extracted aspect expressions were correct and thus did not treat aspect-expression identification as part of the method.
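Grouping by contextual similarity, in the spirit of [22,26], can be illustrated with bag-of-words context vectors and cosine similarity. The corpus, the whole-sentence context window, and the comparison pairs are all illustrative, not taken from those papers.

```python
# Sketch of contextual-similarity grouping: expressions that occur with
# similar surrounding words get high cosine similarity and can be merged.
from collections import Counter
import math
import re

def context_vector(expr, sentences):
    """Count the words co-occurring with expr across all sentences."""
    ctx = Counter()
    for s in sentences:
        toks = re.findall(r"[a-z]+", s.lower())
        if expr in toks:
            ctx.update(t for t in toks if t != expr)
    return ctx

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

sentences = ["the screen is bright and sharp",
             "the led is bright and sharp",
             "the price is too high"]
for x, y in [("screen", "led"), ("screen", "price")]:
    sim = cosine(context_vector(x, sentences), context_vector(y, sentences))
    print(x, y, round(sim, 2))
```

Here "screen" and "led" share identical contexts and would be merged into one aspect, while "price" would not, which is the behavior the lexical-resource methods above cannot achieve for domain-specific synonyms.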
2.3. Topic modeling
A topic model is a hierarchical Bayesian model. It introduces a latent variable, the topic, between the observed variables, documents and words, in order to analyze the semantic topic distribution of documents. In topic models, each document is represented as a random mixture over latent topics, where each topic is characterized by a distribution over words [13]. Topic models are now widely used to perform dimensionality reduction in information retrieval.
PLSA (Probabilistic Latent Semantic Analysis) [27] and LDA (Latent Dirichlet Allocation) [13] are two widely used models. In PLSA, each document is represented as a vector of topic proportions. However, PLSA has no probabilistic model for these proportions. Consequently, the number of parameters in PLSA grows linearly with the size of the corpus, which may lead to overfitting. Furthermore, PLSA provides no generative process for assigning probabilities to new documents outside the training set [13].
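The linear parameter growth of PLSA can be made concrete with a back-of-envelope count. The topic, vocabulary, and corpus sizes below are arbitrary illustrations, and the counts are the standard first-order accounting, not figures from [13].

```python
# Back-of-envelope comparison of parameter counts:
# PLSA fits a topic mixture per training document (D*K parameters on top
# of the K*V topic-word probabilities), so its total grows with corpus
# size D; LDA replaces the per-document mixtures with a K-dimensional
# Dirichlet prior, so its parameter count is independent of D.
def plsa_params(num_docs, num_topics, vocab_size):
    return num_topics * vocab_size + num_docs * num_topics

def lda_params(num_topics, vocab_size):
    return num_topics * vocab_size + num_topics

for d in (1_000, 10_000):
    print(d, plsa_params(d, 50, 20_000), lda_params(50, 20_000))
```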
To address these problems inherent in PLSA, Blei et al. propose the LDA model in [13]. They define a Dirichlet probabilistic generative process for the document-topic distribution. In each document, a latent aspect $z_i \in Z$ is chosen according to the multinomial distribution $\theta = P(z \mid \alpha)$, which is controlled by the Dirichlet prior $\alpha$. Given
Fig. 1. An example of product descriptions from Newegg.com.
88 T. Wang et al. / Knowledge-Based Systems 71 (2014) 86–100
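The LDA generative process described above can be sketched with toy dimensions: draw per-document topic proportions $\theta \sim \mathrm{Dir}(\alpha)$, then for each word position draw a topic $z_i$ from $\theta$ and a word from that topic's word distribution. All sizes and hyperparameter values here are arbitrary illustrations.

```python
# Sketch of the LDA generative process of Blei et al. [13]:
# theta ~ Dirichlet(alpha); z_i ~ Mult(theta); w_i ~ Mult(phi[z_i]).
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 8, 10                       # topics, vocab size, words/doc
alpha = np.full(K, 0.5)                        # Dirichlet prior over topics
phi = rng.dirichlet(np.full(V, 0.1), size=K)   # K topic-word distributions

def generate_document():
    theta = rng.dirichlet(alpha)                         # topic proportions
    topics = rng.choice(K, size=doc_len, p=theta)        # z_i ~ Mult(theta)
    words = [rng.choice(V, p=phi[z]) for z in topics]    # w_i ~ Mult(phi_z)
    return topics, words

topics, words = generate_document()
print(len(words))  # 10
```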