Make It Possible: Multilingual Sentiment Analysis
without Much Prior Knowledge
Zheng Lin, Xiaolong Jin, Xueke Xu, Yuanzhuo Wang, Songbo Tan, Xueqi Cheng
CAS Key Laboratory on Network Data Science and Technology,
Institute of Computing Technology, Chinese Academy of Sciences,
Beijing, 100190, China
Email: {linzheng, jinxiaolong, xuxueke, wangyuanzhuo, tansongbo, cxq}@ict.ac.cn
Abstract—Sentiment analysis is a hard problem, while mul-
tilingual sentiment analysis is even harder due to the different
expression styles in different languages. Although many methods
for multilingual sentiment analysis have been developed in the
open literature, most of them suffer from two major problems.
The first is their excessive dependence on external tools or
resources (e.g., machine translation systems or bilingual dictio-
naries), which may not be readily obtained, especially for minority
languages. The second is conflictive sentiments, i.e., the sentiment
polarity of some parts of a text is inconsistent with its overall
sentiment polarity. It is observed that in a product or service
review there usually exist a few sentences which play a more
important role in determining its sentiment polarity, as compared
to others. Therefore, differentiating key sentences from trivial
ones may be helpful to improve sentiment analysis. Inspired by
this observation, in this paper we propose a novel framework to
estimate the sentiment polarity of reviews by virtue of opinion
lexica and key sentences automatically extracted from unlabelled
data. This framework can not only overcome the problem of
excessive dependence on external resources, but is also able to
capture the overall sentiment polarity of reviews. Experimental
results on realistic review datasets demonstrate that the proposed
framework is effective and competitive with the representative
baselines.
I. INTRODUCTION
Sentiment analysis [10] aims to automatically identify the
sentiment polarity of given texts, which has broad applications,
including recommendation systems [23], sentiment summa-
rization [7], opinion retrieval [17], and so on. Given the
explosively growing number of online reviews in different lan-
guages, multilingual sentiment analysis has recently attracted a
great deal of attention from both academia and industry [3],
[8], [16], [26]. According to the resources employed, existing
methods for multilingual sentiment analysis can basically be
categorized into two types, namely, machine-translation-based
methods and bilingual-dictionary-based methods.
Machine translation (MT) has been widely employed in
cross-language related work. For example, it is often used to
translate the labelled data in a source language into a target
language [2], [4], [25]. However, such machine-translation-
based methods are confronted with three problems. First, they
are inefficient when dealing with massive data. Second, current
MT systems are not powerful enough to achieve accurate results;
in particular, they usually generate only one best translation, which
may not be suitable for the situation at hand. Third, the models
used in statistical MT rely on a set of characteristics observed
on training examples, but large-scale bilingual parallel corpora
for a specific domain are not available in some cases.
Utilizing bilingual dictionaries [12] in multilingual senti-
ment analysis could be as effective as methods using a high-
quality MT system [19], [22]. Bilingual dictionaries can not
only reduce the workload of labelling data, but also allow one
to integrate various term weighting and selection methods.
However, comprehensive bilingual dictionaries may not always
be available, especially for minority language pairs, and
generating a bilingual dictionary is difficult and laborious.
In addition to the above issue of resource dependency,
another grand challenge of multilingual sentiment analysis is
sentiment analysis itself. Sentiment analysis is a hard problem,
because many reviews are sentimentally ambiguous for many
reasons. For instance, objective statements interleaved with
subjective statements can be confusing for learning methods,
and subjective statements with conflictive sentiments make
sentiment analysis even more complicated [29]. Take a book
review for example:
This book is beautiful.
......
Zusak’s novel, set in a small town outside Munich during World War II,
chronicles the story of Liesel Meminger, a German girl taken into Hans
Huberman’s household as a foster child. As likeable as she is well-developed,
it’s amazing to watch a young girl like that remain so strong in the face of
human tragedy, impossible hatred......
Here, the reader describes the trivial plot using negative
words such as “war” and “tragedy”. However, s/he enthusiastically
expresses that s/he likes the book at the beginning of the
review. In this case, the overall sentiment polarity of the review
is positive, but is apt to be labelled as a negative one if
all sentences are treated equally. In the case of multilingual
sentiment analysis where the different expression styles in
different languages and cultures are considered, the conflictive
sentiments problem becomes more difficult.
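The effect described above can be illustrated with a toy sketch. It is not the framework proposed in this paper: the tiny lexicon, the shortened review, the hand-picked weights, and the function names are all assumptions made purely for illustration. It only shows how equal treatment of sentences can flip the overall polarity, whereas emphasizing a key sentence preserves it.

```python
import re

# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"beautiful"}
NEGATIVE = {"war", "tragedy", "hatred"}

def sentence_score(sentence):
    """Count positive words minus negative words in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def review_score(sentences, weights=None):
    """Weighted sum of sentence scores; equal weights by default."""
    if weights is None:
        weights = [1.0] * len(sentences)
    return sum(w * sentence_score(s) for w, s in zip(weights, sentences))

# Shortened paraphrase of the book review quoted above.
review = [
    "This book is beautiful.",
    "The novel is set near Munich during World War II.",
    "The girl remains strong in the face of human tragedy and impossible hatred.",
]

# All sentences treated equally: plot words dominate.
equal = review_score(review)
# Assumed weights that emphasize the opening (key) sentence.
keyed = review_score(review, weights=[5.0, 0.5, 0.5])

print(equal, keyed)
```

With equal weights the plot description outweighs the single opinion sentence and the score is negative; once the opening sentence is weighted more heavily the score becomes positive. How key sentences are actually identified is a separate question that this sketch does not address.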
To solve the above problems, in this paper we propose
a novel multilingual sentiment analysis framework. In the
proposed framework, no manually labelled corpus is needed
and all extracted information is domain-dependent. In general,
the contributions of this study can be summarized as follows:
1) We propose a statistical method for opinion lexicon
extraction based on a few seed words, which can
be easily transplanted to almost any language and
does not need to refer to synonym and antonym
dictionaries;
2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)
978-1-4799-4143-8/14 $31.00 © 2014 IEEE
DOI 10.1109/WI-IAT.2014.83