Recursive Autoencoder with HowNet Lexicon for
Sentence-Level Sentiment Analysis
Xianghua Fu
College of Computer Science and Software Engineering
Shenzhen University, Shenzhen Guangdong
518060, China
fuxh@szu.edu.cn
Yingying Xu
College of Computer Science and Software Engineering
Shenzhen University, Shenzhen Guangdong
518060, China
yingyingyulia@foxmail.com
ABSTRACT
Semantic word representations have proven very useful but usually ignore syntactic relationships. In the task of sentiment analysis, compositional vector representations require more structural information from natural language text and richer supervised training to make more accurate predictions. However, labeled data are generally expensive to acquire in practice. To remedy this, we propose a new method that trains our model on fully labeled parse trees with supervised learning but without manual annotation. Our method not only significantly reduces the burden of manual labeling, but also allows the compositional representations to capture syntactic and semantic information jointly. We show the effectiveness of this model on the task of sentence-level sentiment classification and conduct preliminary experiments to investigate its performance. The model accurately predicts the sentiment distribution and outperforms other approaches.
CCS Concepts
• Information systems ➝ Information retrieval ➝ Retrieval tasks and goals ➝ Sentiment analysis • Information systems ➝ Information systems applications ➝ Data mining • Computing methodologies ➝ Artificial intelligence ➝ Natural language processing.
Keywords
Sentiment Analysis; Deep Learning; HowNet Lexicon; Parse Tree;
Word Embedding; Data Mining; Sentiment Label.
1. INTRODUCTION
Sentiment analysis is the task of identifying the subjectivity, polarity (positive or negative) and polarity strength of a piece of text. The granularity of the analysis varies with the subjective text being considered. In this research, we target the task of sentence-level sentiment analysis, which aims to classify the sentiment polarity (such as positive or negative) of a sentence based on its textual content.
Most previous studies follow Pang et al.'s approach [14] and regard sentiment analysis as a special case of text categorization. Traditional methods mainly adopt bag-of-words representations, which are better suited to longer documents because they can rely on a few words with strong sentiment such as 'awesome' or 'exciting', but may not be optimal for short messages. With the deepening research on vector representations in recent years, word embeddings for sentiment analysis have attracted wide attention. Unlike primitive word representations, a word embedding represents a single word as a dense, low-dimensional vector in a meaning space [2]. However, since embeddings only represent individual words, semantic composition must be considered to represent phrases and sentences. Socher et al. [18] exploit hierarchical structure and use compositional semantics to understand sentiment. However, two problems remain. (1) They use a greedy approximation to construct the tree structure, which does not necessarily follow standard syntactic constraints. (2) The sentiment labels of internal nodes, needed to compute the (cross-entropy) loss, are missing. Further progress towards understanding compositionality in tasks such as sentiment analysis requires richer supervised training.
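To illustrate the kind of composition such models perform, the sketch below (our own notation, not the exact formulation of [18]) embeds two words and composes their vectors into a parent vector p = tanh(W[c1; c2] + b); because the parent has the same dimensionality as a word vector, the step can be applied recursively over a tree. The dimensionality, toy vocabulary and random parameters are illustrative assumptions.

import numpy as np

d = 50                                   # embedding dimensionality (illustrative)
rng = np.random.default_rng(0)

# toy embedding table; in practice these vectors come from pre-trained word embeddings
vocab = {"not": 0, "bad": 1}
E = rng.normal(scale=0.1, size=(len(vocab), d))

# composition parameters: W maps the concatenated children back to dimension d
W = rng.normal(scale=0.1, size=(d, 2 * d))
b = np.zeros(d)

def compose(c1, c2):
    """One recursive composition step: parent = tanh(W [c1; c2] + b)."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

# phrase vector for "not bad", built from its two word vectors
p = compose(E[vocab["not"]], E[vocab["bad"]])
print(p.shape)  # (50,) -- same size as a word vector, so composition can recurse up the tree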
Socher et al. [19] subsequently introduced the Sentiment Treebank, the first corpus with fully labeled parse trees; when trained on this new Treebank, even baseline methods achieve improvements. However, the high cost of manually annotating training data for supervised learning imposes a significant burden on their usage.
In order to overcome the above problems, we propose a novel recursive autoencoder model. The major differences of our model are as follows:
(1) Rather than manually annotating sentiment labels for nonterminal nodes, we use the HowNet lexicon to compute every node's polarity (a minimal sketch of this labeling step follows this list). This significantly reduces the burden of manual labeling.
(2) Instead of constructing a binary tree with a greedy algorithm, we represent the structure of sentences using syntax trees. In this way, the feature representations can capture as much structural information as possible.
(3) The characteristics of Chinese make sentiment classification difficult, and many previous works only evaluate on English datasets. In our experiments, the evaluation datasets contain not only English but also Chinese.
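To make contribution (1) concrete, the following minimal sketch shows one possible form of lexicon-driven node labeling over a parse tree: leaves take their polarity from a HowNet-style lexicon and internal nodes aggregate their children's scores. The toy lexicon entries, the tree, and the sign-aggregation rule are illustrative assumptions, not the exact procedure of our model or of HowNet.

LEXICON = {"好": 1.0, "喜欢": 1.0, "差": -1.0, "讨厌": -1.0}   # toy polarity scores ("good", "like", "bad", "hate")

class Node:
    def __init__(self, word=None, children=()):
        self.word = word              # set for leaf nodes only
        self.children = list(children)
        self.label = None             # 0 = negative, 1 = positive

def label_tree(node):
    """Assign a polarity label to every node, computed bottom-up from the leaves."""
    if node.word is not None:                          # leaf: look the word up in the lexicon
        score = LEXICON.get(node.word, 0.0)
    else:                                              # internal node: aggregate child scores
        score = sum(label_tree(child) for child in node.children)
    node.label = 1 if score >= 0 else 0
    return score

# usage: a two-word phrase whose parent node inherits the dominant child polarity
tree = Node(children=[Node(word="喜欢"), Node(word="好")])
label_tree(tree)
print(tree.label)  # 1 (positive)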
The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the model in detail. Experiments and evaluations are reported in Section 4, and Section 5 concludes the paper.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial
advantage and that copies bear this notice and the full citation on the
first page. Copyrights for components of this work owned by others
than ACM must be honored. Abstracting with credit is permitted.
To copy otherwise, or republish, to post on servers or to redistribute
to lists, requires prior specific permission and/or a fee. Request
permissions from Permissions@acm.org.
ASE BD&SI 2015, October 07-09, 2015, Kaohsiung, Taiwan
© 2015 ACM. ISBN 978-1-4503-3735-9/15/10…$15.00
DOI: http://dx.doi.org/10.1145/2818869.2818908