Cross-modality Consistent Regression for Joint Visual-Textual Sentiment
Analysis
Anonymous CVPR submission
Paper ID 262
Abstract
Sentiment analysis of online user-generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems that predict political elections, measure economic indicators, and so on. Recently, however, social media users have increasingly been attaching images and videos to their posts to express their opinions and share their experiences. Sentiment analysis of such large-scale textual and visual content can help better extract user sentiments toward events or topics. Motivated by the need to leverage large-scale social multimedia content for sentiment analysis, we propose a cross-modality consistent regression (CCR) model that utilizes both state-of-the-art visual and textual sentiment analysis techniques. We first fine-tune a convolutional neural network (CNN) for image sentiment analysis and train a paragraph vector model for textual sentiment analysis. On top of them, we train our multi-modality regression model. We use sentiment-related queries to obtain half a million training samples from Getty Images. We have conducted extensive experiments on both weakly (machine) labeled and manually labeled image tweets. The results show that the proposed model achieves better performance than state-of-the-art textual and visual sentiment analysis algorithms alone.
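To make the pipeline concrete, the following is a minimal, hypothetical sketch (in PyTorch) of how a CCR-style model could sit on top of pre-trained image and text features. The layer sizes, the averaging fusion, and the MSE-plus-consistency objective are all assumptions made for illustration; the paper's actual CCR formulation is defined in later sections.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CCRHead(nn.Module):
    # Regression heads on top of pre-computed features: CNN features for
    # the image and paragraph-vector features for the text (both feature
    # extractors are assumed trained beforehand, as the abstract describes).
    def __init__(self, img_dim=4096, txt_dim=400):
        super().__init__()
        self.img_reg = nn.Linear(img_dim, 1)  # image sentiment score
        self.txt_reg = nn.Linear(txt_dim, 1)  # text sentiment score

    def forward(self, img_feat, txt_feat):
        s_img = self.img_reg(img_feat)
        s_txt = self.txt_reg(txt_feat)
        s_fused = 0.5 * (s_img + s_txt)  # simple average fusion (assumption)
        return s_img, s_txt, s_fused

def ccr_loss(s_img, s_txt, y, lam=0.1):
    # Per-modality regression losses plus a cross-modality consistency
    # penalty that pulls the two modality scores toward each other
    # (our reading of "cross-modality consistent"; weights are illustrative).
    return (F.mse_loss(s_img, y) + F.mse_loss(s_txt, y)
            + lam * F.mse_loss(s_img, s_txt))

Under this sketch, the half-million Getty Images samples mentioned above would supply (img_feat, txt_feat, y) triples, and lam would trade off per-modality accuracy against cross-modality agreement.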
1. Introduction
The increasing popularity of social networks has attracted more and more people to share their experiences and express their opinions on virtually all events and subjects on online social network platforms. Each day, billions of messages and posts are generated. In this study, we focus on deriving people's opinions or sentiments toward topics and events happening in the real world. In other words, we are interested in the automatic detection of sentiment from online user-generated content.
[Figure 1. Examples of image tweets from Twitter. (a) "PD Achilles meets a new friend. Special post for one of our followers who I met last night and had a good chat to"; (b) "If anyone woke up in edinburgh this morning to discover their car missing i think i know where it is"; (c) "Hello there sweetie. :)"]

Figure 1 shows several example image tweets from Twitter. Image tweets refer to tweets that contain images.
If we take a look at these three example image tweets, we can observe the following: in (a), both the image and the text indicate that the tweet carries a positive sentiment; in (b), while it is difficult to tell the sentiment from the image, the text tells us that the tweet expresses a positive sentiment; in (c), on the contrary, it is hard to tell the sentiment from the text, but the worn-out car in the image suggests an overall negative sentiment. These examples illustrate the motivation for our work: we would like to learn people's overall sentiment toward the same object from the different modalities the user provides. In particular, we focus on inferring people's sentiment from the available images and the accompanying short, informal text.
Many researchers have contributed to sentiment analysis. For instance, there are related works on detecting users' sentiment and applying sentiment analysis to predict box-office revenues for movies [1], political elections [22, 28], and economic indicators [3, 31]. In particular, recently published works have started to focus on analyzing the sentiment of informal user-generated content from online social networks. However, current techniques mostly detect sentiment through analysis of textual content alone. On the other hand, visual content, including both images and videos, is becoming increasingly popular on all mainstream online