2812 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 46, NO. 12, DECEMBER 2016
tweets containing the word w over all the N(k) tweets col-
lected at time interval k. We define the conditional word
tweet frequency (CWTF) at time instance k and for a given
word w as
CWTF(k, w) = N
(w)
(k)/N(k). (1)
The main difference of CWTF from the classical description
of TF is that, here we count the number of tweets that contain
a specific word within the current examined time interval k
instead of counting the number of times that a word appears
within a document. That is, all tweets that contain the specific
word contribute the same to the calculation of CWTF. Thus,
CWTF models a conditional distribution of tweets fre-
quency, i.e., tweets under the condition that they contain
the word w.
We define the inverse trend word tweet frequency (ITWTF)
as a metric that assesses how frequently tweet posts contain
the specific word w over p previous time intervals, (t
k−1
−
β, t
k−1
],...,(t
k−p
− β, t
k−p
]. In particular, we have that
ITWTF(k, p, w) = log
p
i=1
N(k − i)
p
i=1
N
(w)
(k − i)
. (2)
In contrast to the conventional IDF score, ITWTF is a time
varying metric that evolves as new time intervals are taken
into account. A word that is rarely frequent up to the cur-
rent examined time interval k will receive high values of
ITWTF. However, if this word becomes trendy at the current
time interval k, the CWTF score will take high values, forcing
the product CWTF*ITWTF to be high. As long as this word
remains trendy, in the forthcoming time intervals the ITWTF
score will start to decay forcing the product CWTF*ITWTF
to start decreasing as well. This means that, events that
have been extracted as salient at previous stages will start
to have less impact in the forthcoming stages. The product
CWTF*ITWTF is the first tweet-based information theoretic
metric ϑ
1
(k, w)
ϑ
1
(k, w) =
N
(w)
(k)
N(k)
· log
p
i=1
N(k − i)
p
i=1
N
(w)
(k − i)
. (3)
2) Word Frequency–Inverse Trend Word Frequency: The
second metric ϑ
2
(k, w) considers the frequency of appearance
of w in the tweets within the kth interval, denoted by C
(w)
(k).
We also denote by C(k) the total number of words that appear
within the N(k) tweets. Then, metric ϑ
2
(k, w) is defined as
ϑ
2
(k, w) =
C
(w)
(k)
C(k)
· log
p
i=1
C(k − i)
p
i=1
C
(w)
(k − i)
. (4)
The first term of (4) is designed to measure word fre-
quency (WF) appearance at the current kth time interval, while
the second term expresses the ITWF score, making ϑ
2
(k, w)
also a time varying signal. The main difference between the
metrics ϑ
1
(k, w) and ϑ
2
(k, w) is that in ϑ
1
(k, w) the signifi-
cance of a word over the corpus of tweets at time interval k is
independent of the number of words a tweet has, with tweets
of few or many words contributing equally to the metric.
The opposite holds for metric ϑ
2
(k, w) of (4).
Fig. 1. Operation of the proposed fuzzy representation.
3) Weighted Conditional Word Tweet Frequency–Inverse
Trend Weighted Conditional Word Tweet Frequency: The third
metric, ϑ
3
(k, w), considers Twitter specific parameters, such as
the number of followers and retweets. The number of follow-
ers indicates authors’ credibility. The number of retweets is
a metric for ranking the importance of the textual content. In
particular, we denote by f
m
(k), m = 1,...,N(k), the num-
ber of followers for the mth tweet at time k, and as r
m
(k)
the number of retweets. Then, p
f
m
(k) = f
m
(k)/
N(k)
m=1
f
m
(k)
and p
r
m
(k) = r
m
(k)/
N(k)
m=1
r
m
(k) are their normalized values.
Then
ϑ
3
(k, w)
=
N(k)
m=1
p
f
m
(k) · p
r
m
(k) · i
m
(w, k)
N(k)
m=1
p
f
m
(k) · p
r
m
(k)
× log
p
j=1
N(k−j)
m=1
p
f
m
(k − j) · p
r
m
(k − j)
p
j=1
N(k−j)
m=1
p
f
m
(k − j) · p
r
m
(k − j) · i
m
(w, k − j)
(5)
where i
m
(w, k) is an indicator function that equals one if the
mth tweet contains the word w, and zero otherwise.
C. Fuzzy Tweet-Based Representation
We form a time series signal, denoted as x
w
(k), that contains
the tweet-based information theoretic metrics of (3)–(5) over
a time period of time intervals
x
w
(k) =
[
ϑ(k, w)ϑ(k − 1, w) ···
]
T
. (6)
In (6), variable ϑ(k, w) refers to one of the three metrics
defined in (3)–(5). Each element ϑ(k, w) of the time series
signal x
w
(k) expresses the degree of importance of word w at
the kth time interval and in a nonfuzzy representation is calcu-
lated independently of each other. However, in our proposed
fuzzy representation, metric ϑ(k, w) for the kth interval is dif-
fused over K previous intervals but with a different degree of
membership for each interval
ϑ
f
(k, w) =
K−1
i=0
ϑ(k − i, w)
∗
μ
k
(k − i) (7)
where subscript f denotes the fuzzy representation of the
respective metric and μ
k
(k−i) is the fuzzy membership degree
for the (k−i)th time interval. The μ
k
takes values in the range
[0, 1]. Usually triangular functions are used to obtain values
of μ
k
but any other fuzzy function can be also adopted. Values
of μ
k
near unity (zero) indicate high (low) degree of member-
ship of the metric. Other types of diffusion methods can also