Twitter Event Detection Algorithms: Intelligent Identification of Short-Term and Long-Term Trends


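This document examines event detection on the Twitter microblogging platform. The paper appeared in IEEE Transactions on Cybernetics, vol. 46, no. 12, December 2016, a journal with high academic impact (impact factor above 10). With millions of tweets posted every day, users struggle to pick out the content that interests them, which motivates effective event detection algorithms for Twitter. The authors address this with two key components: first, fuzzy-represented, time-evolving tweet-based information theoretic metrics that capture Twitter dynamics; second, the Riemannian distance applied to word signatures, which reduces temporal effects caused by submission delays.

The paper first addresses short-term event detection, that is, discovering in real time what is currently happening. To this end, the authors build tweet-based information metrics that, through a fuzzy representation and a dynamic update mechanism, reflect current hot topics and trends. The metrics take into account not only the content of tweets but also how they propagate through the network and how influential they are.

For long-term review, the paper proposes reviewing the most salient recently submitted events in order to capture a broader picture of past events. This relies on the same quantification and organization of tweet information, but over a longer time span that involves analyzing and summarizing historical data.

Events are detected through a multiassignment graph partitioning algorithm. The algorithm aims to retain maximum coherence within each cluster while allowing a word to be associated with several event clusters, which makes event identification more flexible and accurate. This helps identify related events that have several facets or span multiple topics.

The authors validate the approach empirically on large real-life datasets. The reported results show that, compared with traditional methods, the event detection strategy based on tweet metrics and the Riemannian distance performs better in both accuracy and efficiency, helping users filter and understand the Twitter information stream.

In summary, the paper contributes a framework for discovering and understanding events quickly and accurately in a fast-changing social media environment, with practical value for information aggregation, public opinion monitoring, and social media marketing.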


2812 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 46, NO. 12, DECEMBER 2016

tweets containing the word w over all the N(k) tweets collected at time interval k. We define the conditional word tweet frequency (CWTF) at time instance k and for a given word w as

$$\mathrm{CWTF}(k, w) = \frac{N^{(w)}(k)}{N(k)}. \tag{1}$$

The main difference of CWTF from the classical description of TF is that, here we count the number of tweets that contain a specific word within the current examined time interval k instead of counting the number of times that a word appears within a document. That is, all tweets that contain the specific word contribute the same to the calculation of CWTF. Thus, CWTF models a conditional distribution of tweets frequency, i.e., tweets under the condition that they contain the word w.
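As a minimal sketch of (1), the snippet below computes CWTF for one time interval. It assumes each tweet has already been split into a list of tokens; the tokenization, the function name `cwtf`, and the convention of returning zero for an empty interval are illustrative assumptions, not part of the paper.

```python
def cwtf(tweets, word):
    """CWTF(k, w) = N^(w)(k) / N(k) for the tweets of one interval.

    `tweets` is a list of tokenized tweets (tokenization is assumed).
    """
    if not tweets:
        return 0.0  # assumed convention for an empty interval
    # Each tweet counts once, however many times it repeats the word.
    containing = sum(1 for tokens in tweets if word in tokens)
    return containing / len(tweets)


interval_k = [
    ["goal", "goal", "goal"],  # repeated word still counts as one tweet
    ["what", "a", "goal"],
    ["rain", "in", "athens"],
    ["traffic", "jam"],
]
print(cwtf(interval_k, "goal"))  # 0.5: two of the four tweets contain the word
```

The first toy tweet repeats the word three times yet contributes exactly as much as the second, which is the difference from classical TF described above.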

We define the inverse trend word tweet frequency (ITWTF) as a metric that assesses how frequently tweet posts contain the specific word w over p previous time intervals, $(t_{k-1} - \beta, t_{k-1}], \ldots, (t_{k-p} - \beta, t_{k-p}]$. In particular, we have that

$$\mathrm{ITWTF}(k, p, w) = \log \frac{\sum_{i=1}^{p} N(k-i)}{\sum_{i=1}^{p} N^{(w)}(k-i)}. \tag{2}$$

In contrast to the conventional IDF score, ITWTF is a time varying metric that evolves as new time intervals are taken into account. A word that is rarely frequent up to the current examined time interval k will receive high values of ITWTF. However, if this word becomes trendy at the current time interval k, the CWTF score will take high values, forcing the product CWTF*ITWTF to be high. As long as this word remains trendy, in the forthcoming time intervals the ITWTF score will start to decay forcing the product CWTF*ITWTF to start decreasing as well. This means that, events that have been extracted as salient at previous stages will start to have less impact in the forthcoming stages. The product CWTF*ITWTF is the first tweet-based information theoretic metric $\vartheta_1(k, w)$

$$\vartheta_1(k, w) = \frac{N^{(w)}(k)}{N(k)} \cdot \log \frac{\sum_{i=1}^{p} N(k-i)}{\sum_{i=1}^{p} N^{(w)}(k-i)}. \tag{3}$$
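The decay described above can be illustrated with a short sketch of (2) and (3). The helper names, the toy data, and the clamp that treats a word unseen in the history window as one occurrence are assumptions made for illustration; the source does not say how a zero denominator is handled.

```python
import math


def tweets_with(tweets, word):
    """N^(w): number of tweets in one interval that contain `word`."""
    return sum(1 for tokens in tweets if word in tokens)


def itwtf(history, word):
    """Eq. (2) over the p previous intervals listed in `history`."""
    total = sum(len(tweets) for tweets in history)
    # Assumed clamp: a word unseen in the window counts as one occurrence,
    # which avoids division by zero (not specified in the source).
    containing = max(sum(tweets_with(tweets, word) for tweets in history), 1)
    return math.log(total / containing) if total else 0.0


def theta1(current, history, word):
    """Eq. (3): CWTF of the current interval times ITWTF of the history."""
    if not current:
        return 0.0
    return tweets_with(current, word) / len(current) * itwtf(history, word)


def make_interval(n_total, n_with, word="goal"):
    """Toy interval: n_with tweets contain `word`, the rest do not."""
    return [[word]] * n_with + [["other"]] * (n_total - n_with)


# The word is rare for three intervals, then trendy for three.
intervals = [make_interval(100, n) for n in (1, 1, 1, 50, 50, 50)]
p = 3
scores = [theta1(intervals[k], intervals[k - p:k], "goal") for k in range(p, 6)]
print([round(s, 3) for s in scores])
```

In this toy run the score is highest in the first trendy interval and falls in each of the next two, because the trendy intervals enter the history window and lower ITWTF while CWTF stays at 0.5.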

2) Word Frequency–Inverse Trend Word Frequency: The second metric $\vartheta_2(k, w)$ considers the frequency of appearance of w in the tweets within the kth interval, denoted by $C^{(w)}(k)$. We also denote by C(k) the total number of words that appear within the N(k) tweets. Then, metric $\vartheta_2(k, w)$ is defined as

$$\vartheta_2(k, w) = \frac{C^{(w)}(k)}{C(k)} \cdot \log \frac{\sum_{i=1}^{p} C(k-i)}{\sum_{i=1}^{p} C^{(w)}(k-i)}. \tag{4}$$

The first term of (4) is designed to measure word frequency (WF) appearance at the current kth time interval, while the second term expresses the ITWF score, making $\vartheta_2(k, w)$ also a time varying signal. The main difference between the metrics $\vartheta_1(k, w)$ and $\vartheta_2(k, w)$ is that in $\vartheta_1(k, w)$ the significance of a word over the corpus of tweets at time interval k is independent of the number of words a tweet has, with tweets of few or many words contributing equally to the metric. The opposite holds for metric $\vartheta_2(k, w)$ of (4).
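A minimal sketch of (4) makes the contrast with the tweet-count metric concrete. The function names, the tokenized-tweet input, and the guards for empty intervals or unseen words are assumptions for illustration only.

```python
import math


def word_count(tweets, word):
    """C^(w): occurrences of `word` across all tweets of one interval."""
    return sum(tokens.count(word) for tokens in tweets)


def total_words(tweets):
    """C: total number of words across all tweets of one interval."""
    return sum(len(tokens) for tokens in tweets)


def theta2(current, history, word):
    """Eq. (4): word frequency times inverse trend word frequency."""
    c_now = total_words(current)
    c_hist = sum(total_words(tweets) for tweets in history)
    if c_now == 0 or c_hist == 0:
        return 0.0  # assumed convention for empty data
    # Assumed clamp: an unseen word counts as one past occurrence.
    cw_hist = max(sum(word_count(tweets, word) for tweets in history), 1)
    return word_count(current, word) / c_now * math.log(c_hist / cw_hist)


history = [[["goal"], ["rain"], ["sun"], ["wind"]]]  # one previous interval
short_tweets = [["goal"], ["rain"]]
long_tweet = [["goal"] * 5, ["rain"]]  # one tweet repeats the word five times

print(theta2(short_tweets, history, "goal"))
print(theta2(long_tweet, history, "goal"))
```

In both toy intervals one tweet out of two contains the word, so a tweet-count metric would score them identically; the word-count metric scores the second interval higher because the repeated occurrences raise the WF term from 1/2 to 5/6.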

Fig. 1. Operation of the proposed fuzzy representation.

3) Weighted Conditional Word Tweet Frequency–Inverse Trend Weighted Conditional Word Tweet Frequency: The third metric, $\vartheta_3(k, w)$, considers Twitter specific parameters, such as the number of followers and retweets. The number of followers indicates authors' credibility. The number of retweets is a metric for ranking the importance of the textual content. In particular, we denote by $f_m(k)$, $m = 1, \ldots, N(k)$, the number of followers for the mth tweet at time k, and as $r_m(k)$ the number of retweets. Then, $p^{f}_m(k) = f_m(k) / \sum_{m=1}^{N(k)} f_m(k)$ and $p^{r}_m(k) = r_m(k) / \sum_{m=1}^{N(k)} r_m(k)$ are their normalized values. Then

$$\vartheta_3(k, w) = \frac{\sum_{m=1}^{N(k)} p^{f}_m(k) \cdot p^{r}_m(k) \cdot i_m(w, k)}{\sum_{m=1}^{N(k)} p^{f}_m(k) \cdot p^{r}_m(k)} \times \log \frac{\sum_{j=1}^{p} \sum_{m=1}^{N(k-j)} p^{f}_m(k-j) \cdot p^{r}_m(k-j)}{\sum_{j=1}^{p} \sum_{m=1}^{N(k-j)} p^{f}_m(k-j) \cdot p^{r}_m(k-j) \cdot i_m(w, k-j)} \tag{5}$$

where $i_m(w, k)$ is an indicator function that equals one if the mth tweet contains the word w, and zero otherwise.
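The weighted metric of (5) can be sketched as follows. The `Tweet` record, the function names, the zero-weight convention for intervals without followers or retweets, and the small floor `eps` for words unseen in the history are all assumptions introduced for illustration; none of them comes from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Tweet:
    tokens: list
    followers: int
    retweets: int


def weights(tweets):
    """Product of normalized followers and retweets for each tweet."""
    f_sum = sum(t.followers for t in tweets)
    r_sum = sum(t.retweets for t in tweets)
    if f_sum == 0 or r_sum == 0:
        return [0.0] * len(tweets)  # assumed convention, not from the source
    return [(t.followers / f_sum) * (t.retweets / r_sum) for t in tweets]


def weighted_mass(tweets, word=None):
    """Sum of weights, restricted to tweets containing `word` if given."""
    return sum(w for w, t in zip(weights(tweets), tweets)
               if word is None or word in t.tokens)


def theta3(current, history, word, eps=1e-6):
    """Eq. (5); `eps` is an assumed floor for words unseen in the history."""
    now_all = weighted_mass(current)
    hist_all = sum(weighted_mass(tweets) for tweets in history)
    if now_all == 0 or hist_all == 0:
        return 0.0  # assumed convention for intervals without weight
    hist_word = max(sum(weighted_mass(tweets, word) for tweets in history), eps)
    return weighted_mass(current, word) / now_all * math.log(hist_all / hist_word)


history = [[Tweet(["goal"], 10, 1), Tweet(["rain"], 30, 3)]]
current = [Tweet(["goal"], 100, 9), Tweet(["sun"], 100, 1)]
print(round(theta3(current, history, "goal"), 3))
```

In the toy data, half of the current tweets contain the word, but the tweet that does contain it collected nine of the ten retweets, so the weighted first term is 0.9 instead of the unweighted 0.5.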

C. Fuzzy Tweet-Based Representation

We form a time series signal, denoted as $x_w(k)$, that contains the tweet-based information theoretic metrics of (3)–(5) over a time period of time intervals

$$x_w(k) = [\, \vartheta(k, w) \;\; \vartheta(k-1, w) \;\; \cdots \,]. \tag{6}$$

In (6), variable $\vartheta(k, w)$ refers to one of the three metrics defined in (3)–(5). Each element $\vartheta(k, w)$ of the time series signal $x_w(k)$ expresses the degree of importance of word w at the kth time interval and in a nonfuzzy representation is calculated independently of each other. However, in our proposed fuzzy representation, metric $\vartheta(k, w)$ for the kth interval is diffused over K previous intervals but with a different degree of membership for each interval

$$\vartheta_f(k, w) = \sum_{i=0}^{K-1} \vartheta(k-i, w) * \mu(k-i) \tag{7}$$

where subscript f denotes the fuzzy representation of the respective metric and $\mu(k-i)$ is the fuzzy membership degree for the (k−i)th time interval. The $\mu$ takes values in the range [0, 1]. Usually triangular functions are used to obtain values of $\mu$ but any other fuzzy function can be also adopted. Values of $\mu$ near unity (zero) indicate high (low) degree of membership of the metric. Other types of diffusion methods can also
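As a minimal sketch of the fuzzy diffusion in (7), the snippet below uses a one-sided triangular membership that equals 1 for the current interval and falls linearly with the lag. This particular shape, the function names, and the choice to skip intervals before the start of the series are assumptions for illustration; the source only states that triangular functions are commonly used.

```python
def triangular_membership(lag, K):
    """Assumed one-sided triangle: 1 at lag 0, falling linearly to 1/K."""
    return 1.0 - lag / K if 0 <= lag < K else 0.0


def fuzzy_metric(series, k, K):
    """Eq. (7): membership-weighted sum over the K most recent intervals.

    series[j] holds the crisp metric value for interval j; intervals
    before index 0 are skipped (an assumption of this sketch).
    """
    return sum(series[k - i] * triangular_membership(i, K)
               for i in range(K) if k - i >= 0)


crisp = [0.0, 0.0, 2.0, 0.0, 0.0]  # a single burst at interval 2
fuzzy = [fuzzy_metric(crisp, k, K=4) for k in range(len(crisp))]
print(fuzzy)  # [0.0, 0.0, 2.0, 1.5, 1.0]
```

The crisp signal drops to zero right after the burst, while the fuzzy signal fades gradually over the following intervals, so a word whose mentions arrive with some delay still carries weight.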
