3858 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 24, NO. 6, DECEMBER 2016
TABLE I
RETROFITTED SAMPLE SPAM FROM A TEMPLATE-BASED CAMPAIGN
an eye-catching action, and a URL. Each component has one
or more choices of textual content. The number of unique spam
messages that this template can potentially generate, therefore,
increases quickly with the number of components.
We formally model the true spam template as a macro
sequence $(m_1, m_2, \ldots, m_k)$. We define two types of macros:
dictionary macros and noise macros. At the time of spam
generation, a dictionary macro picks the textual content from a
pre-defined list of choices. It is possible for a dictionary macro
to have only one choice. In this case, the macro reduces to an
invariant substring that all generated messages will contain. In
comparison, we abstract any macro that does not convey any
semantic meaning, but purely increases the message diversity
or increases the chance of exposing the spam to more users,
as a noise macro. The concatenation of the macro
instantiations constitutes a spam message.
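The macro model above can be illustrated with a minimal sketch. The template contents, the use of `None` to stand for a noise macro, and the helper names `instantiate` and `count_dictionary_variants` are all hypothetical and chosen only for illustration; they are not taken from any observed campaign. The sketch shows how a message is formed by concatenating macro instantiations, and how the number of distinct messages grows as the product of the dictionary sizes.

```python
import random
from math import prod

# Hypothetical template: each dictionary macro is a list of choices,
# and None stands in for a noise macro (an assumption of this sketch).
TEMPLATE = [
    ["Wow", "OMG", "Hey"],        # eye-catching opener, 3 choices
    [" check out "],              # one choice: reduces to an invariant substring
    ["this deal", "my photos"],   # 2 choices
    [" http://example.com/x"],    # URL, 1 choice
    None,                         # noise macro
]

def instantiate(template, rng):
    """Build one message by concatenating one instantiation per macro."""
    parts = []
    for macro in template:
        if macro is None:
            # Noise carries no semantic meaning; here it is a random hashtag.
            letters = "abcdefghijklmnopqrstuvwxyz"
            parts.append(" #" + "".join(rng.choice(letters) for _ in range(5)))
        else:
            parts.append(rng.choice(macro))
    return "".join(parts)

def count_dictionary_variants(template):
    """Number of distinct messages, ignoring the variation added by noise macros."""
    return prod(len(macro) for macro in template if macro is not None)

rng = random.Random(0)
sample = instantiate(TEMPLATE, rng)
variants = count_dictionary_variants(TEMPLATE)
```

Adding one more dictionary macro with n choices multiplies `variants` by n, which is why the message space increases quickly with the number of components.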
We assume that a template shall contain at least one
dictionary macro, while it may or may not contain any
noise macro. However, we do not assume the existence of
any invariant substring. Written in human language, a spam
message is not restricted to any particular expression to present
a semantic meaning. We have also observed spam templates
without invariant substrings in our data. Template generation
work [10], [11] that relies on invariant substrings.
Paraphrase Category: Consists of spam tweets that share
the same semantic meaning but cannot be uniformly divided
into semantically equivalent segments. Meanwhile, the tweets
do not share regular wording. We denote them as “paraphrase”
spam.
No-Content Category: Does not contain any semantically
meaningful sentence. Tweets in this category contain only
one URL, followed by a long list of popular keywords and
hashtags. The spammers evidently rely on the keywords and
hashtags to increase the chance of exposing the URLs to users
who browse tweets by topic.
Other Spam: Consists of the remaining spam that we have
not systematically categorized.
C. Template-Based Spam Keeps Dominating
We further categorize the spam generated in January, 2012
and October, 2014 in the same way. Table II provides the
popularity of four spam categories in June/July, 2011,
January, 2012, October, 2014, and January, 2015.
Template-based spam remains the most popular
category in 2012, 2014, and 2015, with its percentage
increasing to 68.3%, 78.1%, and 76.4%, respectively.
TABLE II
THE POPULARITY OF FOUR SPAM CATEGORIES IN JUNE/JULY, 2011, JANUARY, 2012, OCTOBER, 2014, AND JANUARY, 2015, RESPECTIVELY
Fig. 1. Tangram framework: The template generation and matching overview.
The no-content category almost vanishes. Its percentage
dramatically drops to 0.3%, 0.2%, and 0.3%, respectively.
It is possible that the no-content category exhibits strong
patterns and can be easily blocked. The increasing popularity of
template-based spam indicates that our detection method, which
focuses on spam template generation, is well positioned to combat
modern OSN spam.
III. TANGRAM: TEMPLATE-BASED SPAM DETECTION SYSTEM
In this section, we present Tangram, an accurate and fast
template-based spam detection system. We first formulate
the notions of template, template matching and template
generation. Next, we detail the online Tangram system.
A. System Design Overview
Tangram builds template-based spam detection on top
of existing detection methods toward higher accuracy and
speed. It generates the underlying templates of spam detected
by various existing methods. It then uses the templates to
match and detect spam accurately and quickly. Figure 1 depicts the
Tangram workflow. It takes a stream of raw messages as input
and classifies them online as either spam or legitimate. After
the classification, spam is filtered, while legitimate messages
pass through. Two components can classify messages: the
template matching module and the auxiliary spam filter. The
template matching module, along with the template generation
technique, is our major contribution. The auxiliary spam filter,
on the other hand, supplies training spam messages. It can be
any deployed spam filter, e.g., a blacklist spam filter.
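The workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Tangram implementation: templates are assumed to be available as compiled regular expressions (the encoding formalized in the next paragraph), template matching is assumed to run before the auxiliary filter, and the names `classify_stream`, `blacklist_filter`, `TEMPLATES`, and `BLACKLIST` as well as the sample messages are hypothetical.

```python
import re

def classify_stream(messages, templates, auxiliary_filter, training_spam):
    """Label each message; spam caught only by the auxiliary filter is
    also collected as training input for template generation."""
    for msg in messages:
        if any(t.fullmatch(msg) for t in templates):
            yield msg, "spam:template"
        elif auxiliary_filter(msg):
            training_spam.append(msg)
            yield msg, "spam:auxiliary"
        else:
            yield msg, "legitimate"

# Hypothetical template: two dictionary macros, invariant text, trailing noise.
TEMPLATES = [
    re.compile(
        r"(?:Wow|OMG) check out (?:this deal|my photos) http://bad\.example/x.*",
        re.DOTALL,
    )
]

# Hypothetical blacklist standing in for "any deployed spam filter".
BLACKLIST = {"evil.example"}

def blacklist_filter(msg):
    return any(domain in msg for domain in BLACKLIST)

stream = [
    "Wow check out my photos http://bad.example/x #abc",
    "lunch at noon?",
    "free pills http://evil.example/p",
]
training_spam = []
labels = [label for _, label in
          classify_stream(stream, TEMPLATES, blacklist_filter, training_spam)]
```

In this sketch the third message is not covered by any template, so it is caught by the blacklist and lands in `training_spam`, where a template generation step could later consume it.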
Template Matching and Template Generation: We define a
template to be a sequence of macros of two types, dictionary
and noise (Section II-B). We represent a dictionary macro as
a set of values separated by “|” and a noise macro as “.*”.
Thus, templates produced by Tangram are naturally encoded as
regular expressions, specifically concatenations of “|” clauses
and “.*”s. Template matching matches a given message against
the corresponding regular expression. A successful template
match implies the tested message instantiates the template,
and should be flagged as spam. We define template generation