[Figure 2. Architecture of Hamsa Monitor. Components shown: Network Tap; Protocol Classifier (TCP 25, TCP 53, TCP 80, ..., TCP 137, UDP 1434); Known Worm Filter; Worm Flow Classifier; Normal Traffic Pool; Suspicious Traffic Pool; Normal traffic reservoir (real time, policy driven); Hamsa Signature Generator; Signatures.]
[Figure 3. Hamsa Signature Generator. Components shown: Suspicious Traffic Pool; Normal Traffic Pool; Token Extractor; Tokens; Filter; decision "Pool size too small?" (YES: Quit, NO: continue); Token Identification Core; Signature Refiner; Signature.]
and evaluate Hamsa in Section 7. Finally we compare with
related work in Section 8 and conclude in Section 9.
2 Problem Space and Hamsa System Design
2.1 Two Classes for Polymorphic Signatures
Signatures for polymorphic worms can be broadly classified
into two categories: content-based and behavior-based.
Content-based signatures aim to exploit the residual
similarity in the byte sequences of different instances of
polymorphic worms. There are two sources of such residual
similarity. One is that some byte patterns may be unavoidable
for the successful execution of the worm. The other is due to
the limitations of the polymorphism-inducing code. In
contrast, behavior-based signatures ignore the byte sequences
and instead focus on the actual dynamics of worm execution
to detect worms.
Hamsa focuses on content-based signatures. An advantage
of content-based signatures is that they allow us to treat
worms as strings of bytes and do not depend upon any
protocol or server information. They also have fast
signature matching algorithms [27] and can easily be
incorporated into firewalls or NIDSes. Next we discuss the
likelihood that different parts of a worm (ε, γ, π) [5]
contain invariant content.
• ε is the protocol frame part, which makes a vulnerable
server branch down the code path to the part where a
software vulnerability exists.
• γ represents the control data leading to control flow
hijacking, such as the value used to overwrite a jump
target or a function call.
• π represents the true worm payload, the executable
code for the worm.
The ε part cannot be freely manipulated by the attackers
because the worm needs it to lead the server to a specific
vulnerability. For Codered II, the worm samples must
contain the tokens “ida” or “idq”, and “%u”.
Therefore, ε is a prime source for invariant content.
Moreover, since most vulnerabilities are discovered in code
that is not frequently used [5], it is arguable that the
invariant in ε is usually sufficiently unique.
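To illustrate how invariant tokens in ε could be used for matching, the following is a minimal sketch, not Hamsa's signature generation algorithm. It reads the Codered II example as "(“ida” or “idq”) and “%u”"; that reading, the function name `matches_token_set`, and both sample payloads are assumptions made for illustration.

```python
def matches_token_set(flow: bytes, token_groups) -> bool:
    """Return True if, for every group, at least one token of that
    group occurs somewhere in the flow (order is ignored)."""
    return all(any(tok in flow for tok in group) for group in token_groups)


# Hypothetical encoding of the Codered II example from the text:
# ("ida" OR "idq") AND "%u".
CODERED_TOKENS = [(b"ida", b"idq"), (b"%u",)]

# Made-up payloads, used only to exercise the function.
worm_like = b"GET /default.ida?XXXX%u9090%u6858 HTTP/1.0\r\n"
benign = b"GET /index.html HTTP/1.0\r\n"

if __name__ == "__main__":
    print(matches_token_set(worm_like, CODERED_TOKENS))
    print(matches_token_set(benign, CODERED_TOKENS))
```

Because the check ignores token order and position, it stays valid even if the bytes surrounding the tokens change from one worm instance to the next.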
For the γ part, many buffer overflow vulnerabilities require
the return address to be hard-coded into the worm. This is a
32-bit integer of which at least the first 23 bits should
arguably be the same across all the worm samples. For
instance, register springs can potentially provide hundreds
of ways to make the return address different, but the use of
register springs increases the worm size, as the worm needs
to store all the different addresses. It also requires
considerable effort to look for all the feasible instructions
in the libc address space for register springing.
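The shared-prefix claim for return addresses can be checked on a set of candidate addresses with a few lines of code. This is a minimal sketch: the function name `common_prefix_bits` and the three addresses are hypothetical, chosen only so that they agree on their leading 23 bits.

```python
def common_prefix_bits(addresses, width=32):
    """Return how many leading bits all addresses have in common."""
    first = addresses[0]
    diff = 0
    for addr in addresses:
        diff |= first ^ addr  # accumulate every bit position that differs
    # The highest differing bit bounds the length of the shared prefix.
    return width - diff.bit_length()


# Hypothetical return addresses, not taken from any real worm.
samples = [0x7801CBD3, 0x7801CA10, 0x7801CB55]

if __name__ == "__main__":
    print(common_prefix_bits(samples))
```

Under this view, the shared high-order bits form a short invariant byte pattern, while the varying low-order bits are what register springs would have to diversify.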
For the π part, a perfect worm using sophisticated
encryption/decryption (SED) may not contain any invariant
content. However, it is not trivial to implement such
perfect worms.
As mentioned in [5], it is possible to have a perfect worm
that leverages a vulnerability using advanced register
springs and SED techniques and contains no invariant
content. Such a worm can evade not only our system
but any content-based system. In practice, however, such
worms are not very likely to occur.
2.2 Hamsa System Design
Figure 2 depicts the architecture of Hamsa, which is
similar to the basic frameworks of Autograph [10] and
Polygraph [16]. We first need to sniff the traffic from the
network, assemble the packets into flows, and classify the
flows based on protocol (TCP/UDP/ICMP) and port number.
Then for each protocol and port pair, we need to filter out
the known worm samples and then separate the flows into
the suspicious pool (M) and the normal traffic reservoir us-
ing a worm flow classifier. Then based on a normal traf-