An Automatic Signature-Based Approach for Polymorphic Worms in Big Data
Environment
Fangwei Wang
Lab of Network and
Information Security of
Hebei Province
Hebei Normal University
Shijiazhuang, China
fw_wang@hebtu.edu.cn
Shaojie Yang
Lab of Network and
Information Security of
Hebei Province
Hebei Normal University
Shijiazhuang, China
1657392397@qq.com
Dongmei Zhao
Lab of Network and
Information Security of
Hebei Province
Hebei Normal University
Shijiazhuang, China
dmzhao@hebtu.edu.cn
Changguang Wang
†
Lab of Network and
Information Security of
Hebei Province
Hebei Normal University
Shijiazhuang, China
wangcg@hebtu.edu.cn
Abstract—In a big data environment, the signatures of
polymorphic worms must be extracted accurately and
efficiently in order to defend against them. At present,
however, it is difficult to generate accurate signatures for
polymorphic worms, especially in the presence of noise.
To solve this issue, we propose an automatic
signature extraction algorithm for polymorphic worms based
on an improved Term Frequency-Inverse Document
Frequency (TF-IDF). First, each sample of the dataset is
divided into several documents. One document is selected
at random and its first worm sample is analyzed. Suspicious
substrings are then selected by traversing the document and
calculating their TF values. Second, all the documents
are traversed and the IDF value is computed. Finally, the
TF-IDF value is determined and an accurate worm signature is
generated. This algorithm is tested on various kinds of worms
and compared with existing methods. The results show that,
in the presence of noise, our algorithm generates polymorphic
worm signatures more accurately and efficiently than similar
methods. It can also save the state of
worm signature extraction and has excellent scalability.
Keywords-Polymorphic worm; Signature extraction; TF-
IDF; Worm detection
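To make the three steps summarized in the abstract concrete, the following minimal Python sketch computes textbook TF and IDF values over fixed-length byte substrings. It is an illustration under stated assumptions, not the algorithm proposed in this paper: the improved weighting is not specified in this section, so the standard definitions (TF as relative count, IDF as log(N/df)) are used instead. The substring length n, all function names and the toy payloads are hypothetical.

```python
from __future__ import annotations

import math
from collections import Counter


def ngrams(data: bytes, n: int) -> list[bytes]:
    """Return every length-n substring of data, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def term_frequency(document: list[bytes], n: int) -> dict[bytes, float]:
    """TF of each n-gram: its count divided by all n-grams in the document."""
    counts = Counter(g for sample in document for g in ngrams(sample, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def inverse_document_frequency(documents: list[list[bytes]], n: int) -> dict[bytes, float]:
    """Textbook IDF: log(N / df), where df is the number of documents containing the n-gram."""
    df = Counter()
    for document in documents:
        # A set ensures each document is counted at most once per n-gram.
        df.update({g for sample in document for g in ngrams(sample, n)})
    return {g: math.log(len(documents) / d) for g, d in df.items()}


def tf_idf(documents: list[list[bytes]], doc_index: int, n: int) -> dict[bytes, float]:
    """Score the n-grams of one chosen document against the whole collection."""
    tf = term_frequency(documents[doc_index], n)
    idf = inverse_document_frequency(documents, n)
    return {g: tf[g] * idf[g] for g in tf}


# Toy data (assumption): each document is a list of payload samples.
documents = [
    [b"GET /aXYZZYb", b"GET /cXYZZYd"],
    [b"GET /eXYZZYf"],
    [b"GET /index.html"],
]
scores = tf_idf(documents, doc_index=0, n=5)
```

Under the textbook IDF, a substring that occurs in every document (here `b"GET /"`) receives a score of zero, whereas a substring confined to a subset of the documents (here `b"XYZZY"`) receives a positive score.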
I. INTRODUCTION
With the globalization of the Internet and the arrival of
the big data era, network worms have become one of the most
serious threats to network security and data security and
have caused heavy losses. Their propagation has evolved from
human-machine interaction that relied on hardware devices to
automatic replication and propagation over the global
network, operating systems and application software
[1-3]. A polymorphic worm is a worm that changes its
appearance with each infection by means of variation,
encryption and semantics-preserving transformations. Its
signatures exhibit composability and are difficult to describe
with a traditional single signature, which greatly challenges
traditional methods of worm detection and defense.
Therefore, detecting polymorphic worms rapidly and
generating their signatures quickly has become a major
research subject.
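As a small illustration of why a single fixed signature struggles with such worms, the sketch below builds two variants of one payload by XOR-encoding the body with a different key per infection. The payload layout (a few unchanged bytes, a one-byte key, then the encoded body), the byte values and the function names are assumptions made for this example only; real worms use far more elaborate transformations.

```python
def xor_encode(body: bytes, key: int) -> bytes:
    """XOR every byte of body with a one-byte key; applying it twice restores body."""
    return bytes(b ^ key for b in body)


def make_variant(invariant: bytes, body: bytes, key: int) -> bytes:
    """Assumed layout: invariant bytes, then the one-byte key, then the encoded body."""
    return invariant + bytes([key]) + xor_encode(body, key)


# Hypothetical values chosen only for illustration.
INVARIANT = b"\xde\xad\xbe\xef"
BODY = b"propagate-and-infect"

variant_a = make_variant(INVARIANT, BODY, 0x21)
variant_b = make_variant(INVARIANT, BODY, 0x5A)

# The encoded bodies differ byte for byte, yet both variants share INVARIANT.
encoded_a = variant_a[len(INVARIANT) + 1:]
encoded_b = variant_b[len(INVARIANT) + 1:]
```

A signature taken from the encoded body of one variant does not match the other variant, while the short unchanged part is the kind of content a signature generator can still rely on.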
The main method suitable for detecting polymorphic
worms is to extract attack signatures by analyzing
suspicious traffic, which requires neither host information,
the source code of vulnerabilities, nor binary code. It
builds on existing signature extraction techniques and
improves them by incorporating the signatures specific to
polymorphic worms. It can detect not only known worms
but also new samples of polymorphic worms, and does so
more accurately.
The idea of extracting worm attack signatures
automatically was first put forward in Honeycomb [4].
Although it introduced automatic signature extraction, it
could not collect enough data to analyze a worm and extract
its signature, because few hosts are infected at the
beginning of worm propagation. Thus, this work did not fully
realize the advantages of automatic extraction. The Autograph
system [5] could generate worm signatures according to the
content length based on single string matching. This system
provides a useful reference for extracting worm signatures,
but the signatures it generates are of too simple a kind to
detect more sophisticated polymorphic worms. Newsome et al.
[6] first proposed a system for detecting polymorphic worms,
Polygraph, which used sets of substrings to generate three
types of signatures (conjunction signatures, token-subsequence
signatures, and Bayes signatures) by extracting invariants
that satisfy the required conditions from the suspicious
flow. However, the signatures generated by the system
performed poorly and had a high false alarm rate in the
presence of noise. The system was also ineffective against
polymorphic worms that adopt instruction substitution, NOP
insertion, and instruction transformation, and it had
difficulty extracting signatures rapidly.
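The difference between the first two Polygraph signature types can be sketched in a few lines of Python. This is a simplified illustration of the matching rules only, not Polygraph's implementation: a conjunction signature matches when all of its tokens occur in the payload in any order, whereas a token-subsequence signature additionally requires them to occur in the given order. The function names, tokens and payloads are hypothetical, and the Bayes signature, which assigns a score to each token, is omitted.

```python
from __future__ import annotations


def matches_conjunction(payload: bytes, tokens: list[bytes]) -> bool:
    """True if every token occurs somewhere in the payload, in any order."""
    return all(token in payload for token in tokens)


def matches_token_subsequence(payload: bytes, tokens: list[bytes]) -> bool:
    """True if the tokens occur in the payload in the given order, without overlap."""
    position = 0
    for token in tokens:
        found = payload.find(token, position)
        if found == -1:
            return False
        # Continue searching after the end of the token just matched.
        position = found + len(token)
    return True


# Hypothetical tokens and payloads, for illustration only.
tokens = [b"GET ", b"HTTP/1.1", b"\xff\xbf"]
ordered = b"GET /x HTTP/1.1\r\nHost: \xff\xbf"
shuffled = b"\xff\xbf GET /x HTTP/1.1"
```

Here `ordered` satisfies both rules, while `shuffled` satisfies only the conjunction rule, which shows that the token-subsequence signature is the stricter of the two.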
Wang et al. [7] proposed a network-based method to
generate signatures for polymorphic worms, which could
generate length-based signatures for buffer overflow
vulnerabilities. Stephenson et al. [8] proposed a quasi-species
model to describe the propagation of polymorphic
worms and derived the maximum time allowed for
preventing network worms. Sun et al. [9] proposed the
RSWD (Rough Set Worm Detection) algorithm to detect
polymorphic worms based on rough set theory. Iwahashi et
al. [10] suggested using Petri nets to generate a worm
signature automatically. Tang et al. [11-12] applied the gene
sequence alignment method from bioinformatics to generate
2019 International Conference on Networking and Network Applications (NaNA)
978-1-7281-2629-6/19/$31.00 ©2019 IEEE
DOI 10.1109/NaNA.2019.00047