Analyzing Malware by Abstracting the Frequent Itemsets in API call Sequences
Yong Qiao, Jie He
School of Computer
National University of Defense Technology
Changsha, China
qiaoyong10@nudt.edu.cn
Yuexiang Yang, Lin Ji
Information Center
National University of Defense Technology
Changsha, China
yyx@nudt.edu.cn
Abstract—Analyzing the usage of Windows Application
Program Interface (API) is a common way to understand
behaviors of Malicious Software (malware) in either static
analysis or dynamic analysis methods. In this work, we focus
on the usage of frequent messages in API call sequences, and
we hypothesize that frequent itemsets composed of API names
and/or API arguments could be valuable in the identification
of the behavior of malware. For verification, we introduced
clustering processes of malware binaries based on their
frequent itemsets of API call sequences, and we evaluated the
performance of malware clustering. Specific implementation
processes for malware clustering, including API calls
abstraction, frequent itemsets mining and similarity
calculation, are illustrated. The experiment upon a big
malware dataset demonstrated that merely using the frequent
messages of API call sequences can achieve a high precision
for malware clustering while significantly reducing the
computation time. This also proves the importance of frequent
itemsets in API call sequences for identifying the behavior of
malware.
Keywords-malware; frequent-itemsets; API call sequences;
clustering; Sandbox
I. INTRODUCTION
Malicious software, or malware for short, ranging from
classic computer viruses to Internet worms, bots and Trojan
horses, have brought countless hazards to nowadays
Internet, especially to the most prevalent desktop operating
system – Microsoft Windows. Various methods have been
proposed for malware analysis. One of the most common
methods is to analyze the behavior of malware by
abstracting the API calls. This is because most applications
running in user mode need to call Windows API functions
for requesting services from the kernel in Windows.
Therefore, the sequence of API calls is capable to represent
the main behavior of a run-time application in such period.
Moreover, we can further analyze the function parameters
and track the information flows from the API call sequences
[1].
As early as 1998, Hofmeyr et al.[2] studied the intrusion
detection in UNIX systems using sequences of system calls.
Similarly, Shankarapani et al.[3] proposed two frameworks,
called SAVE (Static Analyzer for Vicious Executables) and
MEDic (Malware Examiner using Disassembled Code), to
detect malicious codes using the API call sequences or
static API call set. However, the methods mentioned above
those abstract API calls of malware by static codes
disassemblers, like Win32Dasm, OllyDbg, et al., are
challenged by the continuously updated obfuscation and
packing techniques, which makes it difficult for the static
methods to get correct API calls. Therefore, novel methods
to dynamically abstract API calls have been proposed
recently [4-9]. In contrast to static techniques, dynamic
analysis methods monitor the behavior of malware during
run-time, which is often indicative for malicious activity
and hard to conceal.
Although abstracting the API calls during run-time
provides means for studying the behavior of malware, it is
not sufficient to detect malware. The ability to analyze the
API calls automatically and deeply is also required. To the
best of our knowledge, two teams have made good progress
in terms of analysis for the sequences of API calls. The first
framework is proposed by Rieck et al. [5], which embeds
the API call sequences to high-dimensional vector spaces,
and allows for malware clustering and classification with
high precision. The second one is introduced by Ahmed et
al. [8], which is able to detect malware using statistical
features that are extracted from both spatial (arguments) and
temporal (sequences) information from API calls. Both
frameworks use full sequences of API calls as input
parameters for similarity calculation (the similarity between
one binary to another binary or one binary to a model).
However, not all information from API calls is valuable for
calculating the similarity between different malicious
binaries. On the contrary, in many cases, redundant
messages can reduce the similarity even two binaries
originally belonging to the same category. Moreover,
calculating upon whole sequences is time-consuming,
especially when dealing with a large number of malware
binaries.
We consider that the frequent messages of API call
sequences, like frequent arguments or frequent itemsets
composed of API names and API arguments, play an
important role in reflecting the behavior of malware. In this
paper, we ran clustering processes of malware binaries
based on their frequent itemsets of API call sequences, and
we evaluated the performance of malware clustering. We
then explained the specific processes of this work with a
formal description of the frequent API itemsets mining
method. At last, we utilized a big malware dataset including
3131 malware binaries for executing clustering based on the
frequent messages of API call sequences, the results of