IET Information Security
Research Article
Back-propagation neural network on Markov
chains from system call sequences: a new
approach for detecting Android malware with
system call sequences
ISSN 1751-8709
Received on 10th December 2014
Revised 29th November 2015
Accepted on 28th December 2015
E-First on 15th March 2016
doi: 10.1049/iet-ifs.2015.0211
www.ietdl.org
Xi Xiao
1
, Zhenlong Wang
1
, Qing Li
1
, Shutao Xia
1
, Yong Jiang
1
1
Graduate School at Shenzhen, Tsinghua University, 518055 Shenzhen, People's Republic of China
E-mail: li.qing@sz.tsinghua.edu.cn
Abstract: Android has become the most prevalent mobile system, but in the meanwhile malware on this platform is widespread.
System call sequences are studied to detect malware. However, malware detection with these approaches relies on common
system-call-subsequences. It is not so efficient because it is difficult to decide the appropriate length of the common
subsequences. To address this issue, the authors propose a new approach, back-propagation neural network on Markov chains
from system call sequences (BMSCS). It treats one system call sequence as a homogeneous stationary Markov chain and
applies back-propagation neural network (BPNN) to detect malware by comparing transition probabilities in the chain. Since
transition probabilities from one system call to another in malware are significantly different from those in benign applications,
BMSCS can efficiently detect malware by capturing the anomaly in state transitions with the help of BPNN. The authors
evaluate the performance of BMSCS by experiments with real application samples. The experiment results show that the F-
score of BMSCS achieves up to 0.982773, which is higher than the other methods in the literature.
1 Introduction
Computer security has always been a serious problem. With mobile
terminals becoming more and more prevalent, mobile security is
increasingly prominent [1–4]. Due to the growing popularity and
openness, Android has attracted the most consideration of
malicious elements and a hacker can easily write malicious code
and spread it. Malware aiming specifically at Android devices has
increased at an alarming rate [5]. Furthermore, Android has unique
properties and specific limitations due to its mobile nature. This
makes it more difficult to detect malware with conventional
techniques. Therefore, it is rather important to develop a new and
efficient approach to detecting Android malware.
Researchers have explored two types of methods to detect
Android malware. The first type is static analysis, which aims to
recognise signatures of the malicious applications without actually
executing them [6–9]. Many binary forensic techniques can be
used in static analysis, including de-compilation, decryption,
pattern matching and so on. Yet these methods cannot detect
unknown malware as any application can have distinct signatures
by means of encryption and obfuscation [10]. Therefore, the
second type of methods, dynamic analysis, is proposed [11–17].
These approaches can monitor application's behaviours such as
network access, phone calling and message sending at run time.
The dynamic behaviours of an application are conducted by
system call sequences at the end. Therefore, researchers can
leverage system call sequences in the dynamic analysis [11–14].
The previous mechanism that uses system call sequences to detect
malicious applications usually consists of the following steps: first
generating common subsequences of system call sequences of
malware, second filtrating the common subsequences appearing in
system call sequences of benign applications. If the left common
subsequences exist in an application's system call sequence, the
application is identified as malware. Nevertheless, these methods
are inefficient and cannot achieve a desirable detection rate. The
critical limiting factor is the length of the common subsequence.
When the common subsequence is too short, the information used
to describe the action of an application is insufficient. However, the
action is the key character in identifying malicious applications.
When the common subsequence is too long, for instance, longer
than 45 system calls, it tends to be over fitting [14, 18].
Furthermore, it usually takes too much time to obtain the common
system-call-subsequences. The longer the common subsequence is,
the more time it takes (even weeks) [18].
To overcome the above shortcoming, in this paper we put
forward a new approach for Android malware detection, back-
propagation neural network on Markov chains from system call
sequences (BMSCS). The Markov chain has been employed in the
field of network security [19, 20] and the Markov logic network
has been adopted in Android malware detection [21]. Inspired by
Xiao et al. [22], where they applied homogeneous stationary
Markov chains to masquerade detection, we introduce this model
in mobile malware detection. Based on the fact that there are some
specific correlations between the adjacent system calls (e.g. first
memory access, second screen display, then user input
requirement), we treat the system call sequence activated by one
application as one Markov chain. To get low time complexity, we
only take two state dependency into consideration. Each distinct
system call corresponds to one unique state in the chain. There are
196 system calls in Android 4.0.4, thus the state number is 196. In
[19, 20], the numbers of states in these Markov chains are
relatively small. However, there are 196 states in our method.
Hence, it cannot solve the problem only by directly analysing each
element in the matrices. Some classifiers are required to help
further process these matrices.
In our scheme, we first calculate the transition probability
matrices by statistical methods and then convert them into vectors
of 196 × 196 dimensions. Our key assumption is that the
probabilities of transition from one system call to another are
significantly different between malicious applications and benign
ones. According to this assumption, the above vectors are fed to the
classifier, artificial neural network (ANN), to discriminate malware
from benign applications on Android. The classification process
consists of the training phase and the detection phase. During the
training phase, the ANNs (neural networks, in abbreviation) are
trained by back-propagation algorithm, which are called as back-
propagation neural networks (BPNNs). Finally, we do the
experiments on the malware from [23] and the benign applications
downloaded from Google. The results indicate that the F-score of
our method achieves up to 0.982773, higher than those of [8, 9,
14].
IET Inf. Secur., 2017, Vol. 11 Iss. 1, pp. 8-15
© The Institution of Engineering and Technology 2016
8