Unveiling Inheritance of Android Malware APKs
based on Code Hierarchical API Usage
Ling Yang
School of Computer Science & Engineering
Nanjing University of Science & Technology
Nanjing, China
yangling_2014@njust.edu.cn
Songjie Wei
School of Computer Science & Engineering
Nanjing University of Science & Technology
Nanjing, China
swei@njust.edu.cn
Abstract - The threat of Android malware is rising quickly with
the prevalence of smart mobile devices. Existing static analysis
techniques for Android APKs focus more on package
configuration than on the code itself. Android applications call
system APIs to implement critical functionality and to carry out
software behaviors. Android malware samples of the same type or
code family tend to share the same API usage patterns and call
structures. This work proposes to characterize each APK by
its API usage at different hierarchical layers of
code, namely packages, classes, and functions. A tree structure is
used to describe and present this API-usage information. By
comparing such tree structures, APK similarities can be
computed, and algorithmic clustering of APKs to reveal
malware type and family categories becomes possible. Preliminary
experiments using real Android malware packages validate the
proposed APK characterization and clustering approaches, and
the results show promising effectiveness and precision.
Keywords - Android; Malware; Clustering; Static Analysis
I. INTRODUCTION
Over the past two decades, we have witnessed the
revolutionary advance of Internet technology and the rapid
spread of smart mobile devices running various mobile
applications (APPs) for Internet services. While people enjoy
the flexibility and convenience of mobile APPs,
malicious ones raise new concerns about information
and system security. According to a public NQ Mobile report
on mobile security [1], in the first six months of 2013 alone,
51,000 new threats were identified on mobile platforms,
infecting an estimated 21 million mobile devices. More
serious than the absolute numbers are the trends: more new mobile
malware APPs were spread in the first half of 2013 than in
all of 2012. Android, as the most widely installed and used
mobile platform, suffers the majority of newly discovered
malicious APPs, or mobile malware. It is straightforward for
hackers with even limited software engineering and
programming skills to decompile an Android application package
(APK), copy its functionality, insert a piece of malicious code,
repackage the code, and publish it elsewhere. Many famous
Android APPs, such as Angry Birds and Fishing Joy, have been
exploited as hosts and distribution media for mobile malware such
as trojans, backdoors, adware, and SMS abusers. Repackaged
APKs can be grouped by type based on their malicious behavior, or
by family based on code origin and inheritance. How to
identify such repackaged APKs, locate their code changes, and
trace their inheritance is a critical problem in
computer software and security research. Traditional static code
analysis techniques and characterization by code digests
such as MD5 values lose their effectiveness when applied to the
Android platform, where application code can be easily
decompiled, modified, and obfuscated.
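The fragility of digest-based characterization can be illustrated with a minimal sketch. The byte strings below are hypothetical stand-ins for an original and a repackaged code body, not real APK contents; the point is only that any inserted instruction yields an unrelated MD5 value, so digest equality says nothing about code inheritance.

```python
import hashlib

# Hypothetical stand-ins for code bodies; not taken from a real APK.
original_code = b"invoke-virtual {v0}, Landroid/app/Activity;->finish()V\n"
repackaged_code = original_code + b"nop\n"  # a single inserted instruction

digest_original = hashlib.md5(original_code).hexdigest()
digest_repackaged = hashlib.md5(repackaged_code).hexdigest()

# The two digests differ even though almost all of the code is shared.
print(digest_original)
print(digest_repackaged)
print("digests match:", digest_original == digest_repackaged)
```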
Android applications have to call APIs for system-level
functionality to visit sensitive data, access system resources,
obtain device state, and control device hardware. The API calls
and their sequence reflect the corresponding software operations
and behaviors. By analyzing the use of Android system APIs,
we can infer an application's behaviors and risks during
code execution. In this paper, we propose a static analysis
method based on API calls within code structure, context, and
execution. The organization of code into classes and functions is used
to record and track system API usage, which is further used
to describe and characterize the APK's code identity. For each
APK, we build a three-layer description tree using the
API calls within the package, within object classes, and within
class functions.
Code similarity is computed precisely on each layer and
propagated back to the upper layer to aggregate a summary
view. This ultimately achieves a precise description and inference
of the APK's behavioral characteristics, which can be used to
compare and judge whether any two APKs have the same
functionality (type) and code base (code family).
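The layered comparison described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method defined in this paper: the nested-dictionary tree layout, the Jaccard measure on function-level API sets, and the best-match averaging used to propagate scores upward are all assumptions chosen for brevity, and every class, function, and API name in the sample data is hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, TypeVar

# Assumed layout: APK (package layer) -> class name -> function name -> API set.
FunctionApis = FrozenSet[str]
ClassNode = Dict[str, FunctionApis]
ApkTree = Dict[str, ClassNode]
T = TypeVar("T")


def jaccard(a: FunctionApis, b: FunctionApis) -> float:
    """Function layer: Jaccard similarity of two API sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match_average(left: List[T], right: List[T],
                       score: Callable[[T, T], float]) -> float:
    """Average of each node's best score against the other side, both ways."""
    if not left and not right:
        return 1.0
    if not left or not right:
        return 0.0
    forward = [max(score(x, y) for y in right) for x in left]
    backward = [max(score(x, y) for x in left) for y in right]
    return (sum(forward) + sum(backward)) / (len(forward) + len(backward))


def class_similarity(c1: ClassNode, c2: ClassNode) -> float:
    """Class layer: aggregate function-level scores upward."""
    return best_match_average(list(c1.values()), list(c2.values()), jaccard)


def apk_similarity(t1: ApkTree, t2: ApkTree) -> float:
    """Package layer: aggregate class-level scores upward."""
    return best_match_average(list(t1.values()), list(t2.values()),
                              class_similarity)


# Hypothetical sample: a benign APP and a repackaged copy with an added class.
original: ApkTree = {
    "com.game.Main": {
        "onCreate": frozenset({"Activity.setContentView"}),
        "loadScore": frozenset({"SharedPreferences.getInt"}),
    },
}
repackaged: ApkTree = {
    "com.game.Main": original["com.game.Main"],
    "com.evil.Payload": {
        "send": frozenset({"SmsManager.sendTextMessage",
                           "TelephonyManager.getDeviceId"}),
    },
}

print(apk_similarity(original, original))
print(apk_similarity(original, repackaged))
```

In this sketch, nodes are matched by their API usage rather than by name, so renaming classes or functions does not change the score. The repackaged sample scores strictly between 0 and 1 against the original because its inherited class matches exactly while the injected class matches nothing.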
Computer and network security professionals have invested
intensive effort in Android malware detection and
characterization. Their approaches can be roughly classified
into two categories: static and dynamic methods. Static analysis
involves examination of the Android APK manifest file for
permission declarations, code segmentation and hashing,
analysis of data flow and dependencies, and analysis of object and
function call relations and nesting. MD5 hashing to extract
known malware features, as applied in PC virus scanning, was
also the first technique widely used for Android malware detection [2].
Since Android API permission usage is explicitly declared in the
manifest file, and each permission carries the risk of the
API execution it enables, researchers have also tried to evaluate the
risk factor of each individual permission request and accumulate the
total risk of running the application [3]. More sophisticated methods
connect permission declarations with API call instances and
statistically analyze their distribution to summarize and
derive risk factors [4][5]. On the dynamic analysis side,
people normally surveil from mobile applications Internet
This work was supported in part by the China NSF grant 61472189, and the NUST Purple Star Scholarship for Young Researchers.