Contaminant Removal for Android Malware Detection Systems
Lichao Sun∗, Xiaokai Wei†, Jiawei Zhang‡, Lifang He§, Philip S. Yu∗ and Witawas Srisa-an¶
∗University of Illinois at Chicago, Chicago, IL
†Facebook, Menlo Park, CA
‡IFM Lab, Florida State University, FL
§Cornell University, New York City, NY
¶University of Nebraska - Lincoln, Lincoln, NE
Email: {lsun29, xwei2, psyu}@uic.edu, {jwzhanggy, lifanghescut}@gmail.com, witty@cse.unl.edu
Abstract—A recent report indicates that there is a new
malicious app introduced every 4 seconds. This rapid malware
distribution rate causes existing malware detection systems
to fall far behind, allowing malicious apps to escape vetting
efforts and be distributed by even legitimate app stores. When
trusted downloading sites distribute malware, several negative
consequences ensue. First, the popularity of these sites would
allow such malicious apps to quickly and widely infect devices.
Second, analysts and researchers who rely on machine learning
based detection techniques may also download these apps and
mistakenly label them as benign since they have not been
disclosed as malware. These apps are then used as part of
their benign dataset during model training and testing. The
presence of such contaminants in the benign dataset can compromise
the effectiveness and accuracy of their detection and classification
techniques.
To address this issue, we introduce PUDROID (Positive
and Unlabeled learning-based malware detection for Android)
to automatically and effectively remove contaminants from
training datasets, allowing machine learning based malware
classifiers and detectors to be more effective and accurate.
To further improve the performance of such detectors, we
apply a feature selection strategy to select pertinent features
from a variety of features. We then compare the detection
rates and accuracy of detection systems using two datasets:
one with contaminants removed by PUDROID and the other
with contaminants left in place. The results indicate that once
we remove contaminants from the datasets, we can significantly
improve both malware detection rate and detection accuracy.
Keywords-Mobile Security; Malware Detection; Noise Detec-
tion; Android Malware; PU Learning;
I. INTRODUCTION
Android is currently the most used smart-mobile device
platform in the world, holding 87.6% of the market share,
with over 1.4 billion Android devices in deployment [1].
Unfortunately, the popularity of Android also makes it a pop-
ular target for cyber-criminals to create malicious apps that
can steal sensitive information and compromise systems [2].
During the first three months of 2016, Kaspersky Lab
uncovered over 2 million malware samples including trojans,
worms, exploits, and viruses. On average, a malicious app
is introduced every 3.79 seconds [3]. Some types of
malicious apps have more than 50 variants, making detecting
all of them very challenging [4].
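The two figures quoted above can be cross-checked with simple arithmetic. The sketch below assumes that "the first three months of 2016" means January 1 through March 31, 2016 (an assumption, not stated in [3]), and computes how many samples a rate of one every 3.79 seconds would imply over that period.

```python
from datetime import date

# Assumption: the reporting period is Jan 1 - Mar 31, 2016 (2016 is a leap year).
days = (date(2016, 4, 1) - date(2016, 1, 1)).days
seconds = days * 24 * 60 * 60

# Number of samples implied by one new malicious app every 3.79 seconds.
implied_samples = seconds / 3.79

print(f"days in period:  {days}")
print(f"seconds:         {seconds}")
print(f"implied samples: {implied_samples:,.0f}")
```

Under this assumption the implied count is slightly above 2 million, which agrees with the "over 2 million malware samples" figure.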
Figure 1. Left panel (dataset without malicious contaminants):
machine learning can classify malware and benign apps well.
Right panel (dataset with malicious contaminants): machine
learning cannot work well for malware detection. Legend:
malware, benign app, malicious contaminants, hyperplane.
There have been several approaches to detect these malicious
Android apps. Most approaches focus on the attack behaviors,
and use static or dynamic analysis to build detection
tools that rely on approaches known to work well for desktop
environments [5]. However, static analysis approaches in
general can produce a large number of false positives while
dynamic analysis approaches need adequate input suites to
sufficiently exercise execution paths. Therefore, neither of
them will work well for Android malicious app detection.
Another emerging approach is to build detection techniques
based on data mining and machine learning techniques [6],
[7], [8].
For example, DREBIN [6] utilizes multi-view features
by combining static analysis and supervised learning to
accurately detect malware. SIGPID [7] improves upon
DREBIN [6] by using many more features for training and
detection. DROIDCLASSIFIER [8] uses traffic flow information
and unsupervised learning to detect malware and
classify the family of each malicious app.
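Supervised detectors of this kind learn a decision boundary from labeled apps, so mislabeled training samples can shift that boundary, as Figure 1 depicts. The following is a minimal sketch of this effect on synthetic data, not the method of any system cited here. It assumes two-dimensional Gaussian features, a simple nearest-centroid classifier (whose boundary is a hyperplane), and a 30% contamination rate; the centers, sample sizes, and function names are all hypothetical choices made for illustration.

```python
import random

def sample(center, n, rng):
    """Draw n 2-D points from a unit-variance Gaussian around `center`."""
    return [(rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)) for _ in range(n)]

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def detection_rate(benign_train, malware_train, malware_test):
    """Fraction of test malware that lies closer to the malware centroid."""
    cb, cm = centroid(benign_train), centroid(malware_train)
    hits = sum(1 for p in malware_test if dist2(p, cm) < dist2(p, cb))
    return hits / len(malware_test)

rng = random.Random(0)  # fixed seed so the run is repeatable
MALWARE, BENIGN = (1.0, 1.0), (-1.0, -1.0)  # hypothetical feature-space centers

malware_train = sample(MALWARE, 1000, rng)
benign_clean = sample(BENIGN, 1000, rng)
# Contaminated "benign" set: 30% is undisclosed malware labeled as benign.
benign_dirty = sample(BENIGN, 700, rng) + sample(MALWARE, 300, rng)
malware_test = sample(MALWARE, 2000, rng)

rate_clean = detection_rate(benign_clean, malware_train, malware_test)
rate_dirty = detection_rate(benign_dirty, malware_train, malware_test)
print(f"detection rate, clean benign set:        {rate_clean:.3f}")
print(f"detection rate, contaminated benign set: {rate_dirty:.3f}")
```

In this toy setting the contaminants pull the benign centroid toward the malware cluster, which moves the hyperplane into the malware region and lowers the detection rate. The size of the drop depends entirely on the assumed separation and contamination rate.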
When machine learning techniques are used to help with
malware detection, the detection effectiveness and accu-
racy are highly dependent on the quality of the training
datasets. To create such a dataset, researchers typically label
a set of malicious apps and a set of benign apps. To
build the malicious dataset, researchers manually label these
malicious apps one by one based on known information
from various malware analysis and collection sources (e.g.,
virusshare.com). To build the benign dataset, researchers
download apps from trusted distribution sources such as
the Play Store and verify that those apps have not been recently
disclosed as malware. However, as previously mentioned,
arXiv:1711.02715v2 [cs.CR] 14 Nov 2017