4662 Xu et al.: Feature Selection to Mine Joint Features from High-dimension Space for Android Malware Detection
DroidMat. Lastly, kNN was used to classify the applications as benign or malicious.
DroidMat achieved a better recall rate and high efficiency. Grace et al. [26]
developed RiskRanker to analyze whether a particular application exhibits malicious
behaviors based on control flow and data flow. It was helpful for analyzing encrypted native
code and unsafe Dalvik code. The results showed that RiskRanker had high efficacy and
scalability in detecting zero-day malware. Yang et al. [27] proposed AppContext, which constructs
a self-defined call graph based on the context of security-sensitive behaviors.
AppContext then detected malware by classifying the security-sensitive behaviors according to the
extracted contexts.
Permissions and APIs have proved to be useful features for detecting malware, so we also extract
them from the decompiled files. Moreover, the extracted features include not only the
Android permissions and Android SDK APIs but also user-defined permissions and
third-party APIs. In previous research, user-defined permissions and third-party classes have
not been analyzed for malware detection. This is the first difference between our work and other papers.
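As an illustration of the permission features discussed above, the following minimal sketch reads requested and declared permissions from a decoded AndroidManifest.xml. It is not the extraction tool used in this work. The function name extract_permissions, the sample manifest, and the rule that treats any requested name outside the android.permission. prefix as user-defined are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by Android for attributes such as android:name.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"

# Hypothetical decoded manifest, for illustration only.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.app">
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-permission android:name="com.example.app.permission.PUSH"/>
  <permission android:name="com.example.app.permission.PUSH"/>
</manifest>"""


def extract_permissions(manifest_xml):
    """Return (requested, declared) permission names, each sorted and unique."""
    root = ET.fromstring(manifest_xml)
    # <uses-permission> elements list the permissions the app requests.
    requested = sorted(
        {e.get(NAME_ATTR) for e in root.iter("uses-permission") if e.get(NAME_ATTR)}
    )
    # <permission> elements declare permissions defined by the app itself.
    declared = sorted(
        {e.get(NAME_ATTR) for e in root.iter("permission") if e.get(NAME_ATTR)}
    )
    return requested, declared


requested, declared = extract_permissions(SAMPLE_MANIFEST)
# Assumed heuristic: names outside the platform prefix count as user-defined.
custom = [p for p in requested if not p.startswith("android.permission.")]
print(requested, declared, custom)
```

Each distinct permission name found this way can then serve as one binary feature of the application.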
Feature selection is a crucial step in data processing. It chooses the best feature subset from
the whole feature set based on the correlation between features and classes [28]. The selected
subset should contain the fewest features that still have a great impact on the performance of
malware detection. Experimental results indicate that feature selection is useful for Android
malware detection [5]. The IG algorithm, which is based on the entropy difference, is widely
used for feature selection [29]. The authors of [30] collected 2285 Android applications and
extracted more than 9898 features. Then the Chi-square (CHI), Fisher Score (FS), and IG methods
were used to choose the top 50, 100, 200, 300, 500 and 800 features. Cen et al. [5] used IG and
CHI for feature selection. They listed the top 20 functions selected by IG and plotted a curve
showing the performance for different ratios of features selected by IG and CHI. Experimental
results showed that IG and CHI were useful methods for selecting the best feature subset.
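To make the entropy difference behind IG concrete, the sketch below scores each binary feature f by IG(f) = H(C) - sum_v P(f = v) H(C | f = v) and keeps the k highest-scoring features. It is a minimal illustration under the standard definition of information gain, not the implementation used in the cited works. The function names and the toy matrix are assumptions made for this example.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    gain = entropy(labels)
    for value in np.unique(feature):
        mask = feature == value
        gain -= mask.mean() * entropy(labels[mask])
    return gain


def top_k_by_ig(X, y, k):
    """Return the indices of the k features with the largest IG, and all scores."""
    scores = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-scores, kind="stable")
    return order[:k], scores


# Toy data (hypothetical): 4 applications, 3 binary features, label 1 = malicious.
X = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 0, 0]])
y = np.array([1, 1, 0, 0])

selected, scores = top_k_by_ig(X, y, k=2)
print(selected, scores)
```

In this toy matrix the first feature coincides with the class label and therefore receives the maximum score of one bit, while the second feature is independent of the label and scores zero.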
In our work, we mainly focus on feature selection methods, including IG, PSO and
norm-based regularization. They have been successfully applied to a large number of
applications and difficult optimization problems [31-33]. In [31], PSO was applied to find the
optimal feature subset: the particle swarm found the best feature combinations as the particles
flew through the subset space. In [32], PSO was used for multi-objective feature selection, whose
goals were to maximize the classification performance and to minimize the number of features.
In [33], a robust loss function called the Brownboost loss was introduced, which computed the
feature quality and selected the optimal feature subset to enhance robustness. In [8], norm
regularization was applied to the
projection matrix to achieve row-sparsity, which allowed the relevant features to be selected
and the transformation to be learned simultaneously. In short, feature selection is a meaningful
data processing technique that can minimize the classification error rate with the fewest features.
We therefore use feature selection to mine the joint features and maximize the classification
performance.
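The idea of particles moving through the space of feature subsets can be sketched with a binary PSO, in which each particle is a 0/1 mask over the features and a sigmoid of the velocity gives the probability of selecting each feature. The sketch below is only an illustration and does not reproduce the algorithms of [31] or [32]. The fitness function (nearest-centroid training accuracy minus a small penalty per selected feature), the hyperparameters, and the toy data are all assumptions made for this example.

```python
import numpy as np


def fitness(mask, X, y, alpha=0.01):
    """Assumed fitness: nearest-centroid training accuracy minus a size penalty."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    c0 = Xs[y == 0].mean(axis=0)  # centroid of the benign class
    c1 = Xs[y == 1].mean(axis=0)  # centroid of the malicious class
    d0 = np.linalg.norm(Xs - c0, axis=1)
    d1 = np.linalg.norm(Xs - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == y).mean()) - alpha * int(mask.sum())


def binary_pso(X, y, n_particles=10, n_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a feature mask with high fitness; returns (mask, fitness)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pos = (rng.random((n_particles, n_features)) < 0.5).astype(int)
    vel = rng.uniform(-1.0, 1.0, (n_particles, n_features))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p.astype(bool), X, y) for p in pos])
    g = int(pbest_fit.argmax())
    gbest, gbest_fit = pbest[g].copy(), float(pbest_fit[g])

    for _ in range(n_iter):
        r1 = rng.random(vel.shape)
        r2 = rng.random(vel.shape)
        # Velocity is pulled toward each particle's best and the swarm's best.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -6.0, 6.0)
        # Sigmoid transfer: velocity becomes the probability of selecting a feature.
        prob = 1.0 / (1.0 + np.exp(-vel))
        pos = (rng.random(vel.shape) < prob).astype(int)

        fit = np.array([fitness(p.astype(bool), X, y) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = int(pbest_fit.argmax())
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), float(pbest_fit[g])

    return gbest.astype(bool), gbest_fit


# Toy data (hypothetical): 8 applications, 6 binary features; feature 0 equals the label.
data_rng = np.random.default_rng(1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = data_rng.integers(0, 2, size=(8, 6))
X[:, 0] = y

best_mask, best_fit = binary_pso(X, y)
print(best_mask, best_fit)
```

The penalty term plays the role of the second objective described above: among masks with equal accuracy, the one with fewer selected features receives the higher fitness.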
3. Methodology
The structure and process of Android malware detection based on feature selection to mine
joint features are depicted in Fig. 1. The framework consists of four major parts. The first is
reverse engineering, which decompiles APK files into readable source code files, including
AndroidManifest.xml and .smali files. The second part is feature extraction, which extracts
features from the source code files. Each application is then represented as a single binary
instance with permission and API features. The class label indicates whether the application is