Hybrid Concentration Based Feature Extraction Approach for Malware
Detection
Pengtao Zhang and Ying Tan, Senior Member, IEEE
Abstract— In this paper, a hybrid concentration based feature
extraction (HCFE) approach is proposed. The HCFE approach
extracts the hybrid concentration (HC) of a sample in both
the global resolution and the local resolution. The HC of a
sample characterizes the sample more precisely and completely
by taking the global information and local information into
account at the same time. By combining the
global and local information, the HC avoids the bias of
the global concentration (GC) toward the global information and of the
local concentration (LC) toward the local information.
In order to incorporate the HCFE approach into the procedure
of malware detection, an HC-based malware detection (HCMD)
method is proposed. Eight groups of experiments on three public
malware datasets are conducted to evaluate the effectiveness
of the HCMD method using cross validation. Comprehensive
experimental results suggest that the HC of a sample extracted
by the HCFE approach characterizes the sample more precisely
and completely than the GC and LC. The proposed HCMD
method outperforms the GC-based and the LC-based malware
detection methods in all the experiments by about 1.05% and
0.28% on average, respectively.
I. INTRODUCTION
Malware is a general term for malicious code, that is, any
program designed to harm or secretly access a computer
system without the owner’s informed consent [1]. According
to its method of operation, malware can
be roughly divided into several categories, such as
computer viruses, Trojan horses and worms. Some adware is
also regarded as malware. Malware costs hundreds of
millions of dollars every year worldwide, and it has
become one of the most serious threats to computer
security [2].
To address the problem of malware detection, a variety
of malware detection methods have been proposed, while
various commercial anti-malware products are available in
the market. These anti-malware solutions can be classified
into two categories: static methods and dynamic methods.
The static methods attempt to detect malware without actu-
ally running any code. They are mainly based on machine
learning and data mining methods, and heuristic theories
(such as artificial immune theory [3][4]). The static methods
usually work on the binary string or application programming
interface (API) calls of a program, so they are portable
and can be deployed on personal computers. The dynamic
methods, such as behavior blockers and virtual machines, keep
watch over the execution of every program during run-time,
observe its behavior, and stop it once it tries to harm the
system. The dynamic methods impose a heavy extra load.
Hence they are usually used to analyze malware in
computer security firms rather than to detect malware on
personal computers.
Y. Tan is the corresponding author with the Department of Machine
Intelligence, School of Electronics Engineering and Computer Science,
Peking University, Beijing, 100871, China. E-mail: ytan@pku.edu.cn.
P.T. Zhang is a PhD candidate with the Department of Machine Intelligence,
School of Electronics Engineering and Computer Science, Peking
University, Beijing, 100871, China. E-mail: pengtaozhang@gmail.com.
Inspired by the human immune system, the immune concentration
has been proposed as an effective feature [5].
There are two concentration based features so far: the
global concentration (GC) and the local concentration (LC).
The GC was first proposed for spam detection [5][6] and
later applied to malware detection [7]. Although the GC-based
methods perform very well on both problems, the GC
contains only the global information of a sample, extracted
in the global resolution. This design biases the GC toward the
global information, ignoring the local information, and carries a high
diluent risk. To overcome the diluent risk of the GC, the LC
was proposed [8][9]. The LC zooms out the concentration
information and stores the position-correlated information
implicitly by defining a local area. However, the LC ignores
the global information and characterizes a sample only
from the perspective of a local resolution, which biases it
toward the local information. Furthermore, the stability of the
position-correlated information is questionable.
How to design and extract a discriminating immune concentration
based feature that avoids the bias of the GC toward the global
information and of the LC toward the local information
is therefore a worthwhile problem.
In this paper, a hybrid concentration based feature extraction
(HCFE) approach is proposed by taking inspiration from
the GC and LC. The HCFE approach extracts the hybrid
concentration (HC) of a sample in both the global resolution
and the local resolution. The HC of a sample characterizes
the sample more precisely and completely by taking the
global and local information into account at the same time. It
avoids the bias of the GC toward the global information and
of the LC toward the local information. In order to incorporate the
HCFE approach into the procedure of malware detection, an
HC-based malware detection (HCMD) method is proposed.
Extensive experimental results demonstrate that the proposed
HCMD method is effective at detecting unseen malware.
It outperforms the GC-based and LC-based malware detection
methods in the eight groups of experiments on the three
malware datasets by about 1.08% and 0.28% on average,
respectively.
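To make the distinction between the three features concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a concentration is the fraction of a sample's distinct byte n-grams that match a detector set, that the GC is computed over the whole sample, that the LC is computed over equally sized consecutive local areas, and that the HC simply concatenates the two. The function names, the detector sets `benign` and `malicious`, the n-gram length `n`, and the number of local areas `areas` are all hypothetical choices made for illustration; the formal definitions are those given in Section III.

```python
from typing import List, Set


def ngrams(data: bytes, n: int) -> List[bytes]:
    """Return all overlapping n-byte fragments of data, in order."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def concentration(fragments: List[bytes], detectors: Set[bytes]) -> float:
    """Fraction of distinct fragments that match the detector set (assumed definition)."""
    unique = set(fragments)
    if not unique:
        return 0.0
    return len(unique & detectors) / len(unique)


def global_concentration(sample: bytes, benign: Set[bytes],
                         malicious: Set[bytes], n: int = 4) -> List[float]:
    """Two-element feature computed over the whole sample (global resolution)."""
    frags = ngrams(sample, n)
    return [concentration(frags, benign), concentration(frags, malicious)]


def local_concentration(sample: bytes, benign: Set[bytes],
                        malicious: Set[bytes], n: int = 4,
                        areas: int = 3) -> List[float]:
    """Two features per local area, with areas taken as consecutive fragment windows."""
    frags = ngrams(sample, n)
    size = -(-len(frags) // areas)  # ceiling division; 0 if there are no fragments
    features: List[float] = []
    for k in range(areas):
        window = frags[k * size:(k + 1) * size]
        features.append(concentration(window, benign))
        features.append(concentration(window, malicious))
    return features


def hybrid_concentration(sample: bytes, benign: Set[bytes],
                         malicious: Set[bytes], n: int = 4,
                         areas: int = 3) -> List[float]:
    """Concatenate the global and local features (assumed form of the HC)."""
    return (global_concentration(sample, benign, malicious, n)
            + local_concentration(sample, benign, malicious, n, areas))


if __name__ == "__main__":
    # Toy sample and toy detector sets, for illustration only.
    hc = hybrid_concentration(b"AAAABBBBCCCC", {b"AAAA"}, {b"BBBB", b"CCCC"})
    print(hc)
```

In this toy setting the global part summarizes the whole sample in two numbers, while the local part shows in which area the matching fragments occur; the concatenated vector carries both kinds of information.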
The rest of the paper is organized as follows. In Section
II, we introduce the related work. In Section III, we give the
definition of the HC and describe the HCFE approach in de-
978-1-4799-5829-0/15/$31.00 ©2015 IEEE
Proceeding of the IEEE 28th
Canadian Conference on Electrical and Computer Engineering
Halifax, Canada, May 3-6, 2015