Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
A cost-sensitive rotation forest algorithm for gene expression data classification
Huijuan Lu a, Lei Yang a, Ke Yan a,⁎, Yu Xue b, Zhigang Gao c
a College of Information Engineering, China Jiliang University, 258 Xueyuan Street, Hangzhou 310018, China
b Nanjing University of Information Science & Technology, Nanjing 210044, China
c College of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
ARTICLE INFO
Keywords:
Gene expression data
Rotation forest
Cost-sensitive
Misclassification cost
Rejection cost
Test cost
ABSTRACT
Existing works show that the rotation forest algorithm has competitive performance in terms of classification
accuracy for gene expression data. However, most existing works only focus on the classification accuracy and
neglect the classification costs. In this study, we propose a cost-sensitive rotation forest algorithm for gene
expression data classification. Three classification costs, namely misclassification cost, test cost and rejection
cost, are embedded into the rotation forest algorithm. We call this extension the cost-sensitive rotation
forest algorithm. Experimental results show that the cost-sensitive rotation forest algorithm effectively
reduces the classification cost and makes the classification result more reliable.
1. Introduction
An increasingly polluted environment has made cancer the most common
fatal disease worldwide in the current century [1]. The rapid development
of Internet and database management technologies makes automated
cancer diagnosis possible [2]. In bioinformatics, various data mining and
machine learning techniques have been proposed to assist cancer
diagnosis at the molecular level [3–7].
With the advent of the DNA microarray, biologists believe that the
classification of gene expression data provides important information
for cancer diagnosis [8–13]. Current research on gene expression data
classification focuses on classification accuracy, generalization
ability, the complexity and understandability of the algorithm, the
stability of the classifiers, etc. However, it is usually difficult for
traditional classifiers to achieve high and stable classification results
because of three difficulties of gene expression data classification,
namely high dimensionality, imbalanced noisy data and small sample size
[14]. Besides classification accuracy, it is also desirable to consider
classification cost in gene expression data classification problems [15].
In this work, we take the classification cost into consideration and
introduce a series of cost-sensitive learning algorithms to overcome the
difficulties of gene expression data classification.
Machine learning techniques, such as neural networks [16–18] and
support vector machines [19], are widely used in gene expression data
classification, medical diagnostic analysis, industrial data analysis,
etc., because of their high classification accuracy. The decision tree
(DT) is a conventional machine learning model that has a tree structure
and can be rewritten as a set of ‘if-else’ rules. Each branch of a DT
represents a class of samples with common characteristics in the feature
space. There are many variants of the DT algorithm, such as EG2, ID3,
C4.5 and CART [20].
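As a minimal illustration of the remark that a DT can be rewritten as a set of ‘if-else’ rules, the sketch below hand-codes a two-level tree. The feature names (`gene_a`, `gene_b`), the thresholds and the class labels are hypothetical and chosen only for illustration; they are not taken from any dataset or from the algorithms cited above.

```python
def classify(sample):
    """Apply a hand-written two-level decision tree to one sample.

    `sample` maps hypothetical feature names to expression levels.
    Each root-to-leaf path corresponds to one if-else rule.
    """
    if sample["gene_a"] > 2.5:          # hypothetical root split
        if sample["gene_b"] > 1.0:      # hypothetical second-level split
            return "tumor"
        return "normal"
    return "normal"


# Each call follows exactly one branch of the tree.
label_high = classify({"gene_a": 3.0, "gene_b": 1.5})
label_low = classify({"gene_a": 1.0, "gene_b": 1.5})
```

Each branch groups the samples that satisfy the same sequence of threshold tests, which is the sense in which a branch represents samples with common characteristics in the feature space.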
Random forest (RF) [21] is an ensemble classifier that consists of
multiple DTs built through a random splitting of the feature space. The
rotation forest (RoF), which also assembles multiple DTs, was developed
based on RF [22]. It segments the feature space into subspaces, extracts
the most important features from each subset and repeats the process to
obtain the most distinguishable training dataset and base classifiers
for different feature subspaces. Recently, Hosseinzadeh and Eftekhari
[23] proposed a high-performance RoF for imbalanced data classification
by adding the fuzzy C-means clustering method to the RoF classification
process. Fang et al. [24] utilized two improved RoF algorithms to
classify highly imbalanced data. Their experiments showed satisfactory
results on AUC, the most widely used evaluation criterion for imbalanced
data.
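The feature-space segmentation and per-subset feature extraction that RoF performs can be sketched as follows. This is a rough illustration rather than the implementation of [22] or of the method proposed here: it assumes principal component analysis (PCA) as the extraction step and a bootstrap sample per subset, neither of which is stated in the passage above, and the function name `rotation_matrix` and its parameters are hypothetical. It requires NumPy.

```python
import numpy as np


def rotation_matrix(X, n_subsets, rng):
    """Build one rotation matrix from a random partition of the features."""
    n_samples, n_features = X.shape
    # Randomly partition the feature indices into disjoint subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for idx in subsets:
        # Bootstrap the rows so that repeated calls give different rotations.
        rows = rng.choice(n_samples, size=n_samples, replace=True)
        block = X[rows][:, idx]
        block = block - block.mean(axis=0)
        # PCA (assumed here): eigenvectors of the subset covariance matrix.
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        _, vecs = np.linalg.eigh(cov)
        # Write the principal axes into the entries owned by this subset.
        R[np.ix_(idx, idx)] = vecs
    return R


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))      # synthetic data: 20 samples, 6 features
R = rotation_matrix(X, 3, rng)    # three feature subsets of two features each
X_rot = X @ R                     # rotated data for one base classifier
```

In a full ensemble, one such matrix would be drawn per base classifier and a DT trained on the corresponding rotated data; the tree training step is omitted here to keep the sketch short.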
Classification costs mainly consist of three types, namely
misclassification cost, test cost and rejection cost [25]. Hybrid learning
methods extend conventional classifiers by embedding new factors
into the algorithms [26,27]. By embedding the classification cost into
traditional classifiers, many cost-sensitive classification models have
been proposed [28]. Zidelmal et al. [29] embedded classification cost into
the support vector machine (SVM) to classify ECG beats and
achieved an average accuracy of 97.2% with no rejection and 98.8% for
http://dx.doi.org/10.1016/j.neucom.2016.09.077
Received 16 May 2016; Received in revised form 25 July 2016; Accepted 5 September 2016
⁎ Corresponding author.
E-mail address: yanke@cjlu.edu.cn (K. Yan).
Neurocomputing 228 (2017) 270–276
Available online 02 November 2016
0925-2312/ © 2016 Elsevier B.V. All rights reserved.