CUSBoost: Cluster-based Under-sampling with
Boosting for Imbalanced Classification
Farshid Rayhan, Sajid Ahmed, Asif Mahbub, Md. Rafsan Jani,
Swakkhar Shatabda, and Dewan Md. Farid
Department of Computer Science & Engineering, United International University, Bangladesh
Email: dewanfarid@cse.uiu.ac.bd
Abstract—Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise the classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, in real-life applications the minority class instances often represent the concept of greater interest. Recently, several techniques based on sampling methods (under-sampling of the majority class and over-sampling of the minority class), cost-sensitive learning methods, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach combined with the boosting (AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of the CUSBoost algorithm against state-of-the-art ensemble methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets.
Keywords—Boosting; Class imbalance; Clustering; Ensemble classifier; Sampling; RUSBoost
I. INTRODUCTION
In machine learning (ML) for data mining (DM) applications, supervised learning (or classification) is the process of identifying new/unknown instances using classifiers (or classification algorithms) built from a group of instances with known class membership (training data) [1], [2], [3], [4]. Real-world datasets are often multi-class, high-dimensional and class-imbalanced, which degrades the classification accuracy of many ML algorithms. Therefore, a number of ensemble classifiers with sampling techniques have been proposed in the last decade for classifying binary-class, low-dimensional imbalanced data [5], [6], [7]. Ensemble classifiers use multiple ML algorithms to improve on the performance of individual classifiers, combining multiple hypotheses to form a more advanced hypothesis [3]. The sampling methods use under-sampling (removing majority class instances) and over-sampling (adding minority class instances) techniques to alter the original class distribution of imbalanced data. Under-sampling methods that randomly sample the majority class might suffer from the loss of potentially useful training instances. On the other hand, over-sampling with replacement does not significantly improve minority class recognition and increases the likelihood of overfitting [8]. The sketch below illustrates these two basic strategies.
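As a minimal, illustrative sketch (not the paper's method), the following Python code shows both strategies side by side; the function names and the NumPy-based implementation are our own assumptions.

# Minimal sketch of the two basic sampling strategies discussed above.
import numpy as np

def random_under_sample(X_maj, X_min, seed=0):
    # Keep only as many majority instances as there are minority ones;
    # the discarded rows are the potential information loss.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    return X_maj[idx]

def random_over_sample(X_maj, X_min, seed=0):
    # Duplicate minority instances (sampling with replacement) until the
    # classes balance; exact copies are what raise the overfitting risk.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_min), size=len(X_maj), replace=True)
    return X_min[idx]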
In real-world class imbalanced datasets, the minority class instances are outnumbered by the majority class instances. However, the minority class instances often represent the concept of greater interest [9]. Traditional ML for DM algorithms, such as the decision tree (DT) [1], [3], the naïve Bayes (NB) classifier [2], and k-nearest neighbors (kNN) [1], build classification models that maximise the overall classification rate but ignore the minority class. The most widely adopted methods for dealing with class imbalance problems are sampling techniques, ensemble methods, and cost-sensitive learning methods. The sampling techniques (under-sampling and over-sampling) either remove majority class instances from the imbalanced data or add minority class instances to it to obtain balanced data. Ensemble methods such as Bagging and Boosting are also widely used for classifying imbalanced data; typically, these methods apply a sampling technique in each iteration. Cost-sensitive learning is also applied to class imbalance problems: it assigns different misclassification costs to different classes, usually a high cost to the minority class and a low cost to the majority class. However, the classification results of cost-sensitive learning methods are not stable, as it is difficult to obtain accurate misclassification costs, and different misclassification costs may lead to different induced models. A minimal illustration of cost-sensitive learning via class weights follows.
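The sketch below assigns a higher weight to the minority class using scikit-learn; the 10:1 cost ratio and the synthetic dataset are illustrative assumptions, not values from the paper.

# Cost-sensitive learning via class weights (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 95% majority (0), 5% minority (1).
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Penalise minority-class errors 10 times more than majority-class
# errors; in practice the true misclassification costs are unknown,
# which is the source of the instability discussed above.
clf = DecisionTreeClassifier(class_weight={0: 1, 1: 10}, random_state=0)
clf.fit(X, y)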
The methods for dealing with class imbalance problems are divided into two categories: (a) external methods and (b) internal methods. External methods, also known as data balancing methods, preprocess the imbalanced data to obtain balanced data. Internal methods modify existing learning algorithms to reduce their sensitivity to class imbalance when learning from imbalanced data. In this paper, we present a new clustering-based under-sampling approach with boosting (AdaBoost), called the CUSBoost algorithm. We divide the imbalanced dataset into two parts: majority class instances and minority class instances. Then, we cluster the majority class instances into several clusters using the k-means clustering algorithm and select majority class instances from each cluster to form a balanced dataset, in which the numbers of majority and minority class instances are almost equal. Clustering groups the majority class instances in such a way that instances in the same cluster are more similar to each other than to those in other clusters. So, instead of randomly removing majority class instances, we use this clustering technique to select the majority class instances to retain; a sketch of this sampling step is given below.
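The following sketch illustrates the clustering-based under-sampling step, assuming scikit-learn's KMeans; the function name, the number of clusters k, and the proportional per-cluster selection are our illustrative assumptions rather than details fixed by the paper.

# Illustrative sketch of clustering-based under-sampling of the
# majority class (assumptions noted in the text above).
import numpy as np
from sklearn.cluster import KMeans

def cluster_under_sample(X_maj, n_keep, k=5, seed=0):
    # Group the majority class into k clusters so that similar
    # instances fall into the same cluster.
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_maj)
    rng = np.random.default_rng(seed)
    picked = []
    for c in range(k):
        members = np.flatnonzero(labels == c)
        # Draw from every cluster in proportion to its size, so all
        # regions of the majority class remain represented.
        n_c = max(1, round(len(members) * n_keep / len(X_maj)))
        picked.extend(rng.choice(members, size=min(n_c, len(members)),
                                 replace=False))
    return X_maj[np.array(picked)]

# Usage: keep about as many majority instances as there are minority
# instances, then train on the (almost) balanced set:
# X_bal = np.vstack([cluster_under_sample(X_maj, len(X_min)), X_min])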
CUSBoost combines sampling and boosting to form an efficient and effective algorithm for class imbalance learning. We tested the performance of the CUSBoost algorithm