
D.S. Huang, X.-P. Zhang, G.-B. Huang (Eds.): ICIC 2005, Part I, LNCS 3644, pp. 878–887, 2005.
© Springer-Verlag Berlin Heidelberg 2005
Borderline-SMOTE: A New Over-Sampling Method in
Imbalanced Data Sets Learning
Hui Han¹, Wen-Yuan Wang¹, and Bing-Huan Mao²
¹ Department of Automation, Tsinghua University, Beijing 100084, P. R. China
hanh01@mails.tsinghua.edu.cn
wwy-dau@mail.tsinghua.edu.cn
² Department of Statistics, Central University of Finance and Economics, Beijing 100081, P. R. China
maobinghuan@yahoo.com
Abstract. In recent years, mining imbalanced data sets has received increasing
attention in both theory and practice. This paper describes the importance of
imbalanced data sets and their broad application domains in data mining, and
then summarizes the evaluation metrics and the existing methods used to assess
and address the imbalance problem. The synthetic minority over-sampling
technique (SMOTE) is one of the over-sampling methods that address this
problem. Based on SMOTE, this paper presents two new minority over-sampling
methods, borderline-SMOTE1 and borderline-SMOTE2, in which only the minority
examples near the borderline are over-sampled. For the minority class,
experiments show that our approaches achieve a better TP rate and F-value than
SMOTE and random over-sampling.
1 Introduction
A data set may exhibit two kinds of imbalance. One is between-class imbalance,
in which some classes have many more examples than others [1]. The other is
within-class imbalance, in which some subsets of one class have far fewer
examples than other subsets of the same class [2]. By convention, in imbalanced
data sets, we call the classes having more examples the majority classes and
the ones having fewer examples the minority classes.
The problem of imbalance has received growing attention in recent years.
Imbalanced data sets exist in many real-world domains, such as spotting
unreliable telecommunication customers [3], detection of oil spills in
satellite radar images [4], learning word pronunciations [5], text
classification [6], detection of fraudulent telephone calls [7], information
retrieval and filtering tasks [8], and so on. In these domains, what we are
really interested in is the minority class rather than the majority class.
Thus, we need fairly high prediction accuracy on the minority class. However,
traditional data mining algorithms perform poorly on imbalanced data sets,
because the distribution of the data is not taken into consideration when these
algorithms are designed.
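To make the last point concrete, the following minimal sketch compares overall accuracy with the TP rate on the minority class for a trivial classifier that always predicts the majority class. The data (990 majority and 10 minority examples), the labels 0 and 1, and the helper name `accuracy_and_tp_rate` are illustrative assumptions, not taken from the experiments in this paper.

```python
from collections import Counter


def accuracy_and_tp_rate(y_true, y_pred, minority_label):
    """Return (overall accuracy, TP rate on the minority class).

    TP rate = correctly predicted minority examples / all minority examples.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)

    minority_total = sum(t == minority_label for t in y_true)
    true_positives = sum(
        t == minority_label and p == minority_label
        for t, p in zip(y_true, y_pred)
    )
    tp_rate = true_positives / minority_total
    return accuracy, tp_rate


# Hypothetical imbalanced data set: label 0 is the majority class,
# label 1 is the minority class (assumed 99:1 ratio for illustration).
y_true = [0] * 990 + [1] * 10

# A degenerate classifier that ignores its input and always predicts
# the most frequent class seen in the data.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

accuracy, tp_rate = accuracy_and_tp_rate(y_true, y_pred, minority_label=1)
print(f"accuracy = {accuracy:.2f}, minority TP rate = {tp_rate:.2f}")
```

Under these assumptions the classifier reaches 99% overall accuracy while detecting none of the minority examples, which is why measures focused on the minority class, such as TP rate, are more informative than accuracy here.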
The rest of this paper is organized as follows. Section 2 gives a brief
introduction to recent developments in the domain of imbalanced data sets.
Section 3